In my previous post, I mentioned that escape analysis is available in Mustang releases and that debug flags for it are available in the HotSpot(tm) VM.
I ran some microbenchmarks yesterday; they are fairly trivial, but good enough to identify the performance gains. Here is what I found:
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;

public class TestEA {

    public static class MyPrintStream extends PrintStream {
        public MyPrintStream(OutputStream out) {
            super(out);
        }

        @Override
        public void println(int cnt) {
            /* empty on purpose: a deep-inline candidate */
        }
    }

    private static final int COUNT = 10000000;

    public static void main(String[] args) throws Exception {
        // Swallow output from consume() so I/O does not dominate the timing.
        System.setOut(new MyPrintStream(new ByteArrayOutputStream()));
        for (int i = 0; i <= 10; i++)
            test();
    }

    private static void test() {
        int cnt = 0;
        Object objLock = new Object(); // local lock object: never escapes test()
        long start = System.currentTimeMillis();
        for (int i = 0; i < COUNT; i++) {
            synchronized (objLock) {
                cnt++;
            }
            consume(cnt);
        }
        long end = System.currentTimeMillis();
        // System.out was replaced above, so report timings on System.err.
        System.err.println(cnt + ", time=" + (end - start) / 1000.0);
    }

    private static void consume(int cnt) {
        System.out.println(cnt);
    }
}
The results:
VMArgs: -server
10000000, time=0.907
10000000, time=0.938
10000000, time=0.969
10000000, time=0.906
10000000, time=0.906
With escape analysis enabled:
VMArgs: -server -XX:+DoEscapeAnalysis
10000000, time=0.89
10000000, time=0.922
10000000, time=0.016
10000000, time=0.016
10000000, time=0.015
10000000, time=0.016
JVMOut:
28 JavaObject NoEscape [[ 54F]] 28 Allocate ....
40 LocalVar NoEscape [[ 28P]] 40 Proj ....
90 LocalVar NoEscape [[ 28P]] 90 Phi ....
213 JavaObject NoEscape [[]] 213 Allocate ....
225 LocalVar NoEscape [[ 213P]] 225 Proj ....
======== Connection graph for TestEscapeAnalysis::test ....
172 JavaObject NoEscape [[]] 172 Allocate ....
184 LocalVar NoEscape [[ 172P]] 184 Proj ....
I ran the loop in main several times because the server VM needs warm-up before it compiles and deeply inlines the code; it might also optimize the execution aggressively, since cnt is not significantly used anywhere.
Well, it can be seen that escape analysis successfully eliminated the synchronization on objLock (lock elision). As I posted earlier, synchronization has a significant impact on execution speed, and eliminating this heavy operation improved the speed substantially. Consider the effect of this on a highly concurrent web server (JAWS) handling hundreds of simultaneous requests. Of course, it happened because objLock is allocated locally: the VM identified that the object can't be shared between multiple threads, so it is safe to remove the synchronization overhead.
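To make the distinction concrete, here is a minimal sketch (my own example, not from the benchmark above) contrasting the two lock shapes: a lock object that never escapes its method, which the VM may elide, and a shared lock that must be kept.

```java
// Sketch: two lock shapes escape analysis treats differently.
// Names (EscapeDemo, elidable, notElidable) are mine, for illustration.
public class EscapeDemo {
    static final Object shared = new Object(); // escapes: reachable by any thread

    static int elidable() {
        Object local = new Object(); // never leaves this method: NoEscape
        int cnt = 0;
        synchronized (local) {       // candidate for lock elision
            cnt++;
        }
        return cnt;
    }

    static int notElidable() {
        int cnt = 0;
        synchronized (shared) {      // shared lock: synchronization must stay
            cnt++;
        }
        return cnt;
    }

    public static void main(String[] args) {
        System.out.println(elidable() + notElidable()); // prints 2
    }
}
```

Both methods compute the same result; the difference is only visible to the JIT, which can prove `local` is confined to one thread but can prove nothing of the sort about `shared`.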
I also tried making objLock a static field of the class and found that escape analysis has no positive impact on execution speed, as can be seen here:
VMArgs: -server -XX:+DoEscapeAnalysis (objLock as static field in above program)
10000000, time=0.922
10000000, time=0.922
10000000, time=0.906
10000000, time=0.891
10000000, time=0.921
As expected, since objLock is now a shared field, the VM has no way of proving that the lock on it is thread-local, so the synchronization cannot be eliminated.
Unfortunately, stack allocation seems to be absent in current builds of Mustang (or there were no hints in the debug output about when it was done; also, the VM output is hard to decipher, and I found no explanation for it). Here is an excellent presentation on the new optimizations in the HotSpot(tm) VM, and for the lock-elimination enhancements, refer to this.
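For what it's worth, the `NoEscape ... Allocate` lines in the connection-graph output above mark exactly the kind of allocation that stack allocation (or scalar replacement) would target. A hypothetical example of my own:

```java
// Sketch: an allocation scalar replacement / stack allocation could remove.
// This example and its names (ScalarDemo, Point, sum) are mine, not from the post.
public class ScalarDemo {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            Point p = new Point(i, i + 1); // never escapes sum(): NoEscape
            total += p.x + p.y;            // fields could become plain locals,
                                           // making the heap allocation unnecessary
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(3)); // 1 + 3 + 5 = 9
    }
}
```

Since `p` is never stored anywhere and no reference to it survives the loop body, a VM that performs scalar replacement could compute with `i` and `i + 1` directly and never allocate `Point` at all; whether Mustang actually does this is exactly what I could not confirm from the debug output.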
So, there you go: one more optimization for the managed runtime. For those who are still under the illusion that natively compiled, statically optimized programs are the fastest, think again. Your programs are static; at runtime they can't reorganize themselves. A managed runtime is metamorphic: it can adapt and substantially optimize itself at runtime. Man, that's what I call programming.