In the previous post we saw how JIT inlining works. We also saw how the JVM performs on-stack replacement (OSR) to swap the interpreted version of a method with the compiled version on the fly. In this post we’ll dig even deeper and look at the assembly code that is generated when a method gets compiled.
Prerequisites
The flag which lets us see the generated assembly code is -XX:+PrintAssembly. However, viewing assembly does not work out of the box: you need a disassembler on your path. You’ll have to get hsdis (the HotSpot Disassembler) and build it for your system. There’s a prebuilt version available for Mac, and that’s the one I am going to use.
```
git clone https://github.com/liuzhengyang/hsdis.git
```
Once we have that, we’ll add it to LD_LIBRARY_PATH:

```
export LD_LIBRARY_PATH=./hsdis/build/macosx-amd64
```
Now we’re all set to see how the JVM generates assembly code.
Printing assembly code
We’ll reuse the same inlining code from last time.
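It’s a class with three static methods forming the call chain inline1 ⟶ inline2 ⟶ inline3, where inline3 returns 4 and main invokes inline1 in a loop. The listing below is my reconstruction along those lines, so treat the exact bodies as a sketch:

```java
public class Inline {
    public static void main(String[] args) {
        int iterations = Integer.parseInt(args[0]);
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += inline1();
        }
        System.out.println(sum);
    }

    // The call chain the JIT will collapse: inline1 -> inline2 -> inline3.
    public static int inline1() {
        return inline2();
    }

    public static int inline2() {
        return inline3();
    }

    public static int inline3() {
        return 4;
    }
}
```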
-XX:+PrintAssembly is a diagnostic flag, so we’ll need to unlock the JVM’s diagnostic options first. Here’s how:

```
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Inline 100000
```
This will generate a lot of assembly code. We will, however, look only at the assembly generated for inline1.

```
Decoding compiled method 0x000000010d6095d0:
```
So this is the assembly code we get when we run the program. It’s a lot to grok in one go, so let’s break it down.
Lines #7 and #8 are self-explanatory; they show which method we’re looking at. Lines #9 and #10 (and #13 to #17) are for thread synchronization. The JVM can get rid of thread synchronization if it sees that there is no need for it (lock elision), but since we are using static methods here, it needs to add code for synchronization; it doesn’t know that we have only one thread running.
Our actual program is on line #11, where we move the value 4 into the %eax register. By convention, this is the register which holds the return value of our methods. This shows that the JVM has optimized our code: our call chain was inline1 ⟶ inline2 ⟶ inline3, and it was inline3 which returned 4. The JVM is smart enough to see that these method calls are superfluous and decided to get rid of them. Very nifty!
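In effect, after inlining and constant folding, the compiled inline1 behaves as if it had been written like this:

```java
public static int inline1() {
    return 4;  // inline2() and inline3() have been folded away
}
```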
Lines #21 to #23 have code to handle exceptions. We know there won’t be any exceptions, but the JVM doesn’t, so it has to be prepared to deal with them.
And finally, there’s code to deoptimize. In addition to static optimizations, the JVM makes some optimizations that are speculative: it generates assembly code expecting things to go a certain way after it has profiled the interpreted code. If the speculation turns out to be wrong, the JVM can go back to running the interpreted version.
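A classic example of such a speculative optimization is devirtualization. If profiling shows that a virtual call has only ever dispatched to one implementation, the JIT can inline that implementation behind a type check; the first time a second implementation shows up, the speculation is invalidated. The class names in this sketch are mine, purely for illustration:

```java
interface Shape {
    double area();
}

class Square implements Shape {
    public double area() { return 4.0; }
}

class Circle implements Shape {
    public double area() { return Math.PI; }
}

public class Deopt {
    static double compute(Shape s) {
        // While only Square has been seen here, the JIT can speculate that
        // s is always a Square and inline Square.area() directly.
        return s.area();
    }

    public static void main(String[] args) {
        Shape square = new Square();
        for (int i = 0; i < 1_000_000; i++) {
            compute(square);
        }
        // The first Circle breaks the speculation: the JVM hits an
        // "uncommon trap", deoptimizes compute(), and falls back to the
        // interpreted version until it is recompiled with the new profile.
        compute(new Circle());
    }
}
```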
Which flags control compilation?
-XX:CompileThreshold is the flag which controls the number of call / branch invocations after which the JVM compiles bytecodes to assembly. You can use -XX:+PrintFlagsFinal to see its value. By default it is 10000.
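For example, you can grep for it in the flag listing (assuming a Unix-like shell):

```
java -XX:+PrintFlagsFinal -version | grep CompileThreshold
```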
Compiling a method to assembly depends on two factors: the number of times that method has been invoked (the method entry counter) and the number of times a loop has been executed (the back-edge counter). Once the sum of the two counters is above CompileThreshold, the method will be compiled to assembly.
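In pseudocode, the hotness check is roughly the following; this is a simplification for illustration, not the JVM’s actual code (which lives in HotSpot’s C++ sources):

```java
// Simplified sketch of the hotness check, not real HotSpot code.
static boolean isHot(int methodEntryCount, int backEdgeCount, int compileThreshold) {
    return methodEntryCount + backEdgeCount > compileThreshold;
}
```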
Maintaining the two counters separately is very useful. If the back-edge counter alone exceeds the threshold, the JVM can compile just the loop (and not the entire method) to assembly. It will perform an OSR and start using the compiled version of the loop while the loop is executing instead of waiting for the next method invocation. When the method is invoked the next time around, it’ll use the compiled version of the code.
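A method with one long-running loop makes this concrete. This contrived sketch is mine, not from the original post:

```java
public class OsrDemo {
    public static void main(String[] args) {
        long sum = 0;
        // main() is invoked only once, so the method entry counter stays at 1.
        // The loop's back-edge counter, however, quickly crosses the threshold,
        // so the JVM compiles the loop and swaps it in mid-execution via OSR
        // rather than waiting for a second invocation of main().
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        System.out.println(sum);
    }
}
```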
So, since compiled code is better than interpreted code, and CompileThreshold controls when a method gets compiled to assembly, reducing the CompileThreshold would mean a lot more of our code ends up as assembly.
There is one advantage to reducing the CompileThreshold: it reduces the time taken for branches / methods to be deemed hot, i.e. it reduces the JVM warmup time.
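For example, to make methods compile after far fewer invocations (using the Inline class from above):

```
java -XX:CompileThreshold=1000 Inline 100000
```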
In older JDKs, there was another reason to reduce CompileThreshold: the method entry and back-edge counters would decay at every safepoint. This meant that some methods would never compile to assembly because their counters kept decaying; these were the “lukewarm” methods that never became hot. With JDK 8+, the counters no longer decay at safepoints, so there won’t be any lukewarm methods.
In addition, JDK 8+ comes with tiered compilation enabled, and CompileThreshold is ignored. The idea of there being a “compile threshold”, though, does not change. I’m deferring the topic of tiered compilation for the sake of simplicity.
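If you do want CompileThreshold to be honored on JDK 8+, you can switch tiered compilation off; this is worth doing only for experiments like the ones in this post:

```
java -XX:-TieredCompilation -XX:CompileThreshold=1000 Inline 100000
```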
Where is the compiled code stored?
The compiled code is stored in the JVM’s code cache. As more methods become hot, the cache starts to fill up. Once the cache is full, the JVM can no longer compile anything to assembly and resorts to purely interpreting the bytecodes. The size of the code cache is platform-dependent.
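You can ask the JVM to print code cache usage when it exits:

```
java -XX:+PrintCodeCache Inline 100000
```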
Also, the JVM ensures that access to the code cache is optimized. The hlt instructions in the assembly code exist to align addresses: it is much more efficient for the CPU to read from even addresses in memory than from odd ones, and the hlt instructions pad the code so that it sits at an even address.
Which flags control code cache size?
There are two flags which are important in setting the code cache size: InitialCodeCacheSize and ReservedCodeCacheSize. The first indicates the code cache size the JVM will start with, and the latter the size to which the code cache can grow. With JDK 8+, ReservedCodeCacheSize is large enough that you don’t need to set it explicitly. On my machine it is 240 MB (5x what it is for Java 7, 48 MB).
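You can check both values on your JVM the same way as before (again assuming a Unix-like shell):

```
java -XX:+PrintFlagsFinal -version | grep CodeCacheSize
```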
Conclusion
The JVM compiles hot code to assembly and stores it at even addresses in its code cache for faster access. Executing assembly code is much more efficient than interpreting the bytecodes. You don’t really need to look at the generated assembly every day, but knowing what is generated as your code executes gives you an insight into what the JVM does to make your code run faster.