Understanding the JVM JIT and helping it along

I must apologize to my readers. I have been remiss in my blogging duties. I will be posting updates on the various events of the past month or so along with updates on JRuby progress and future events very soon. But for now, a technical divergence after a night of hacking.

--

I finally understand what we should be going for in our compiled code, and how we can really kick JRuby into the next level of performance.

The JVM, at least in HotSpot, gets a lot of its performance from its ability to inline code at runtime, and ultimately compile a method plus its inlined calls as a whole down to machine code. The benefit in doing this is the ability to do compiler optimizations across a much larger call path, essentially compiling all the logic for a method and its calls (and possibly their calls, ad infinatum) into a single optimized segment of machine code.

HotSpot is able to do this in a two main ways:
  1. If it's obvious there's only ever one implementation of a given signature on a given type hierarchy
  2. If it can determine at runtime that one (or a few) implementations are the only ones ever being called
The first one allows code to be optimized fairly quickly, because HotSpot can discover early on that there's only one implementation. In general, if there's a single implementation of a given signature, it will get inlined pretty quickly.

The second one is trickier. HotSpot tracks the actual types being called against for the various calls, and eventually can come up with a best guess at the method or methods to inline. It also can include a slow path for the rare future cases where the receiver does not match the target types, and it can deoptimize later to back down optimizations when situations change, such as when a new class is loaded into the system.

So in the end, inlining is one of the most powerful optimizations. Unfortunately in JRuby (and most other dynamic language implementations on the JVM), we're making inlining difficult or impossible in the most performance-sensitive areas. I believe this is a large part of our performance woes.

Consider that all method calls against any object must pass through an implementation of IRubyObject.callMethod. There's not too many callMethod implementations, and actually now there's only one implementation of each specific signature. So callMethod gets inlined pretty fast.

Consider also that almost all method calls within callMethod are to very specific methods and will also be inlined quickly. So callMethod is looking pretty good so far.

Now we look at the last step in callMethod...DynamicMethod.call. DynamicMethod is the top-level type for all our method objects in the system. The call method has numerous implementations, all of them different. And no one implementation stands out as the most frequently called. So we're already complicating matters for HotSpot, even though we know (based on the incoming method name) exactly the piece of code we *want* to call.

Let's continue on, assuming HotSpot is smart enough to work around our half-dozen or so DynamicMethod.call implementations.

DefaultMethod is the DynamicMethod implementation for interpreted Ruby code, so it calls directly into the evaluator. So at that point, DefaultMethod.call will inline the evaluator code and that looks pretty good. But there's also the JIT located in DefaultMethod. It generates a JVM bytecode version of the Ruby code and from then on DefaultMethod calls that. Now that's certainly a good thing on one hand, since we've eliminate the interpreter, but on the other hand we've essentially made it impossible for HotSpot to inline that generated code. Why? Because we generate a Java method for every JITable Ruby method. Hundreds, and eventually thousands of possible implementations. Making a decision to inline any of them into DefaultMethod.call is basically impossible. We've broken the chain.

To make matters worse, we also have the set of Java-wrapping DynamicMethod implementations, *CallbackMethod (used for binding Java code to Ruby method names) and CompiledMethod (used in AOT-compiled code).

The CallbackMethods all wrap another piece of generated code that implements Callback and calls the Java method in question. So we generate nice little wrappers for all the pre-existing methods we want to call, but we also make it impossible for the *CallbackMethod.call implementations to inline any of those calls. Broken chain again.

CompiledMethod is slightly better in this regard, since there's a new CompiledMethod subclass for every AOT-compiled Ruby method, but we still have a single implementaiton of DynamicMethod.call that all of those subclasses share in common. To make matters worse, even if we had separate DynamicMethod.call implementations, that may actually *hurt* our ability to inline code way back in IRubyObject.callMethod, since we've now added N possible DynamicMethod.call implementations to the system. And the chain gets broken even earlier.

So the bottom line here is that in order to continue improving performance, we need to do everything possible to move the call site and the call target closer together. There are a couple standard ways to do it:
  1. Hard-coded special-case code for specific situations, much like YARV does for simple ops (+, -, <, >, etc). In these cases, the compiler would check that the target implements an appropriate type to do a direct call to the operation in question. In Fixnum's case, we'd first confirm it's a RubyFixnum, and then invoke e.g. RubyFixnum.plus directly. That skips all the chain breakage, and allows the compiled code to inline RubyFixnum.plus straight into the call site.
  2. Dynamic generated method adapters that can be swapped out and that learn from previous calls to make direct invocations earlier in the chain. Basically, this would involve preparing call site caches that point at call adapters. Initially, the call adapters would be of some generic type that can use the slow path. But as more and more calls come in, more and more of the call sites would be replaced with specialized implementations that invoke the appropriate target code directly, allowing HotSpot a direct line from call site to call target.
The second version is obviously the ultimate goal, and essentially would mimic what the state-of-the-art JITs do (i.e. this is how HotSpot works under the covers). The first version is easily testable with some simple hackery.

I created a small patch that includes a trivial, unsafe change to the compiler to make Fixnum#+, Fixnum#-, and Fixnum#< direct calls when
possible. They're unsafe because they don't check to see if any of those
operations have been overridden...but of course you'd have to be a mad
fool to override them anyway.

To demonstrate a bit of the potential performance gains, here are some
numbers for JRuby trunk and trunk + patch. Note that Fixnum#+, Fixnum#-, and Fixnum#< are all already STI methods, which does a lot to speed up their invocation (STI uses a table of switch values to bypass dynamic method lookup). But this simple change of compiling direct calls completely blows the STI performance out of the water, and that's without similar direct calls to the fib_ruby method itself.

test/bench/bench_fib_recursive.rb

JRuby trunk without patch:
1.675000 0.000000 1.675000 ( 1.675000)
1.244000 0.000000 1.244000 ( 1.244000)
1.183000 0.000000 1.183000 ( 1.183000)
1.173000 0.000000 1.173000 ( 1.173000)
1.171000 0.000000 1.171000 ( 1.170000)
1.178000 0.000000 1.178000 ( 1.178000)
1.170000 0.000000 1.170000 ( 1.170000)
1.169000 0.000000 1.169000 ( 1.169000)

JRuby trunk with patch:
1.133000 0.000000 1.133000 ( 1.133000)
0.922000 0.000000 0.922000 ( 0.922000)
0.865000 0.000000 0.865000 ( 0.865000)
0.862000 0.000000 0.862000 ( 0.863000)
0.859000 0.000000 0.859000 ( 0.859000)
0.859000 0.000000 0.859000 ( 0.859000)
0.864000 0.000000 0.864000 ( 0.863000)
0.859000 0.000000 0.859000 ( 0.860000)

Ruby 1.8.6:
1.750000 0.010000 1.760000 ( 1.760206)
1.760000 0.000000 1.760000 ( 1.764561)
1.760000 0.000000 1.760000 ( 1.762009)
1.750000 0.010000 1.760000 ( 1.760286)
1.760000 0.000000 1.760000 ( 1.759367)
1.750000 0.000000 1.750000 ( 1.761763)
1.760000 0.010000 1.770000 ( 1.798113)
1.760000 0.000000 1.760000 ( 1.760355)

That's an improvement of over 25%, with about 20 lines of code. It would be even higher with a dynamic adapter for the fib_ruby call. And we can take this further...modify our Java integration code to do direct calls to Java types, modify compiled code to adapt to methods as they are redefined or added to the system, and so on and so forth. There's a ton of potential here.

I will continue working along this path.
Written on July 6, 2007