Finding a JVM compilation strategy for Ruby's dynamic nature

In JRuby, we "decorate" the Java stack with a number of things for Ruby execution purposes. Put simply, we pass a bunch of extra context on the call stack for most method calls. In its fullest form, a method call passes the following along:
  • a ThreadContext object, for accessing JRuby call frames and variable scopes
  • the receiver object
  • the metaclass for the receiver object
  • the name of the method
  • a numeric index for the method, used for a fast dispatch mechanism
  • an array of arguments to the method
  • the type of call being performed (functional, normal, or variable)
  • any block/closure being passed to the method
Additionally there are a few places where we also pass the calling object, to use for visibility checks.

The problem arises when compiling Ruby code into Java bytecode. The case I'm looking at involves one of our benchmarks where a local variable is accessed and "to_i" is invoked on it a large number of times:
puts Benchmark.measure {
a = 5;
i = 0;
while i < 1000000
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
i += 1;
end
}
(That's 100 variable accesses and calls per iteration, over one million iterations.)

The block being passed to Benchmark.measure gets compiled into its own Java method on the resulting class, called something like closure0. This gets further bound into a CompiledBlock adapter which is what's eventually called when the block gets invoked.

Unfortunately, all the additional context and overhead required in the compiled Ruby code seems to be causing trouble for HotSpot.

In this case, the pieces causing the most trouble are obviously the "a.to_i" bits. I'll break that down.

"a" is a local variable in the same lexical scope, so we go to a local variable in closure0 that holds an array of local variable values.
  aload(SCOPE_INDEX)
  ldc([index of a])
  aaload
But for Ruby purposes we must also make sure a Java null is replaced with "nil", so we have an actual Ruby object:
  dup
  ifnonnull(ok)
  pop
  aload(NIL_INDEX) # immediately stored when method is invoked
  label(ok)
So every variable access is at least seven bytecodes, since we need to access them from an object that can be shared with contained closures.
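Both requirements come from Ruby semantics rather than from anything JRuby-specific. Here is a minimal plain-Ruby sketch (the variable names are made up for illustration, and this is not JRuby internals): a block can write to a local in its enclosing scope, so the value has to live somewhere both can reach, and a local that was never assigned still reads as nil, which is presumably the kind of case the null check guards.

```ruby
# A block shares locals with its enclosing scope, so a write made inside
# the block must be visible outside it.
counter = 0
increment = lambda { counter += 1 }
3.times { increment.call }

# A local the parser has seen but that was never assigned reads as nil.
unassigned = 1 if false

puts counter            # 3
puts unassigned.inspect # nil
```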

Then there's the to_i call. This is where it starts to get a little ugly. to_i is basically a "toInteger" method, and in this case, calling against a Ruby Fixnum, it doesn't do anything but return "self". So it's a no-arg noop for the most part.
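A quick plain-Ruby check of that claim, for anyone who wants to confirm it (a sketch only; on current Ruby versions the class is Integer rather than Fixnum, but the behavior is the same):

```ruby
a = 5
result = a.to_i

# to_i on an integer hands back the receiver itself.
puts result.equal?(a)  # true
```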

The resulting bytecode to do the call ends up being uncomfortably long:

(assumes we already have the receiver, a Fixnum, on the stack)
  dup # dup receiver
  invokevirtual "getMetaClass"
  invokevirtual "getDispatcher" # a fast switch-based dispatcher
  swap # dispatcher under receiver
  aload(THREADCONTEXT)
  swap # threadcontext under receiver
  dup # dup receiver again
  invokevirtual "getMetaClass" # for call purposes
  ldc(methodName)
  ldc(methodIndex)
  getstatic(IRubyObject.EMPTY_ARRAY) # no args
  ldc(call type)
  getstatic(Block.NULL_BLOCK) # no closure
  invokevirtual "Dispatcher.callMethod..."
So we're looking at roughly 15 operations to do a single no-arg call. If we were processing argument lists, it would obviously be more, especially since all argument lists eventually get stuffed into an IRubyObject[]. Summed up, this means:

100 a.to_i calls * (7 + 15 ops) = 2200 ops

That's 2200 operations to do 100 variable accesses and calls, where in Java code it would be more like 200 ops (aload + invokevirtual). That's an order of magnitude more work.
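The same estimate as a few lines of Ruby, using the rough counts from this post (7 ops per variable access, about 15 per call, 2 for the plain Java equivalent); the variable names are only for illustration:

```ruby
access_ops = 7    # variable load plus the null-to-nil check
call_ops   = 15   # rough count for the dispatcher call sequence
calls      = 100  # a.to_i occurrences in one loop iteration

jruby_ops = calls * (access_ops + call_ops)
java_ops  = calls * 2   # aload + invokevirtual

puts jruby_ops             # 2200
puts java_ops              # 200
puts jruby_ops / java_ops  # 11
```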

The closure above, when run through my current compiler, generates a Java method of something like 4000 bytes. That may not sound like a lot, but it seems to be hitting a limit in HotSpot that prevents it from being JITed quickly (or sometimes, at all). And the size and complexity of this closure are certainly reasonable, if not common, in Ruby code.

There are a few questions that come out of this, and I'm looking for more ideas too.
  1. How bad is it to be generating large Java methods and how much impact does it have on HotSpot's ability to optimize?
  2. This code obviously isn't optimal (two calls to getMetaClass, for example), but the size of the callMethod signature means even optimal code will still have a lot of argument loading to do. Any ideas on how to get around this in a general way? I'm thinking my only real chance is to find simpler signatures to invoke, such as arity-specific (so there's no requirement for an array of args), avoiding passing context that usually isn't needed (an object knows its metaclass already), and reverting back to a ThreadLocal to get the ThreadContext (though that was a big bottleneck for us before...).
  3. Is the naive approach of breaking methods in two when possible "good enough"?
It should be noted that HotSpot eventually does JIT this code, and once it does, it's substantially faster than the current general release of Ruby 1.8. But I'm worried about the complexity of the bytecode and am actively looking for ways to simplify.
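To make the arity-specific idea from question 2 a bit more concrete, here is a toy model in plain Ruby. It is only a sketch: `call_method` and `call0` are hypothetical names, the lambdas stand in for dispatcher entry points, and none of this is JRuby's actual API. The point is just how much less the caller has to load for a no-arg call.

```ruby
# Generic path: every call carries the full context plus an args array.
# (Hypothetical signature, loosely modeled on the list of call context above.)
call_method = lambda do |context, receiver, meta_class, name, index, args, call_type, block|
  receiver.public_send(name, *args, &block)
end

# Arity-specific path for no-arg calls: no args array, no block, and no
# metaclass, since the receiver can supply its own.
call0 = lambda do |context, receiver, name|
  receiver.public_send(name)
end

context = Object.new  # stand-in for a ThreadContext

generic  = call_method.call(context, 5, 5.class, :to_i, 0, [], :normal, nil)
specific = call0.call(context, 5, :to_i)

puts generic   # 5
puts specific  # 5
```

The generic call site pushes eight values; the no-arg one pushes three. Whether that saving survives in real generated bytecode is exactly the open question.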
Written on July 11, 2007