Finding a JVM compilation strategy for Ruby's dynamic nature
In JRuby, we have a number of things we "decorate" the Java stack with for Ruby execution purposes. Put simply, we pass a bunch of extra context on the call stack for most method calls. At its most descriptive, making a method call passes the following along:
- a ThreadContext object, for accessing JRuby call frames and variable scopes
- the receiver object
- the metaclass for the receiver object
- the name of the method
- a numeric index for the method, used for a fast dispatch mechanism
- an array of arguments to the method
- the type of call being performed (functional, normal, or variable)
- any block/closure being passed to the method
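To make the size of that list concrete, here is a rough Java sketch of what a call signature carrying all eight items could look like. Every type and name in it (CallContextSketch, RubyObject, ThreadContext, MetaClass, Block, CallType, Dispatcher, RETURN_SELF, the method index of 0) is a hypothetical stand-in for illustration, not JRuby's real classes or signature.

```java
// Hypothetical stand-ins for illustration only; not JRuby's actual classes.
class CallContextSketch {
    interface RubyObject {}
    static final class ThreadContext {}
    static final class MetaClass {}
    static final class Block {
        static final Block NULL_BLOCK = new Block();
    }
    enum CallType { FUNCTIONAL, NORMAL, VARIABLE }

    // One parameter per item of context in the list above.
    interface Dispatcher {
        RubyObject callMethod(ThreadContext context, RubyObject receiver, MetaClass metaClass,
                              String name, int methodIndex, RubyObject[] args,
                              CallType callType, Block block);
    }

    static final RubyObject[] EMPTY_ARGS = new RubyObject[0];

    // A dispatcher that ignores all context and returns the receiver,
    // which is all a no-op method needs to do.
    static final Dispatcher RETURN_SELF =
        (context, receiver, metaClass, name, methodIndex, args, callType, block) -> receiver;

    // Even a no-arg call has to load all eight arguments.
    static RubyObject callToI(ThreadContext context, RubyObject receiver, MetaClass metaClass) {
        return RETURN_SELF.callMethod(context, receiver, metaClass, "to_i", 0,
                                      EMPTY_ARGS, CallType.NORMAL, Block.NULL_BLOCK);
    }

    public static void main(String[] unused) {
        RubyObject five = new RubyObject() {};
        System.out.println(callToI(new ThreadContext(), five, new MetaClass()) == five);
    }
}
```

The point of the sketch is only that every call site has to push all of these values before the invoke, whether or not the target method uses them.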
The problem arises when compiling Ruby code into Java bytecode. The case I'm looking at involves one of our benchmarks where a local variable is accessed and "to_i" is invoked on it a large number of times (100 accesses and calls per iteration, in a loop of 1 million iterations):
puts Benchmark.measure {
a = 5;
i = 0;
while i < 1000000
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i; a.to_i;
i += 1;
end
}
The block being passed to Benchmark.measure gets compiled into its own Java method on the resulting class, called something like closure0. This gets further bound into a CompiledBlock adapter which is what's eventually called when the block gets invoked.
Unfortunately, all the additional context and overhead required in the compiled Ruby code seems to be causing trouble for HotSpot.
In this case, the pieces causing the most trouble are obviously the "a.to_i" bits. I'll break that down.
"a" is a local variable in the same lexical scope, so we go to a local variable in closure0 that holds an array of local variable values.
aload(SCOPE_INDEX)
ldc([index of a])
aaload
But for Ruby purposes we must also make sure a Java null is replaced with "nil" so we have an actual Ruby object:
dup
ifnonnull(ok)
pop
aload(NIL_INDEX) # immediately stored when method is invoked
label(ok)
So every variable access is at least seven bytecodes, since we need to access them from an object that can be shared with contained closures.
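In Java source terms, the variable access above amounts to something like the following minimal sketch. The names LocalAccessSketch, NIL, and getLocal are made up for illustration, and plain Object stands in for whatever type the real scope array holds.

```java
// Hypothetical sketch of the local variable access; names are illustrative.
class LocalAccessSketch {
    // Stand-in for the Ruby nil object.
    static final Object NIL = new Object() {
        @Override
        public String toString() {
            return "nil";
        }
    };

    static Object getLocal(Object[] scope, int index) {
        Object value = scope[index];         // aload, ldc, aaload
        return value != null ? value : NIL;  // dup, ifnonnull, pop, aload
    }

    public static void main(String[] args) {
        Object[] scope = { Integer.valueOf(5), null };
        System.out.println(getLocal(scope, 0)); // the stored value
        System.out.println(getLocal(scope, 1)); // the nil stand-in
    }
}
```

A plain Java local would be a single aload, so the array indirection and the null check account for the other six instructions.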
Then there's the to_i call. This is where it starts to get a little ugly. to_i is basically a "toInteger" method, and in this case, calling against a Ruby Fixnum, it doesn't do anything but return "self". So it's a no-arg noop for the most part.
The resulting bytecode to do the call ends up being uncomfortably long:
(assumes we already have the receiver, a Fixnum, on the stack)
dup # dup receiver
invokevirtual "getMetaClass"
invokevirtual "getDispatcher" # a fast switch-based dispatcher
swap # dispatcher under receiver
aload(THREADCONTEXT)
swap # threadcontext under receiver
dup # dup receiver again
invokevirtual "getMetaClass" # for call purposes
ldc(methodName)
ldc(methodIndex)
getstatic(IRubyObject.EMPTY_ARRAY) # no args
ldc(call type)
getstatic(Block.NULL_BLOCK) # no closure
invokevirtual "Dispatcher.callMethod..."
So we're looking at roughly 15 operations to do a single no-arg call. If we were processing argument lists, it would obviously be more, especially since all argument lists eventually get stuffed into an IRubyObject[]. Summed up, this means:
100 a.to_i calls * (7 + 15 ops) = 2200 ops
That's 2200 operations to do 100 variable accesses and calls, where in Java code it would be more like 200 ops (aload + invokevirtual): an order of magnitude more work.
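The arithmetic is simple enough to check by hand, but here it is as a tiny sketch using the per-access and per-call counts from above. The class and method names are made up for the example.

```java
// Back-of-the-envelope count of bytecodes in the loop body; names are illustrative.
class OpCountSketch {
    static int totalOps(int calls, int accessOps, int callOps) {
        return calls * (accessOps + callOps);
    }

    public static void main(String[] args) {
        int compiledRuby = totalOps(100, 7, 15); // 7 ops per access, roughly 15 per call
        int plainJava = totalOps(100, 1, 1);     // aload + invokevirtual
        System.out.println(compiledRuby);             // 2200
        System.out.println(plainJava);                // 200
        System.out.println(compiledRuby / plainJava); // 11
    }
}
```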
The closure above when run through my current compiler generates a Java method of something like 4000 bytes. That may not sound like a lot, but it seems to be hitting a limit in HotSpot that prevents it being JITed quickly (or sometimes, at all). And the size and complexity of this closure are certainly reasonable, if not common in Ruby code.
There are a few questions that come out of this, and I'm looking for more ideas too.
- How bad is it to be generating large Java methods and how much impact does it have on HotSpot's ability to optimize?
- This code obviously isn't optimal (two calls to getMetaClass, for example), but the size of the callMethod signature means even optimal code will still have a lot of argument loading to do. Any ideas on how to get around this in a general way? I'm thinking my only real chance is to find simpler signatures to invoke, such as arity-specific (so there's no requirement for an array of args), avoiding passing context that usually isn't needed (an object knows its metaclass already), and reverting back to a ThreadLocal to get the ThreadContext (though that was a big bottleneck for us before...).
- Is the naive approach of breaking methods in two when possible "good enough"?
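To give the "simpler signatures" idea some shape, here is one possible sketch of arity-specific calls where the receiver finds its own metaclass and no argument array is built for zero or one argument. This is only an assumption about what such an interface might look like; ArityCallSketch, RubyObject, Fixnum, and the call overloads are all hypothetical, not existing JRuby API.

```java
// Hypothetical arity-specific call interface; not JRuby's actual API.
class ArityCallSketch {
    static final class ThreadContext {}

    interface RubyObject {
        // No metaclass, method index, call type, or block parameters here:
        // the receiver is assumed to resolve what it needs itself.
        RubyObject call(ThreadContext context, String name);
        RubyObject call(ThreadContext context, String name, RubyObject arg0);
        // General fallback for any other arity.
        RubyObject call(ThreadContext context, String name, RubyObject[] args);
    }

    static final class Fixnum implements RubyObject {
        final long value;

        Fixnum(long value) {
            this.value = value;
        }

        public RubyObject call(ThreadContext context, String name) {
            if ("to_i".equals(name)) return this; // no-op, returns self
            throw new UnsupportedOperationException(name);
        }

        public RubyObject call(ThreadContext context, String name, RubyObject arg0) {
            if ("+".equals(name)) return new Fixnum(value + ((Fixnum) arg0).value);
            throw new UnsupportedOperationException(name);
        }

        public RubyObject call(ThreadContext context, String name, RubyObject[] args) {
            if (args.length == 0) return call(context, name);
            if (args.length == 1) return call(context, name, args[0]);
            throw new UnsupportedOperationException(name);
        }
    }

    public static void main(String[] unused) {
        ThreadContext context = new ThreadContext();
        Fixnum five = new Fixnum(5);
        System.out.println(five.call(context, "to_i") == five);
        System.out.println(((Fixnum) five.call(context, "+", new Fixnum(2))).value);
    }
}
```

With a shape like this, a no-arg call site would load only the receiver, the context, and the name. Whether that is enough to shrink methods below whatever limit HotSpot is hitting is exactly the open question.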
Written on July 11, 2007