Wednesday, August 10, 2011

Invokedynamic in JRuby: Constant Lookup

This is the first of a set (not a series...there's no particular order) of articles I'll write on how JRuby is using invokedynamic. Hopefully they will show Rubyists how drastically invokedynamic is going to improve JRuby, and show other JVM language folks how to use invokedynamic effectively.

Hello friends!

I figured it's about time for me to start writing a bit on how JRuby is actually using invokedynamic.

As of today, JRuby utilizes invokedynamic far more than any other mainstream JVM language. We have worked very closely with the JSR leads and the OpenJDK developers to make sure invokedynamic runs well. And we have been advocating invokedynamic as a game-changer for the JVM and for JVM languages.

Let's explore one area where JRuby is using invokedynamic: Ruby's "constant" lookup.

Non-constant "Constants"

A constant in Ruby is defined on a class or module, and is subject to Ruby's typical namespacing logic. Constants start with a capital letter.

I often put "constants" in quotes because constant values can be reassigned. Reassignment will usually produce a warning...but not an error. This means we can't simply look up a constant value once and never look it up again (without special tricks I'll get into later).

Constant lookup is also a bit more complicated than method lookup. When retrieving a constant, Ruby first scans the classes and modules of the lexically-enclosing scopes for the constant. If it isn't found there, the search walks the current class's inheritance hierarchy. If the constant still can't be found, const_missing is called on the current class.

In order to make constant lookup fast, we want to do some sort of caching. In classic JRuby, Ruby 1.9 (YARV), Rubinius, and probably most other modern Ruby implementations, this is done with a global serial number. Whenever a constant is updated or a module is included (changing the inheritance hierarchy), that number is bumped, and all cached constants everywhere must be looked up again.

I have played with mechanisms for reducing the global impact of constant invalidation, but because constants can be looked up lexically it's simply too complicated to localize: we would need to invalidate classes down-hierarchy from the change, and we would also need to update all lexical scopes that might see the change.

Constant Invalidation in JRuby 1.6

The logic in JRuby 1.6 goes something like this:

  • If cache is empty or invalid, retrieve the constant value in the usual way (lexical, hierarchical search). Store the value with the current global constant serial number.
  • On subsequent lookups, check cache for validity against the global constant serial number. If we have a value cached and the cache is still valid, return it.
  • If any constant in the system is updated, or if a module is included into an existing class hierarchy, flip the serial number and force future constant lookups to re-cache.
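The scheme above can be sketched in plain Java. This is a hedged illustration with hypothetical names, not JRuby's actual code: a global serial number is bumped on any constant change, and each access site caches the value it found together with the serial it saw.

```java
import java.util.function.Supplier;

// Sketch of a serial-number-guarded constant cache (hypothetical names).
public class ConstantCache {
    static volatile int globalSerial = 0;   // bumped on any constant change

    private Object cachedValue;
    private int cachedSerial = -1;          // -1 == empty cache

    Object get(Supplier<Object> slowLookup) {
        int serial = globalSerial;
        if (cachedSerial != serial) {       // cache empty or invalidated
            cachedValue = slowLookup.get(); // full lexical/hierarchical search
            cachedSerial = serial;
        }
        return cachedValue;
    }

    // Called when a constant is redefined or a module is included.
    static void invalidateAll() {
        globalSerial++;
    }

    public static void main(String[] args) {
        ConstantCache cache = new ConstantCache();
        System.out.println(cache.get(() -> "value"));  // slow path
        System.out.println(cache.get(() -> "value"));  // cached
    }
}
```

Note that every single `get` still reads the volatile `globalSerial` and compares it, which is exactly the per-access cost the rest of this post is about eliminating.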
This turns out to work fairly well. The same mechanism in Ruby 1.9 produced drastically faster constant lookups, and JRuby's performance is even better than 1.9's.

But there's a problem here. Because the global constant serial number can change at any time, every constant access can potentially produce a new value. So we pay the cost of checking that serial number on every access, and we interfere with optimizations that want to see constant values actually be constant.

Can we do better?

Quick Invokedynamic Primer

The main atom of invokedynamic is the MethodHandle. Method handles are essentially function pointers, which can point at Java methods or fields, constructors, constant values, or even other method handles. The java.lang.invoke package also provides the MethodHandles utility class, which lets us juggle method handles in various ways:
  • adapting method signatures by casting, adding, moving, or dropping arguments
  • combining three handles ("test", "target", and "fallback") to form a new "guard with test" if-statement-like handle
  • wrapping handles with exception handling or argument/return pre/post-processing
You can think of method handles, and the chains of adapter handles that stitch them together, as a special sort of functional language the JVM knows how to optimize. Given a chain of handles, you should usually get a piece of code that optimizes as well as (or, in some cases, better than) the same logic written by hand in Java.
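As a concrete taste of that API, here is a small self-contained example (the method names `isSmall`, `dbl`, and `sqr` are made up for illustration) that looks up three handles and combines them with guardWithTest into a single if-statement-like handle:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class HandlePrimer {
    static boolean isSmall(int x) { return x < 10; }
    static int dbl(int x) { return x * 2; }
    static int sqr(int x) { return x * x; }

    // if (isSmall(x)) dbl(x); else sqr(x); -- expressed as one handle chain
    static int choose(int x) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType unary = MethodType.methodType(int.class, int.class);

        MethodHandle test = lookup.findStatic(HandlePrimer.class, "isSmall",
                MethodType.methodType(boolean.class, int.class));
        MethodHandle target = lookup.findStatic(HandlePrimer.class, "dbl", unary);
        MethodHandle fallback = lookup.findStatic(HandlePrimer.class, "sqr", unary);

        MethodHandle guarded = MethodHandles.guardWithTest(test, target, fallback);
        return (int) guarded.invokeExact(x);
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(choose(4));   // 8   (small, so doubled)
        System.out.println(choose(20));  // 400 (large, so squared)
    }
}
```

Once the JVM inlines a chain like this, the branch looks to the optimizer just like a hand-written if-statement.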

The invokedynamic bytecode simply provides a place to plug a method handle chain into code. When the JVM encounters an invokedynamic bytecode, it calls a "bootstrap method" associated with that bytecode for further instructions.

The bootstrap method returns a CallSite object, provided in java.lang.invoke. There are constant call sites for constant values, mutable call sites for when the target handle chain may have to change, and volatile call sites for when those changes must immediately be reflected across threads.

Once a CallSite has been installed for a given invokedynamic, subsequent hits skip the bootstrapping process, and we're off to the races.
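We can't emit an actual invokedynamic instruction in plain Java source, but the bootstrap side looks like the sketch below (the names `bootstrap` and `greet` are made up for illustration). Invoking the site's dynamicInvoker by hand stands in for what the instruction does after linking:

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class Bootstrap {
    static String greet(String name) { return "hello, " + name; }

    // The signature the JVM expects of a bootstrap method: it receives the
    // caller's lookup, the call name, and the call's type, and returns a CallSite.
    public static CallSite bootstrap(MethodHandles.Lookup lookup,
                                     String name, MethodType type) throws Exception {
        MethodHandle target = lookup.findStatic(Bootstrap.class, "greet", type);
        return new ConstantCallSite(target);   // this target can never change
    }

    public static void main(String[] args) throws Throwable {
        MethodType type = MethodType.methodType(String.class, String.class);
        // The JVM would call this once, on first hitting the invokedynamic:
        CallSite site = bootstrap(MethodHandles.lookup(), "greet", type);
        // Subsequent "calls" at the site go straight to the installed target:
        String result = (String) site.dynamicInvoker().invokeExact("indy");
        System.out.println(result);   // hello, indy
    }
}
```

Swapping ConstantCallSite for MutableCallSite or VolatileCallSite gives the same linking behavior with a retargetable (or thread-visible) target.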


I mentioned that the MethodHandles class provides a "guardWithTest" method for combining a test, a target (the "then" branch), and a fallback (the "else" branch). SwitchPoint, also in java.lang.invoke, acts like an on/off guardWithTest that once turned off can never be turned on again. You provide a target and fallback, and until the "switch" is thrown the target will be invoked. After the switch is thrown the fallback will be called.

What's the difference between this and a guardWithTest where the test just pings some global value? The difference is that SwitchPoint doesn't need to check anything.
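Here is that behavior in miniature (the method names `fast` and `slow` are made up for illustration): the target runs until the switch is invalidated, after which every call goes to the fallback, forever.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class Switching {
    static String fast() { return "cached"; }
    static String slow() { return "recache"; }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodType type = MethodType.methodType(String.class);
        MethodHandle target   = lookup.findStatic(Switching.class, "fast", type);
        MethodHandle fallback = lookup.findStatic(Switching.class, "slow", type);

        SwitchPoint sp = new SwitchPoint();
        MethodHandle guarded = sp.guardWithTest(target, fallback);

        System.out.println((String) guarded.invokeExact());  // cached
        SwitchPoint.invalidateAll(new SwitchPoint[]{ sp });   // throw the switch
        System.out.println((String) guarded.invokeExact());  // recache
    }
}
```

Note that invalidation is a static bulk operation, SwitchPoint.invalidateAll; it is deliberately expensive, on the assumption that switches are thrown rarely.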

Optimization and Deoptimization in the JVM

When the JVM decides to optimize a piece of code, it does so in an optimistic way. In very broad terms, this means it assumes its information up to this point is perfect: no new methods or classes will be introduced, profiling information is accurate, etc. Based on this "perfect" view of the world, it aggressively optimizes code.

Of course, the world isn't perfect. The JVM has to give up profiling and monitoring at some point, so it always has an imperfect view of the system. In order to avoid its aggressive optimizations triggering a fatal error later on, JVMs like OpenJDK (Hotspot) do something called deoptimization.

Deoptimization is the process by which running, optimized code adapts on-the-fly to a changing system. In OpenJDK, there are several ways this is accomplished:
  • branches out of compiled code back into the interpreter, when compiled code is determined to be invalid
  • guards around inlined virtual method accesses, to ensure we're still calling against the same class
  • on-stack replacement, for fixing up a running method already on the native call stack
  • ...
Because of this ability to deoptimize, it's possible to support zero-cost guards at the JVM level. Returning to SwitchPoint, we can see how this new form of "guardWithTest" can be basically free: we're explicitly telling the JVM that flipping the switch is a rare event, so it can optimize the not-yet-flipped case aggressively.

SwitchPoint for Constant Lookup

JRuby on invokedynamic uses SwitchPoint for constant lookup, as you'd expect. Instead of actively pinging that global constant serial number, we instead use a global SwitchPoint object to guard all cached constant accesses. When it comes time to invalidate the system's constants, we just flip the SwitchPoint off and create a new one. All SwitchPoint-guarded constant accesses in the system must then recache and use the new SwitchPoint.

In a well-behaved system, we should reach a steady state where no new constants are being defined and no new modules are being introduced. Because we're using SwitchPoint, the stable state means all constant accesses are treated as truly constant by the JVM, allowing optimizations that were impossible before. And of course this also means that we've achieved constant lookup performance very near a theoretical maximum.
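The scheme described above can be sketched in plain Java. This is an illustration with hypothetical names, not JRuby's actual code: each access site is a MutableCallSite whose target is the constant value behind the current global SwitchPoint; invalidation flips that switch and installs a fresh one, so every site falls back to its slow path exactly once.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;
import java.lang.invoke.SwitchPoint;
import java.util.function.Supplier;

public class ConstantSite {
    static volatile SwitchPoint globalSwitch = new SwitchPoint();

    final MutableCallSite site =
            new MutableCallSite(MethodType.methodType(Object.class));
    final MethodHandle recache;                    // the permanent fallback
    final Supplier<Object> slowLookup;             // lexical/hierarchical search
    int slowLookups = 0;                           // for demonstration only

    ConstantSite(Supplier<Object> slowLookup) throws Exception {
        this.slowLookup = slowLookup;
        this.recache = MethodHandles.lookup()
                .findVirtual(ConstantSite.class, "recache",
                             MethodType.methodType(Object.class))
                .bindTo(this);
        site.setTarget(recache);                   // first call: slow path
    }

    Object recache() {
        slowLookups++;
        Object value = slowLookup.get();
        // "While the switch holds, return value; once flipped, recache again."
        site.setTarget(globalSwitch.guardWithTest(
                MethodHandles.constant(Object.class, value), recache));
        return value;
    }

    Object get() throws Throwable {
        return site.getTarget().invoke();
    }

    // A constant was redefined or a module included: flip the old switch,
    // hand out a fresh one for all the recaches that follow.
    static void invalidateConstants() {
        SwitchPoint old = globalSwitch;
        globalSwitch = new SwitchPoint();
        SwitchPoint.invalidateAll(new SwitchPoint[]{ old });
    }

    public static void main(String[] args) throws Throwable {
        ConstantSite site = new ConstantSite(() -> "3.14159");
        System.out.println(site.get());            // slow path, then cached
        System.out.println(site.get());            // straight to the constant
        ConstantSite.invalidateConstants();
        System.out.println(site.get());            // recache under the new switch
    }
}
```

In the steady state the site's target is just a SwitchPoint-guarded MethodHandles.constant, which is exactly the shape the JVM can fold into a true constant.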


First, a caveat: SwitchPoint is implemented in a fairly naïve way in the released OpenJDK 7, using a volatile field as the switch value. As a result, SwitchPoint's guardWithTest is currently very slow, and JRuby's SwitchPoint-based constant logic must be explicitly enabled. The numbers I show below are based on leading-edge Hotspot compiler patches that will go into the first update release (numbers provided by one of the Hotspot devs, Christian Thalinger...thanks, Christian!).

The benchmark we're running is a modified version of bench_const_lookup in JRuby's benchmark suite. The modification here runs more iterations (10M instead of 1M) with more constant lookups (50 instead of 10) to get a better idea of optimized performance.

Here's JRuby running our constant-lookup benchmark without SwitchPoint-based constants on Java 7:

As I said before, this is pretty good. JRuby's existing constant lookup performance is roughly 2x faster than Ruby 1.9.2.

Next, we'll try JRuby with SwitchPoint constants on Java 7 (released version, so we expect this to be slow):

The perf hit of purely volatile SwitchPoint is apparent.

And finally, JRuby with SwitchPoint constants on a dev build of Hotspot, which uses deoptimization rather than a volatile field:

This is basically the performance of the 10M iteration loop alone. In fact, if you look at the resulting optimized assembly, the constant accesses have been eliminated entirely, since they're optimistically inlined and never used. Of course this would normally not happen in real code, but it shows how much better the JVM can optimize Ruby's behavior using invokedynamic.


  1. Seems unhelpful to compare benchmarks where one shows the work completely optimized away. What is the comparison if the value is actually computed and returned?

  2. Brian: That's certainly true, but it demonstrates that the logic of looking up the constant is completely optimized away. As a result, if the constant isn't used, it's not retrieved.

    I'm working on some better benchmarks now, since invokedynamic is starting to optimize Ruby code so well that it disappears in trivial benchmarks. That's a good thing!

In any case, if the constant accesses weren't optimized completely away, they'd each be equivalent to a single memory access...or a register access, if Hotspot decides to optimize it that way. Either way, it's just about as fast as it can possibly be.

  3. Wonderful stuff - you make Invokedynamic sound like magic pixie dust!

    Please, please write more posts like this - they're great to read :-)

  4. Why wouldn't the constants normally be inlined in real code?

  5. I imagine the global serial number check was a small amount of bytecode. How much more bytecode is associated with using a call site instead? Do you have a feel yet for how much code size impact there will be using call sites everywhere?

  6. Why did you get my hopes up like that!? When I saw the title I thought you meant that invokedynamic is O(1).

    What is the time complexity of invokedynamic?

jedediah: The constant lookup was subject to several levels of indirection. The full logic was more like:

    * Retrieve current JRuby runtime (field access against on-stack ThreadContext object)

    * Get constant serial number

    * Get cache object for current script (usually on stack)

    * Compare cached serial number with actual (array in cache object)

    * Retrieve cached constant (array in cache object)

    All this logic can inline, but because of the variability and memory indirections, none of it can be eliminated. The new logic stores the constant value directly at the access site, so there's no indirection. SwitchPoint then handles throwing that optimized version away if anything changes.

    Rich: In this case, the constant caching via invokedynamic results in *nearly zero* bytecode. I emit an invokedynamic that's bootstrapped with a MutableCallSite. The site's target is a SwitchPoint.guardWithTest that wraps the actual constant value. Ultimately, SwitchPoint and MutableCallSite logic don't make it into asm, so the constant access is a direct memory retrieval from a location known to be immutable. As a result, it all folds away if the value is not used.

  8. Paul: I'm not sure where joke ends and real question begins :) I'll answer the whole thing as though it's serious.

    Of course I'm talking about lookup for Ruby constants. In the cached case, this is O(1) in both versions, though it's a much smaller 1 with invokedynamic. In the uncached case, it's O(N) where N is the number of lexical scopes + hierarchical ancestors to search.

  9. Charles: So you are saying you can set up a call site, bootstrap method etc with no bytecode? I'm not talking about what gets run per call, but the code for setting up the sites - i.e. how much bigger is your program?

  10. Rich: Ahh, the actual plumbing.

    The program itself contains *only* the invokedynamic bytecode and associated constant pool entries. The bootstrap method can live anywhere, such as in your own runtime library, and you just re-use it for the invokedynamic bytecodes.

    As an example, here's a call to a + b in bytecode. This is *all* that is emitted, other than some boilerplate that I've omitted:

    So tl;dr answer to your question is: the bootstrap method has to live somewhere, but not necessarily in the code that contains invokedynamic.

  11. This is great stuff. Although I don't personally use JRuby/Ruby, I do use several other JVM-based languages, including developing an internal one. I have looked into seeing how invokedynamic could improve our language. It is great to see how the community uses invokedynamic and its performance benefits.

    In your future 'sets' of blog articles on invokedynamic, can you add more code/context such as the creation of the bootstrap and the call sites. Many of the examples out there are either too simple or just way outdated (to the point that even the package/class names no longer apply). I'd enjoy being able to see the different use cases of building up the call sites.

    In any sense, keep up the great work. It is awesome that dynamic JVM languages can finally become highly performant.

  12. Nicholas: Yes, I think I'm long overdue for an epic post on invokedynamic with code examples, etc. I'll try to do that some time soon, since I think I have a pretty good grasp of the practical implications now.

  13. Hi Charles,

    I'm trying to write a memcache gem for jRuby at the moment and I ran my performance benchmarks with vanilla OpenJDK 7 and jruby-head.
    For some reason this very simple thing is slower than 1.6.4.
    Are there any specific versions of Java and Jruby you suggest to try?
    If you'd like I'll send you some benchmarks later on.