Running JRuby on the Graal JIT
Hello, friends! Long time no blog!
I'm still here hacking away on JRuby for the benefit of Rubyists everywhere, usually slogging through compatibility fixes and new Ruby features. However with the release of JRuby 9.2, we've caught up to Ruby 2.5 (the current release) and I'm spending a little time on performance.
I thought today would be a good opportunity to show you how to start exploring some next-generation JRuby performance by running on top of the Graal JIT.
I'm still here hacking away on JRuby for the benefit of Rubyists everywhere, usually slogging through compatibility fixes and new Ruby features. However with the release of JRuby 9.2, we've caught up to Ruby 2.5 (the current release) and I'm spending a little time on performance.
I thought today would be a good opportunity to show you how to start exploring some next-generation JRuby performance by running on top of the Graal JIT.
Graal
The Graal JIT is a project that grew out of a Sun Microsystem Labs project called MaxineVM. Maxine was an attempt to implement a JVM entirely in Java...via various specialized Java dialects, annotations, and compilation strategies. Graal is the latest iteration of the JIT work done for Maxine, and provides an optimizing JVM JIT implemented entirely in Java. Maxine only lives on as a research project, but Graal is rapidly shaping up to become the preferred JIT in OpenJDK.
The majority of optimizations that Graal does to code are no different from the "classic" OpenJDK JIT, Hotspot's "C2" compiler. You get all the usual dead code, loop unrolling, method inlining, branch profiling, and so on. However Graal goes beyond Hotspot in a few key ways. Most interesting to JRuby (and other dynamic languages) is the fact that Graal finally brings good Escape Analysis to the JVM.
Escape Analysis
The biggest gains JRuby sees running on Graal are from Escape Analysis (EA). The basic idea behind escape analysis goes like this: if you allocate an object, use it and its contents, and then abandon that object all within the same thread-local piece of code (i.e. not storing that object in a globally-visible location) then there's no reason to allocate the object. Think about this in terms of Java's autoboxing or varargs: if you box arguments or numbers, pass them to a method, and then unbox them again...the intermediate box was not really necessary.
The basic idea of EA is not complicated to understand, but implementing well can be devilishly hard. For example, the Hotspot JIT has had a form of escape analysis for years, optional in Java 7 and I believe turned on by default during Java 8 maintenance releases. However this EA was very limited...if any wrapper object was used across a branch (even a simple if/else) or if it might under any circumstance leave the compiled code being optimized, it would be forced to allocate every time.
The Graal EA improves on this with a technique called Partial Escape Analysis (PEA). In PEA, branches and loops do not interfere with the process of eliminating objects because all paths are considered. In addition, if there are boxed values eventually passed out of the compiled code, their allocation can be deferred until needed.
JRuby on Graal
By now you've likely heard about TruffleRuby, a Graal-based Ruby implementation that uses the Truffle framework for implementing all core language features and classes. Truffle provides many cool features for specializing data structures and making sure code inlines, but many of the optimizations TR users see is due to Graal doing such a great job of eliminating transient objects.
And because Graal is also open source and available in Java 10 and higher, JRuby can see some of those benefits!
There's two easy ways for you to test out JRuby on the Graal JIT
Using Java 10 with Graal
OpenJDK 9 included a new feature to pre-compile Java code to native (with the "jaotc" command), and this compiler made use of Graal. In OpenJDK 10, Graal is now included even on platforms where the ahead-of-time compiler is not supported.
Download and install any OpenJDK 10 (or higher) release, and pass these flags to the JVM (either with -J flags to JRuby or using the JAVA_OPTS environment variable):
-XX:+UnlockExperimentalVMOptions
-XX:+EnableJVMCI
-XX:+UseJVMCICompiler
Using GraalVM
GraalVM is a new build of OpenJDK that includes Graal and Truffle by default. Depending on which one you use (community or enterprise edition) you may also have access to additional proprietary optimizations.
GraalVM can be downloaded in community form for Linux and enterprise form for Linux and MacOS from graalvm.org. Install it and set it up as the JVM JRuby will run with (JAVA_HOME and/or PATH, as usual). It will use the Graal JIT by default.
Additional JRuby Flags
You will also want to include some JRuby flags that help us optimize Ruby code in ways that work better on Graal:
- -Xcompile.invokedynamic enables our use of the InvokeDynamic feature, which lets dynamic calls inline and optimize like static Java calls.
- -Xfixnum.cache=false disables our cache of small Fixnum objects. Using the cache helps on Hotspot, which has no reliable escape analysis, but having those objects floating around sometimes confuses Graal's partial escape analysis. Try your code with the cache on and off and let us know how it affects performance.
What to Expect
We have only been exploring how JRuby can make use of Graal over the past few months, but we're already seeing positive results on some benchmarks. However the optimizations we want to see are heavily dependent on inlining all code, an area where we need some work. I present two results here, one where Graal is working like we expect, and another where we're not yet seeing our best results.
Numeric Algorithms Look Good
One of the biggest areas JRuby performance suffers is in numeric algorithms. On a typical Hotspot-based JVM, all Fixnum and Float objects are actually object boxes that must be allocated and garbage collected like any other object. As you'd expect, this means that numeric algorithms pay a very high cost. This also means that Graal's partial escape analysis gives us huge gains, because all those boxes get swept away.
This first result is from a pure-Ruby Mandelbrot fractal-generating algorithm that makes the rounds periodically. The math here is almost all floating-point, with a few integers thrown in, but the entire algorithm fits in a single method. With JRuby using invokedynamic and running on Graal, all the code inlines and optimizes like a native numeric algorithm! Hooray for partial escape analysis!
We also have anecdotal reports of other numeric benchmarks performing significantly better with JRuby on Graal than JRuby on Hotspot...and in some cases, JRuby on Graal is the fastest result available!
Data Structures Need More Work
Of course most applications don't just work with numbers. They usually have a graph of objects in memory they need to traverse, search, create and destroy. In many cases, those objects include Array and Hash instances rather than user-defined objects, and frequently these structures are homogeneous: they contain only numbers, for example.
JRuby currently does not do everything it could to inline object creation and access. We also are not doing any numeric specialization of structures like Array, which means a list of Fixnums actually has to allocate all those Fixnum objects and hold them in memory. These are areas we intend to work on; I am currently looking at doing some minimal specialization of algorithmic code and numeric data structures, and we will release some specialization code for instance variables (right-sizing the object rather than using a separate array to hold instance variables) in JRuby 9.2.1.
The red/black benchmark tests the performance of a pure-Ruby red/black tree implementation. It creates a large graph, traverses it, searches it, and clears it. JRuby with InvokeDynamic on Hotspot still provides the best result here, perhaps because the extra magic of Graal is not utilized well.
This benchmark of ActiveRecord shows the opposite result: JRuby on Graal gets the best performance. Without having dug into the details, I'd guess there's some hot loop code in the middle of the "find all" logic, and that loop is optimizing well on Graal. But ultimately the objects all need to be created and the gain from Graal is fairly limited. I also have examples of other read, write, and update operations and only about half of them are faster with Graal.
Your Turn
The JRuby team (of which only two of us are actually employed to work on JRuby) has always managed resources with an eye for compatibility first. This has meant that JRuby performance -- while usually solid and usually faster than CRuby -- has received much less attention than some other Ruby implementations. But it turns out we can get really solid performance simply by inlining all appropriate code, specializing appropriate data structures, and running atop a JIT with good escape analysis.
We will be working over the next few months to better leverage Graal. In the mean time, we'd love to hear from JRuby users about their experiences with JRuby on Graal! If your code runs faster, let us know so we can tell others. If your code runs slower, let us know so we can try to improve it. And if you're interested in comparing with other Ruby implementations, just make sure your benchmark reflects a real-world case and doesn't simply optimize down to nothing.
JRuby on Graal has great promise for the future. Try it out and let us know how it goes!
Written on July 3, 2018