So You Want To Optimize Ruby

I was recently asked for a list of "hard problems" a Ruby implementation really needs to solve before reporting benchmark numbers. You know...the sort of problems that might invalidate early perf numbers because they impact how you optimize Ruby. This post is a rework of my response...I hope you find it informative!

Fixnum to Bignum promotion

In Ruby, Fixnum math can promote to Bignum when the result is out of Fixnum's range. On implementations that use tagged pointers to represent Fixnum (MRI, Rubinius, MacRuby), the Fixnum range is somewhat narrower than the native word size (32/64 bits), since at least one bit is reserved for the tag. On JRuby, Fixnum is always a straight 64-bit signed value.

This promotion is a performance concern for a couple reasons:
  • Every math operation that returns a new Fixnum must be range-checked. This slows all Fixnum operations.
  • It is difficult (if not impossible) to predict whether a Fixnum math operation will return a Fixnum or a Bignum. Since Bignum is always represented as a full object (not a primitive or a tagged pointer) this impacts optimizing Fixnum math call sites.
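
You can watch the promotion happen from Ruby itself. A minimal sketch, run on 64-bit MRI where the tagged Fixnum range is 63 bits (the exact boundary differs per implementation):

    a = 2 ** 62 - 1   # largest Fixnum on 64-bit MRI; one bit is lost to the tag
    a.class           # => Fixnum
    (a + 1).class     # => Bignum; the addition overflowed the tagged range
                      # (on JRuby, with a full 64-bit long, this is still a Fixnum)

The range check that produced that Bignum runs on every Fixnum operation, whether or not it ever overflows.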

Floating-point performance

A similar concern is the performance of floating-point values. Most of the native implementations have tagged values for Fixnum, but only one I know of (MacRuby) uses tagged values for Float. This can skew expectations because an implementation may perform very well on integer math and considerably worse on floating-point math due to the objects created (and collected). JRuby uses objects for both Fixnum and Float, so performance is roughly equivalent (and slower than I'd like).
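
You can observe the boxing from Ruby. A small sketch, assuming an implementation that heap-allocates Float (MRI 1.9, JRuby) rather than tagging it (MacRuby):

    i1 = 1 + 1
    i2 = 1 + 1
    i1.object_id == i2.object_id   # => true: Fixnums are tagged immediates, no allocation

    x  = 1.0
    f1 = x + x
    f2 = x + x
    f1.object_id == f2.object_id   # => false: every Float result is a fresh heap object

    sum = 0.0
    1_000_000.times { sum += 0.5 } # a million short-lived Floats for the GC to chase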

Closures

Any language that supports closures ("blocks" in Ruby) has to deal with efficiently accessing frame-local data from calls down-stack. In Java, both anonymous inner classes and the upcoming lambda feature treat frame-local values (local variables, basically) as immutable...so their values can simply be copied into the closure object or carried along in some other way. In Ruby, local variables are always mutable, so an eventual activation of a closure body needs to be able to write into its containing frame. If a runtime does not support arbitrary frame access (as is the case on the JVM) it may have to allocate a separate data structure to represent those frame locals...and that impacts performance.
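
A minimal sketch of the problem:

    def make_counter
      count = 0                # lives in make_counter's frame
      lambda { count += 1 }    # the closure must *write* that same variable
    end

    c = make_counter
    c.call   # => 1
    c.call   # => 2...the frame's local outlives the frame itself

On the JVM, count can't simply be copied into the lambda; JRuby has to hoist it into a heap-allocated structure that both the method body and the closure share.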

Bindings and eval

The eval methods in Ruby generally accept an optional binding under which to run. This means any call to binding must return a fully-functional execution environment, and in JRuby both eval and binding force a full deoptimization of the surrounding method body.
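
For example, a binding handed to eval exposes the captured frame for both reads and writes:

    def deopt_me
      secret = 42
      b = binding               # captures the *live* frame, not a snapshot
      eval("secret = 99", b)    # writes straight into deopt_me's local
      secret                    # => 99
    end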

There's an even more unpleasant aspect to this, however: every block can be used as a binding too.

All blocks can be turned into a Proc and used as a binding, which means every block in the system must have full access to values in the containing call frame. Most implementers hate this feature, since it makes optimizing call frames in the presence of blocks much more difficult. And because a block can become a binding, literally all frame data must be accessible: local variables; frame-local $ variables like $~; the constant lookup environment; method visibility; and so on.
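
Proc#binding makes this concrete: an escaped block, even an empty one, can later be used to poke around its defining frame (names here are illustrative):

    def innocent_method
      hidden = "frame data"
      proc { }                      # looks harmless...
    end

    b = innocent_method.binding     # ...but it carries the whole frame
    eval("local_variables", b)      # => [:hidden]
    eval("hidden", b)               # => "frame data"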

callcc and Continuation

JRuby doesn't implement callcc since the JVM doesn't support continuations, but any implementation hoping to optimize Ruby will have to take a stance here. Continuations obviously make optimization more difficult since you can branch into and out of execution contexts in rather unusual ways.
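
Even the tame escape-only pattern below re-enters a frame in a way the optimizer must anticipate; fully re-invoking a saved continuation later is stranger still. A sketch (the require is needed on 1.9; callcc is built in on 1.8):

    require 'continuation'

    def first_negative(list)
      callcc do |escape|
        list.each { |x| escape.call(x) if x < 0 }   # jump out of callcc from inside the block
        nil
      end
    end

    first_negative([3, 1, -4, 1, 5])   # => -4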

Fiber implementation

In JRuby, each Fiber runs on its own thread (though we pool the native thread to reduce Fiber spin-up costs). Other than that they operate pretty much like closures.

A Ruby implementer needs to decide whether it will use C-style native stack juggling (which makes optimizations like frame elimination trickier to implement) or give Fibers their own stacks in which to execute independently.
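
Whichever strategy is chosen, the observable behavior is the same: each resume must restore a suspended execution context exactly where it left off.

    fib = Fiber.new do
      a, b = 0, 1
      loop do
        Fiber.yield a      # suspend this stack, hand control back to the caller
        a, b = b, a + b
      end
    end

    3.times { print fib.resume, " " }   # => 0 1 1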

Thread/frame/etc local $globals

Thread globals are easy, obviously. All(?) host systems already have some representation of thread-local values. The tricky ones are explicit frame globals like $~ and $_ and implicit frame-local values like visibility, etc.

In the case of $~ and $_, the challenge is not in representing accesses of them directly but in handling implicit reads and writes of them that cross call boundaries. For example, calling [] on a String and passing a Regexp will cause the caller's frame-local $~ (and related values) to be updated to the MatchData for the pattern match that happens inside []. There are a number of core Ruby methods like this that can reach back into the caller's frame and read or write these values. This obviously makes reducing or eliminating call frames very tricky.
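
Here's the String#[] case spelled out...the match runs inside the core method, but the result lands in the caller's frame:

    def extract_digits(str)
      str[/\d+/]    # the pattern match happens inside String#[]...
      $~[0]         # ...yet the MatchData was written into *this* frame's $~
    end

    extract_digits("abc123")   # => "123"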

In JRuby, we track all core methods that read or write these values, and if we see those methods called in a body of code (by name, mind you...this is a static inspection), we will stand up a call frame for that body. This is not ideal. We would like to move these values into a separate stack that's lazily allocated only when actually needed, since the name-based check means frame-crossing methods like String#[] force same-named methods like Array#[] to deoptimize too.

C extension support

If a given Ruby implementation is likely to fit into the "native" side of Ruby implementations (as opposed to implementations like JRuby or IronRuby that target an existing managed runtime), it will need to have a C extension story.

Ruby's C extension API is easier to support than some languages' native APIs (e.g. no reference-counting as in Python) but it still very much impacts how a runtime optimizes. Because the API needs to return forever-valid object references, implementations that don't give out pointers will have to maintain a handle table. The API includes a number of macros that provide access to object internals; they'll need to be simulated or explicitly unsupported. And the API makes no guarantees about concurrency and provides few primitives for controlling concurrent execution, so most implementations will need to lock around native downcalls.

An alternative for a new Ruby implementation is to expect extensions to be written in the host runtime's native language (Java or other JVM languages for JRuby; C# or other .NET languages for IronRuby, etc). However this imposes a burden on folks implementing language extensions, since they'll have to support yet another language to cover all Ruby implementations.

Ultimately, though, the unfortunate fact for most "native" impls is that regardless of how fast you can run Ruby code, the choke point is often going to be the C API emulation, since it will require a lot of handle-juggling and indirection compared to MRI. So without supporting the C API, there's a very large part of the story missing...a part of the story that accesses frame locals, closure bodies, bindings, and so on.

Of course if you can run Ruby code as fast as C, maybe it won't matter. :) Users can just implement their extensions in Ruby. JRuby is starting to approach that kind of performance for non-numeric, non-closure cases, but that sort of perf is not yet widespread enough to bank on.

Ruby 1.9 encoding support

Any benchmark that touches anything relating to binary or text data must have encoding support, or you're really fudging the numbers. Encoding touches damn near everything, and can add a significant amount of overhead to String-manipulating benchmarks.
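
Even trivial String operations change character once bytes and characters diverge (assuming UTF-8 source encoding):

    # coding: utf-8
    s = "résumé"
    s.bytesize   # => 8 (two of the characters take 2 bytes each)
    s.length     # => 6 (counting characters means decoding, not reading a field)
    s[5]         # => "é" (indexing is no longer a constant-time byte offset)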

Garbage collection and object allocation

It's easy for a new impl to show good performance on benchmarks that do little or no allocation and require no GC, like raw numerics (fib, tak, etc). MacRuby and Rubinius, for example, really shine here. But many impls have drastically different performance once an algorithm starts allocating objects. Very few applications are doing pure integer numeric algorithms, so object allocation and GC performance are an absolutely critical part of the performance story.
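
A hypothetical pair of microbenchmarks shows the trap: both compute the same values, but the second allocates on every step, so it measures the allocator and GC as much as the interpreter or JIT:

    # Allocation-free: rewards fast arithmetic and cheap calls.
    def fib(n)
      n < 2 ? n : fib(n - 1) + fib(n - 2)
    end

    # Same math, but builds a fresh Array each iteration.
    def fib_alloc(n)
      pair = [0, 1]
      n.times { pair = [pair[1], pair[0] + pair[1]] }
      pair[0]
    end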

Concurrency / Parallelism

If you intend to be an impl that supports parallel thread execution, you're going to have to deal with various issues before publishing numbers. For example, threads can #kill or #raise each other, which in a truly parallel runtime requires periodic safepoints/pings to know whether a cross-thread event has fired. If you're not handling those safepoints, you're not telling the whole story, since they impact execution.
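
A minimal sketch of the event in question...the loop below does no I/O and makes no calls that voluntarily check for interrupts, so the runtime itself has to poll for the pending cross-thread exception:

    t = Thread.new do
      x = 0
      loop { x += 1 }    # nothing in here checks for interrupts on its own
    end

    sleep 0.1
    t.raise(RuntimeError, "raised from another thread")   # must land mid-loop
    t.join rescue nil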

There's also the thread-safety of runtime structures to be considered. As an example, Rubinius until recently had a hard lock around a data structure responsible for invalidating call sites, which meant that its simple inline cache could see a severe performance degradation at polymorphic call sites (they've since added polymorphic caching to ameliorate this case). The thread-safety of a Ruby implementation's core runtime structures can drastically impact even straight-line, non-concurrent performance.

Of course, for an impl that doesn't support parallel execution (which would put it in the somewhat more limited realm of MRI), you can get away with GIL scheduling tricks. You just won't have a very good in-process scaling story.

Tracing/debugging

All current impls support tracing or debugging APIs, though some (like JRuby) require you to enable support for them via command-line or compile-time flags. A Ruby implementation needs to have an answer for this, since the runtime-level hooks required will have an impact...and may require users to opt-in.

ObjectSpace

ObjectSpace#each_object needs to be addressed before talking about performance. In JRuby, supporting each_object over arbitrary types was a major performance issue, since we had to track all objects in a separate data structure in case they were needed. We ultimately decided each_object would only work with Class and Module, since those were the major practical use cases (and tracking Class/Module hierarchies is far easier than tracking all objects in the system).
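
In JRuby terms (the -X+O flag turns full ObjectSpace support back on, if you really need it):

    # Fine on JRuby out of the box: Class and Module are already tracked.
    ObjectSpace.each_object(Class).count

    # Requires tracking every live object; on JRuby this only works
    # when ObjectSpace support is explicitly enabled.
    ObjectSpace.each_object(String) { |s| puts s }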

Depending on how a Ruby implementation tracks in-memory objects (and depending on the level of accuracy expected from ObjectSpace#each_object) this can impact how allocation logic and GC are optimized.

Method invalidation

Several implementations can see severe global effects due to methods like Object#extend blowing all global caches (or at least several caches), so you need to be able to support #extend in a reasonable way before talking about performance. Singleton objects also have a similar effect, since they alter the character of method caches by introducing new anonymous types at any time (and sometimes, in rapid succession).
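
Both cache-hostile operations in one sketch:

    class Greeter
      def greet; "hello"; end
    end

    g = Greeter.new
    def g.greet; "hi"; end    # singleton method: g now has its own anonymous class

    module Polite
      def greet; "good day"; end
    end
    g.extend(Polite)          # inserts another anonymous module into g's hierarchy;
                              # a naive implementation flushes method caches globally here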

In JRuby, singleton and #extend effects are limited to the call sites that see them. I also have an experimental branch that's smarter about type identity, so simple anonymous types (that have only had modules included or extended into them) will not damage caches at all. Hopefully we'll land that in a future release.

Constant lookup and invalidation

I believe all implementations have implemented constant cache invalidation as a global invalidation, though there are other more complicated ways to do it. The main challenge is the fact that constant lookup is tied to both lexical scope and class hierarchy, so invalidating individual constant lookup sites is usually infeasible. Constant lookup is also rather tricky and must be implemented correctly before talking about the performance of any benchmark that references constants.
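
The dual nature of the lookup is easy to see (names are illustrative):

    X = "toplevel"

    class Parent
      X = "parent"
    end

    class Child < Parent
      def read_x
        X    # lexical scopes are searched first (Child defines no X),
      end    # then the class hierarchy: this finds Parent::X
    end

    Child.new.read_x   # => "parent"

A site that resolved X could be affected by a change to Parent::X, a reopened Child, or the top-level X, and tracking which is which per site is hard...hence the appeal of one global invalidation.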

Rails

Finally, regardless of how awesome a new Ruby implementation claims to be, most users will simply ask "but does it run Rails?" You can substitute your favorite framework or library, if you like...the bottom line is that an awesome Ruby implementation that doesn't run any Ruby applications is basically useless. Beware of crowing about your victory over Ruby performance before you can run code people actually care about.
Written on October 15, 2012