Wednesday, September 26, 2012

Explanation of Warnings From MRI's Test Suite

JRuby has, for some time now, run the same test suite as MRI (C Ruby, Matz's Ruby). Because not all tests pass, we use minitest-excludes to mask out the failures, and over time we unmask stuff as we fix it.

However, there are a number of warnings from the suite that are nonfatal and unmaskable. I thought I'd show them to you and tell their stories.

JRuby 1.9 mode only supports the `psych` YAML engine; ignoring `syck`

When we started implementing support for the new "psych" YAML engine that Aaron Patterson created (atop libyaml) for Ruby 1.9, we decided that we would not support the broken "syck" engine anymore. The libyaml version is strictly YAML spec compliant, and this is our contribution to ridding the world of "syck"'s broken YAML forever.

GC.stress= does nothing on JRuby

JRuby does not have direct control over the JVM's GC, and so we can't implement things like GC.stress=, which MRI uses to put the GC into "stress" mode (GCing much more frequently to better test GC stability and behavior). There are flags for the JVM to do this sort of testing, but since we don't really need to test the JVM's GC for correctness and stability, we have not exposed those flags directly.

This flag is used in a number of MRI tests to force GC to happen more often and/or to actually test GC behaviors.

SAFE levels are not supported in JRuby

JRuby does not support standard Ruby's security model, "safe levels", because we believe safe levels are a flawed, too-coarse mechanism. On JRuby, you can use standard Java security policies.

We have debated mapping the various Ruby safe levels to equivalent sets of Java security permissions, but have never gotten around to it.

GC.enable does nothing on JRuby / GC.disable does nothing on JRuby

There's no standard API on the JVM to disable the garbage collector completely, so GC.enable and GC.disable do nothing in JRuby.

It's also interesting to note that while you can request a GC run from the JVM by calling System.gc, JRuby also stubs out Ruby's GC.start. We opted to do this because GC.start is used in some Ruby libraries as a band-aid around Ruby's sometimes-slow GC, but the same call on JRuby is both unnecessary (because GC overhead is rarely a problem) and a major performance hit (because it triggers a full GC over the entire heap).

Sunday, September 16, 2012

An experiment in static compilation of Ruby: FASTRUBY!

While at GoGaRuCo this weekend, I finally made good on an experiment I had been thinking about for a while: a static compiler for Ruby. I thought I'd share it with you good people today.

First we have a simple Ruby script with a class in it:

We compile it with fastruby, and it produces two .java source files. The first implements the methods the Ruby class defines in the script and makes the same calls the Ruby code does (with some mangling for invalid Java method names, like _plus_ and _lt_). The second implements stubs for all method names seen in the script. As a result, all dynamic calls can just be virtual invocations against RObject. Classes that implement one of the methods will just work, and the call is direct. Classes that don't implement the called method will raise an error.
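Here's a hand-written Java sketch of the shape described above, to make the dispatch trick easier to see. It is not fastruby's actual output: the Ruby method (a recursive `fib`), the `RFib` and `RBoolean` classes, the `_minus_` mangling, and the error message are all assumptions for illustration. Only RObject, RFixnum, `_plus_`, and `_lt_` come from the text.

```java
// Hypothetical sketch of the stub-dispatch scheme (not fastruby's real output).
// RObject declares a stub for every method name seen in the script, so each
// Ruby call becomes an ordinary Java virtual call.
class RObject {
    RObject _plus_(RObject other)  { return undefined("+"); }
    RObject _minus_(RObject other) { return undefined("-"); }
    RObject _lt_(RObject other)    { return undefined("<"); }
    RObject fib(RObject n)         { return undefined("fib"); }

    // Classes that don't implement a called method end up here.
    RObject undefined(String name) {
        throw new RuntimeException("undefined method `" + name + "' for " + getClass().getSimpleName());
    }
}

// Cached singleton booleans, in the spirit of RKernel caching nil.
class RBoolean extends RObject {
    static final RBoolean TRUE = new RBoolean(true);
    static final RBoolean FALSE = new RBoolean(false);
    final boolean value;
    private RBoolean(boolean value) { this.value = value; }
}

class RFixnum extends RObject {
    final long value; // no overflow check, i.e. no rollover to Bignum
    RFixnum(long value) { this.value = value; }

    @Override RObject _plus_(RObject other)  { return new RFixnum(value + ((RFixnum) other).value); }
    @Override RObject _minus_(RObject other) { return new RFixnum(value - ((RFixnum) other).value); }
    @Override RObject _lt_(RObject other)    { return value < ((RFixnum) other).value ? RBoolean.TRUE : RBoolean.FALSE; }
}

// What a compiled "class Fib; def fib(n) ... end; end" might look like.
class RFib extends RObject {
    @Override RObject fib(RObject n) {
        // n < 2 ? n : fib(n - 1) + fib(n - 2)
        if (((RBoolean) n._lt_(new RFixnum(2))).value) return n;
        return fib(n._minus_(new RFixnum(1)))._plus_(fib(n._minus_(new RFixnum(2))));
    }
}
```

With this in place, `new RFib().fib(new RFixnum(10))` returns an RFixnum holding 55, while calling `_plus_` on an RFib falls through to the RObject stub and throws. Note the fresh `new RFixnum(...)` allocations on every call; there's no Fixnum cache in this sketch.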

RKernel comes with fastruby, and provides Kernel-level methods like "puts", plus methods for coercing to Java types like toBoolean and toString. It also caches some built-in singleton values like nil.

And there are a few other classes needed for this script to work. It should be easy to see how we could fill them out to do everything the equivalent Ruby classes do.

I don't have any support for a "main" method yet, so I wrote a little runner script to test it.

And away we go!

This is about 30% faster than JRuby with invokedynamic. It is not doing any bounds checking (for rolling over to Bignum), but it is also not caching 1...256 Fixnum objects like JRuby does, nor caching them in any calls along the way (note that it creates three new RFixnums for every recursion that JRuby would not recreate). I call that pretty good.

Obviously because this is designed to compile the whole system at once, we could also emit optimized versions of methods that look like they're doing math. That is yet to come, if I continue this little experiment at all.

There are also some fun possibilities here. By specifying Java types, the compiler could add normal Java methods. Implementing interfaces could be done directly. And Android applications built with this tool would be entirely statically optimizable, only shipping the small amount of code they actually call and having a very minimal runtime.

Pretty neat?

Tuesday, September 4, 2012

Avoiding Hash Lookups in a Ruby Implementation

I had an interesting realization tonight: I'm terrified of hash tables. Specifically, my work on JRuby (and even more directly, my work optimizing JRuby) has made me terrified to ever consider using a hash table in the hot path of any program or piece of code if there's any possibility of eliminating it. And what I've learned over the years is that the vast majority of execution-related (as opposed to data-related, purely dynamic-sourced lookup tables) hash tables are totally unnecessary.

Some background might be interesting here.

Hashes are a Language Designer's First Tool

Anyone who's ever designed a simple language knows that pretty much everything you do is trivial to implement as a hash table. Dynamically-expanding tables of functions or methods? Hash table! Variables? Hash table! Globals? Hash table!

In fact, some languages never graduate beyond this phase and remain essentially gobs and gobs of hash tables even in fairly recent implementations. I won't name your favorite language here, but I will name one of mine: Ruby.

Ruby: A Study in Hashes All Over the Freaking Place

As with many dynamic languages, early (for some definition of "early") implementations of Ruby used hash tables all over the place. Let's just take a brief tour through the many places hash tables are used in Ruby 1.8.7.

(Author's note: 1.8.7 is now, by most measures, the "old" Ruby implementation, having been largely supplanted by the 1.9 series which boasts a "real" VM and optimizations to avoid most hot-path hash lookup.)

In Ruby (1.8.7), all of the following are (usually) implemented using hash lookups (and of these, many are hash lookups nearly every time, without any caching constructs):
  • Method Lookup: Ruby's class hierarchy is essentially a tree of hash tables that contain, among other things, methods. Searching for a method involves searching the target object's class. If that fails, you must search the parent class, and so on. In the absence of any sort of caching, this can mean you search all the way up to the root of the hierarchy (Object or Kernel, depending on what you consider root) to find the method you need to invoke. This is also known as "slow".
  • Instance Variables: In Ruby, you do not declare ahead of time what variables a given class's object instances will contain. Instead, instance variables are allocated as they're assigned, like a hash table. And in fact, most Ruby implementations still use a hash table for variables under some circumstances, even though most of these variables can be statically determined ahead of time or dynamically determined (to static ends) at runtime.
  • Constants: Ruby's constants are actually "mostly" constant. They're a bit more like "const" in C, assignable once and never assignable again. Except that they are assignable again through various mechanisms. In any case, constants are also not declared ahead of time and are not purely a hierarchically-structured construct (they are both lexically and hierarchically scoped), and as a result the simplest implementation is a hash table (or chains of hash tables), once again.
  • Global Variables: Globals are frequently implemented as a top-level hash table even in modern, optimized languages. They're also evil and you shouldn't use them, so most implementations don't even bother making them anything other than a hash table.
  • Local Variables: Oh yes, Ruby has not been immune to the greatest evil of all: purely hash table-based local variables. A "pure" version of Python would have to do the same, although in practice no implementations really support that (and yes, you can manipulate the execution frame to gain "hash like" behavior for Python locals, but you must surrender your Good Programmer's Card if you do). In Ruby's defense, however, hash tables were only ever used for closure scopes (blocks, etc), and no modern implementations of Ruby use hash tables for locals in any way.
There are other cases (like class variables) that are less interesting than these, but this list serves to show how easy it is for a language implementer to fall into the "everything's a hash, dude!" hole, only to find they have an incredibly flexible and totally useless language. Ruby is not such a language, and almost all of these cases can be optimized into largely static, predictable code paths with nary a hash calculation or lookup to be found.
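To make the naive approach concrete, here's a minimal Java sketch of uncached method lookup as the first bullet describes it: each class is a hash table of methods plus a parent pointer, and every call probes one table per ancestor until it finds the name. `NaiveClass`, `findMethod`, and the `hashLookups` counter are hypothetical, and a `Supplier<String>` stands in for a method body.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical naive model: a class is a hash table of methods plus a parent link.
class NaiveClass {
    final String name;
    final NaiveClass superclass;
    final Map<String, Supplier<String>> methods = new HashMap<>();
    static int hashLookups = 0; // counts probes to make the cost visible

    NaiveClass(String name, NaiveClass superclass) {
        this.name = name;
        this.superclass = superclass;
    }

    // Uncached lookup: one hash probe per ancestor until the name is found.
    Supplier<String> findMethod(String methodName) {
        for (NaiveClass c = this; c != null; c = c.superclass) {
            hashLookups++;
            Supplier<String> m = c.methods.get(methodName);
            if (m != null) return m;
        }
        return null; // real Ruby would go on to method_missing
    }
}
```

Calling `to_s` on a class two levels below Object costs three hash probes every single time, with nothing remembered between calls.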

How? I'm glad you asked.

JRuby: The Quest For Fewer Hashes

If I were to sum up the past 6 years I've spent optimizing JRuby (and learning how to optimize dynamic languages) it would be with the following phrase: Get Rid Of Hash Lookups.

When I tweeted about this realization yesterday, I got a few replies back about better hashing algorithms (e.g. "perfect" hashes) and a few replies from puzzled folks ("what's wrong with hashes?"), which made me realize that it's not always apparent how unnecessary most (execution-related) hash lookups really are (and from now on, when I talk about unnecessary or optimizable hash lookups, I'm talking about execution-related hash lookups; you data folks can get off my back right now).

So perhaps we should talk a little about why hashes are bad in the first place.

What's Wrong With a Little Hash, Bro?

The most obvious problem with using hash tables is the mind-crunching frustration of finding THE PERFECT HASH ALGORITHM. Every year there's a new way to calculate String hashes, for example, that's [ better | faster | securer | awesomer ] than all precedents. JRuby, along with many other languages, actually released a security fix last year to patch the great hash collision DoS exploit so many folks made a big deal about (while us language implementers just sighed and said "maybe you don't actually want a hash table here, kids"). Now, the implementation we put in place has again been "exploited" and we're told we need to move to cryptographic hashing. Srsly? How about we just give you a crypto-awesome-mersenne-randomized hash impl you can use for all your outward-facing hash tables and you can leave us the hell alone?

But I digress.

Obviously the cost of calculating hash codes is the first sin of a hash table. The second sin is deciding how, based on that hash code, you will distribute buckets. Too many buckets and you're wasting space. Too few and you're more likely to have a collision. Ahh, the intricate dance of space and time plagues us forever.

Ok, so let's say we've got some absolutely smashing hash algorithm and foresight enough to balance our buckets so well we make Lady Justice shed a tear. We're still screwed, my friends, because we've almost certainly defeated the prediction and optimization capabilities of our VM or our CPU, and we've permanently signed over performance in exchange for ease of implementation.

It is conceivable that a really good machine can learn our hash algorithm really well, but in the case of string hashing we still have to walk some memory to give us reasonable assurance of unique hash codes. So we've already committed performance sin #1: reading from memory.

Even if we ignore the cost of calculating a hash code, which at worst requires reading some object data from memory and at best requires reading a cached hash code from elsewhere in memory, we have to contend with how the buckets are implemented. Most hash tables implement the buckets as one of the two typical list forms: an array (contiguous memory locations in one big chunk, so any element can be reached with a single indexed dereference: O(1) complexity) or a linked list (one entry chaining to the next through some sort of memory dereference, leading to O(N) complexity for searching collided entries).

Assuming we're using simple arrays, we're still making life hard for the machine since it has to see through at least one and possibly several mostly-opaque memory references. By the time we've got the data we're after, we've done a bunch of memory-driven calculations to find a chain of memory dereferences. And you wanted this to be fast?
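Here's a minimal separate-chaining hash table in Java, annotated with the memory reads a single `get` performs, to make those steps concrete. It's an illustrative sketch (a fixed 16 buckets, no resizing), not JRuby's or the JDK's implementation.

```java
// Illustrative separate-chaining table (fixed size, no resizing), annotated
// with the memory reads a single get() performs.
class ChainedTable<K, V> {
    private static final class Entry<K, V> {
        final K key;
        final int hash;
        V value;
        Entry<K, V> next;
        Entry(K key, int hash, V value, Entry<K, V> next) {
            this.key = key; this.hash = hash; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] buckets = (Entry<K, V>[]) new Entry[16];

    private int indexFor(int hash) { return (hash & 0x7fffffff) % buckets.length; }

    void put(K key, V value) {
        int h = key.hashCode();
        int i = indexFor(h);
        for (Entry<K, V> e = buckets[i]; e != null; e = e.next) {
            if (e.hash == h && e.key.equals(key)) { e.value = value; return; }
        }
        buckets[i] = new Entry<>(key, h, value, buckets[i]);
    }

    V get(K key) {
        int h = key.hashCode();      // read the key (or its cached hash) to get a hash code
        int i = indexFor(h);         // read the bucket array's length
        Entry<K, V> e = buckets[i];  // dereference into the bucket array
        while (e != null) {          // each hop is another dependent memory load
            if (e.hash == h && e.key.equals(key)) { // equals() may walk the key's contents
                return e.value;
            }
            e = e.next;
        }
        return null;
    }
}
```

Every one of those loads depends on the one before it, which is exactly what makes the lookup hard for the machine to predict.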

Get Rid Of The Hash

Early attempts (of mine and others) to optimize JRuby centered around making hashing as cheap as possible. We made sure our tables only accepted interned strings, so we could guarantee they'd already calculated and cached their hash values. We used the "programmer's hash", switch statements, to localize hash lookups closer to the code performing them, rather than trying to balance buckets. We explored complicated implementations of hierarchical hash tables that "saw through" to parents, so we could represent hierarchical method table relationships in (close to) O(1) complexity.

But we were missing the point. The problem was in our representing any of these language features as hash tables to begin with. And so we started working toward the implementation that has made JRuby actually become the fastest Ruby implementation: eliminate all hash lookups from hot execution paths.

How? Oh right, that's what we were talking about. I'll tell you.

Method Tables

I mentioned earlier that in Ruby, each class contains a method table (a hash table from method name to a piece of code that it binds) and method lookup proceeds up the class hierarchy. What I didn't tell you is that both the method tables and the hierarchy are mutable at runtime.

Hear that sound? It's the static-language fanatics' heads exploding. Or maybe the "everything must be mutable always forever or you are a very bad monkey" fanatics. Whatever.

Ruby is what it is, and the ability to mix in new method tables and patch existing method tables at runtime is part of what makes it attractive. Indeed, it's a huge part of what made frameworks like Rails possible, and also a huge reason why other more static (or more reasonable, depending on how you look at it) languages have had such difficulty replicating Rails' success.

Mine is not to reason why. Mine is but to do and die. I have to make it fast.

Proceeding from the naive implementation, there are certain truths we can hold at various times during execution:
  • Most method table and hierarchy manipulation will happen early in execution. This was true when I started working on JRuby and it's largely true now, in no small part due to the fact that optimizing method tables and hierarchies that are wildly different all the time is really, really hard (so no implementer does it, so no user should do it). Before you say it: even prototype-based languages like JavaScript that appear to have no fixed structure do indeed settle into a finite set of predictable, optimizable "shapes" which VMs like V8 can take advantage of.
  • When changes do happen, they only affect a limited set of observers. Specifically, only call sites (the places where you actually make calls in code) need to know about the changes, and even they only need to know about them if they've already made some decision based on the old structure.
So we can assume method hierarchy structure is mostly static, and when it isn't there's only a limited set of cases where we care. How can we exploit that?

First, we implement what's called an "inline cache" at the call sites. In other words, every place where Ruby code makes a method call, we keep a slot in memory for the most recent method we looked up. In another quirk of fate, it turns out most calls are "monomorphic" ("one shape") so caching more than one is usually not beneficial.

When we revisit the cache, we need to know we've still got the right method. Obviously it would be stupid to do a full search of the target object's class hierarchy all over again, so what we want is to simply be able to examine the type of the object and know we're ok to use the same method. In JRuby, this is (usually) done by assigning a unique serial number to every class in the system, and caching that serial number along with the method at the call site.

Oh, but how do we know if the class or its ancestors have been modified?

A simple implementation would be to keep a single global serial number that gets spun every time any method table or class hierarchy anywhere in the system is modified. If we assume that those changes eventually stop, this is good enough; the system stabilizes, the global serial number never changes, and all our cached methods are safely tucked away for the machine to branch-predict and optimize to death. This is how Ruby 1.9.3 optimizes inline caches (and I believe Ruby 2.0 works the same way).
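A minimal Java sketch of a monomorphic inline cache guarded by a single global serial number, assuming every method definition and every new class bumps one static counter. The call site keeps the receiver class it last saw, the method it found, and the serial at that time; for brevity the sketch passes the receiver's class directly instead of an object. `GKlass` and `GCallSite` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model: any method definition anywhere bumps one global serial.
class GKlass {
    static int globalSerial = 0;
    final GKlass superclass;
    private final Map<String, Supplier<String>> methods = new HashMap<>();

    GKlass(GKlass superclass) {
        this.superclass = superclass;
        globalSerial++; // hierarchy change
    }

    void defineMethod(String name, Supplier<String> body) {
        methods.put(name, body);
        globalSerial++; // invalidates every inline cache in the system
    }

    // Slow path: hierarchical hash lookups.
    Supplier<String> search(String name) {
        for (GKlass c = this; c != null; c = c.superclass) {
            Supplier<String> m = c.methods.get(name);
            if (m != null) return m;
        }
        throw new RuntimeException("undefined method " + name);
    }
}

// Monomorphic inline cache guarded by (class identity, global serial).
class GCallSite {
    final String name;
    GKlass cachedClass;
    Supplier<String> cachedMethod;
    int cachedSerial = -1;
    int misses = 0;

    GCallSite(String name) { this.name = name; }

    String call(GKlass receiverClass) {
        if (receiverClass != cachedClass || cachedSerial != GKlass.globalSerial) {
            misses++;
            cachedMethod = receiverClass.search(name);
            cachedClass = receiverClass;
            cachedSerial = GKlass.globalSerial;
        }
        return cachedMethod.get();
    }
}
```

Once definitions stop, the guard is two cheap comparisons. But defining a method on a completely unrelated class still forces this site back onto the slow path, which is the weakness discussed next.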

Unfortunately, our perfect world isn't quite so perfect. Methods do get defined at runtime, especially in Ruby, where people often create one-off "singleton methods" that redefine only a couple of methods for very localized use. We don't want such changes to blow all inline caches everywhere, do we?

Let's split up the serial number by method name. That way, if you are only redefining the "foobar" method on your singletons, only inline caches for "foobar" calls will be impacted. Much better! This is how Rubinius implements cache invalidation.
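Splitting the serial by method name might look like this hypothetical sketch (a simplification of the idea, not Rubinius's actual code). Each method name gets one shared, mutable serial object; a call site grabs the object for its name once, at creation, so the hot-path guard is still a field read rather than a hash lookup. Hierarchy changes such as including a module are not modeled. `NameSerial`, `NKlass`, and `NCallSite` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// One mutable serial per method name, shared by every call site for that name.
final class NameSerial {
    int value = 0;
    private static final Map<String, NameSerial> BY_NAME = new HashMap<>();
    static NameSerial forName(String name) {
        return BY_NAME.computeIfAbsent(name, n -> new NameSerial());
    }
}

class NKlass {
    final NKlass superclass;
    private final Map<String, Supplier<String>> methods = new HashMap<>();
    NKlass(NKlass superclass) { this.superclass = superclass; }

    void defineMethod(String name, Supplier<String> body) {
        methods.put(name, body);
        NameSerial.forName(name).value++; // only sites calling `name` are affected
    }

    Supplier<String> search(String name) {
        for (NKlass c = this; c != null; c = c.superclass) {
            Supplier<String> m = c.methods.get(name);
            if (m != null) return m;
        }
        throw new RuntimeException("undefined method " + name);
    }
}

class NCallSite {
    final String name;
    final NameSerial serial; // resolved once, when the site is created
    NKlass cachedClass;
    Supplier<String> cachedMethod;
    int cachedSerial = -1;
    int misses = 0;

    NCallSite(String name) {
        this.name = name;
        this.serial = NameSerial.forName(name);
    }

    String call(NKlass receiverClass) {
        if (receiverClass != cachedClass || cachedSerial != serial.value) {
            misses++;
            cachedMethod = receiverClass.search(name);
            cachedClass = receiverClass;
            cachedSerial = serial.value;
        }
        return cachedMethod.get();
    }
}
```

Redefining `foobar` leaves `to_s` sites alone, but defining `to_s` on any class, even a one-off singleton, flushes every `to_s` site in the system.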

Unfortunately again, it turns out that the methods people override on singletons are very often common methods like "hash" or "to_s" or "inspect", which means that a purely name-based invalidator still causes a large number of call sites to fail. Bummer.

In JRuby, we went through the above mechanisms and several others, finally settling on one that allows us to only ever invalidate the call sites that actually called a given method against a given type. And it's actually pretty simple: we spin the serial numbers on the individual classes, rather than in any global location.

Every Ruby class has one parent and zero or more children. The parent connection is obviously a hard link, since at various points during execution we need to be able to walk up the class hierarchy. In JRuby, we also add a weak link from parents to children, updated whenever the hierarchy changes. This allows changes anywhere in a class hierarchy to cascade down to all children, localizing changes to just that subhierarchy rather than inflicting its damage upon more global scopes.

Essentially, by actively invalidating down-hierarchy classes' serial numbers, we automatically know that matching serial numbers at call sites mean the cached method is 100% ok to use. We have reduced O(N) hierarchically-oriented hash table lookups to a single identity check. Victory!
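Here's a minimal Java sketch of the per-class scheme, under a few assumptions: serial numbers come from one system-wide counter so no two class versions ever share one, each class has a hard link to its superclass and weak links to its subclasses, and a change gives the changed class and all of its live descendants fresh serials. Because a serial identifies a single class at a single version, the call-site guard is one comparison. `HKlass` and `HCallSite` are hypothetical; JRuby's real classes handle much more (modules, threading, reparenting).

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical model of per-class serials with down-hierarchy invalidation.
class HKlass {
    private static long nextSerial = 0; // serials are unique system-wide
    final HKlass superclass;            // hard link up
    long serial = nextSerial++;
    private final Map<String, Supplier<String>> methods = new HashMap<>();
    private final List<WeakReference<HKlass>> subclasses = new ArrayList<>(); // weak links down

    HKlass(HKlass superclass) {
        this.superclass = superclass;
        if (superclass != null) superclass.subclasses.add(new WeakReference<>(this));
    }

    void defineMethod(String name, Supplier<String> body) {
        methods.put(name, body);
        invalidate();
    }

    // Fresh serials for this class and its live descendants only.
    void invalidate() {
        serial = nextSerial++;
        subclasses.removeIf(ref -> ref.get() == null); // drop collected children
        for (WeakReference<HKlass> ref : subclasses) {
            HKlass child = ref.get();
            if (child != null) child.invalidate();
        }
    }

    Supplier<String> search(String name) {
        for (HKlass c = this; c != null; c = c.superclass) {
            Supplier<String> m = c.methods.get(name);
            if (m != null) return m;
        }
        throw new RuntimeException("undefined method " + name);
    }
}

class HCallSite {
    final String name;
    long cachedSerial = -1;
    Supplier<String> cachedMethod;
    int misses = 0;

    HCallSite(String name) { this.name = name; }

    String call(HKlass receiverClass) {
        long current = receiverClass.serial;
        // One comparison: a matching serial means same class, unchanged since caching.
        if (current != cachedSerial) {
            misses++;
            cachedMethod = receiverClass.search(name);
            cachedSerial = current;
        }
        return cachedMethod.get();
    }
}
```

Patching `to_s` on an unrelated class leaves the site's cache intact; patching it on an ancestor cascades down, and the next call picks up the new method.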

Instance Variables

Optimizing method lookups actually turned out to be the easiest trick we had to pull. Instance variables defied optimization for a good while. Oddly enough, most Ruby implementations stumbled on a reasonably simple mechanism at the same time.

Ruby instance variables can be thought of as C++ or Java fields that only come into existence at runtime, when code actually starts using them. And where C++ and Java fields can be optimized right into the object's structure, Ruby instance variables have typically been implemented as a hash table that can grow and adapt to a running program as it runs.

Using a hash table for instance variables has some obvious issues:
  • The aforementioned performance costs of using hashes
  • Space concerns; a collection of buckets already consumes space for some sort of table, and too many buckets means you are using way more space per object than you want
At first you might think this problem can be tackled exactly the same way as method lookup, but you'd be wrong. What do we cache at the call site? It's not code we need to keep close to the point of use, it's the steps necessary to reach a point in a given object where a value is stored (ok, that could be considered code...just bear with me for a minute).

There are, however, truths we can exploit in this case as well.
  • A given class of objects will generally reference a small, finite number of variable names during the lifetime of a given program.
  • If a variable is accessed once, it is very likely to be accessed again.
  • The set of variables used by a particular class of objects is largely unique to that class of objects.
  • The majority of the variables ever to be accessed can be determined by inspecting the code contained in that class and its superclasses.
This gives us a lot to work with. Since we can localize the set of variables to a given class, that means we can store something at the class level. How about the actual layout of the values in object instances of that class?

This is how most current implementations of Ruby actually work.

In JRuby, as instance variables are first assigned, we bump a counter on the class that indicates an offset into an instance variable table associated with instances of that class. Eventually, all variables have been encountered and that table and that counter stop changing. Future instances of those objects, then, know exactly how large the table needs to be and which variables are located where.

Invalidation of a given instance variable "call site" is then once again a simple class identity check. If we have the same class in hand, we know the offset into the object is guaranteed to be the same, and therefore we can go straight in without doing any hash lookup whatsoever.
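A minimal Java sketch of that layout, assuming the class owns a name-to-offset table that grows as variables are first assigned, each object keeps its values in a separate `Object[]` (so reaching a value costs an extra dereference), and each `@name` access site caches the offset for the last class it saw. `IvarClass`, `IvarObject`, and `IvarSite` are hypothetical; a real implementation also has to worry about concurrent access.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// The class hands out a fixed offset per variable name, in first-seen order.
class IvarClass {
    private final Map<String, Integer> offsets = new HashMap<>(); // consulted only on cache misses

    int offsetFor(String name) {
        Integer off = offsets.get(name);
        if (off == null) {
            off = offsets.size(); // bump the "counter": next free slot
            offsets.put(name, off);
        }
        return off;
    }

    int variableCount() { return offsets.size(); }
}

class IvarObject {
    final IvarClass cls;
    Object[] values; // separate array: the extra dereference
    IvarObject(IvarClass cls) {
        this.cls = cls;
        this.values = new Object[cls.variableCount()]; // sized from what the class has seen
    }
}

// An @name access site that caches the offset for the last class it saw.
class IvarSite {
    final String name;
    IvarClass cachedClass;
    int cachedOffset;
    IvarSite(String name) { this.name = name; }

    private int offset(IvarObject self) {
        if (self.cls != cachedClass) {               // class identity check
            cachedOffset = self.cls.offsetFor(name); // slow path: table lookup
            cachedClass = self.cls;
        }
        return cachedOffset;
    }

    Object get(IvarObject self) {
        int i = offset(self);
        return i < self.values.length ? self.values[i] : null; // unassigned reads as nil
    }

    void set(IvarObject self, Object value) {
        int i = offset(self);
        if (i >= self.values.length) // instance created before this variable was first seen
            self.values = Arrays.copyOf(self.values, self.cls.variableCount());
        self.values[i] = value;
    }
}
```

Objects allocated before a new variable appeared get their array grown on first write; objects allocated afterwards start at full size.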

Rubinius does things a little differently here. Instead of tracking the offsets at runtime, the Rubinius VM will examine all code associated with a class and use that to make a guess about how many variables will be needed. It sets up a table on the class ahead of time for those statically-determined names, and allocates exactly as much space for the object's header + those variables in memory (as opposed to JRuby, where the object and its table are two separate objects). This allows Rubinius to pack those known variables into a tighter space without hopping through the extra dereference JRuby has, and in many cases, this can translate to faster access.

However, both cases have their failures. In JRuby's version, we pay the cost of a second object (an array of values) and a pointer dereference to reach it, even if we can cache the offset 100% successfully at the call site. This translates to larger memory footprints and somewhat slower access times. In Rubinius, variables that are dynamically allocated fall back on a simple hash table, so dynamically-generated (or dynamically-mutated) classes may end up accessing some values in a much slower way than others.

The quest for perfect Ruby instance variable tables continues, but at least we have the tools to almost completely eliminate hashes right now.


Constants

The last case I'm going to cover in depth is that of "constant" values in Ruby.

Constants are, as I mentioned earlier, stored on classes in another hash table. If that were their only means of access, they would be uninteresting; we could use exactly the same mechanism for caching them as we do for methods, since they'd follow the same structure and behavior (other than being somewhat more static than method tables). Unfortunately, that's not the case; constants are located based on both lexical and hierarchical searches.

In Ruby, if you define a class or module, all constants lexically contained in that type's enclosing scopes are also visible within the type. This makes it possible to define new lexically-scoped aliases for values that might otherwise be difficult to retrieve without walking a class hierarchy or requiring a parent/child relationship to make those aliases visible. It also defeats nearly all reasonable mechanisms for eliminating hash lookups.

When you access a constant in Ruby, the implementation must first search all lexically-enclosing scopes. Each scope has a type (class or module) associated, and we check that type (and not its parents) for the constant name in question. Failing that, we fall back on the current type's class hierarchy, searching all the way up to the root type. Obviously, this could be far more searching than even method lookup, and we want to eliminate it.
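The uncached search order can be sketched like this, assuming a lexical scope is a chain of links, one per enclosing `class` or `module` body. Each link's module is checked on its own (no ancestors); only then does the search climb the innermost module's superclass chain. `CModule`, `LexicalScope`, and `resolve` are hypothetical, and details like autoload and `const_missing` are ignored.

```java
import java.util.HashMap;
import java.util.Map;

class CModule {
    final String name;
    final CModule superclass; // null for modules and the root
    final Map<String, Object> constants = new HashMap<>();
    CModule(String name, CModule superclass) { this.name = name; this.superclass = superclass; }
}

// One link per `class`/`module` body the code is lexically nested in.
class LexicalScope {
    final CModule module;
    final LexicalScope parent;
    LexicalScope(CModule module, LexicalScope parent) { this.module = module; this.parent = parent; }

    Object resolve(String name) {
        // 1. Lexically enclosing scopes: each module's own table only, not its ancestors.
        for (LexicalScope s = this; s != null; s = s.parent) {
            Object v = s.module.constants.get(name);
            if (v != null) return v;
        }
        // 2. The innermost module's ancestors (the module itself was checked in step 1).
        for (CModule m = module.superclass; m != null; m = m.superclass) {
            Object v = m.constants.get(name);
            if (v != null) return v;
        }
        throw new RuntimeException("uninitialized constant " + name);
    }
}
```

For example, code inside `module Outer; class Inner < Base ... end; end` would find a constant defined in Outer through step 1 and one defined in Base through step 2, with a hash probe at every stop along the way.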

If we had all the space in the world and no need to worry about dangling references, using our down-hierarchy method table invalidation would actually work very well here. We'd simply add another hierarchy for invalidation: lexical scopes. In practice, however, this is not feasible (or at least I have not found a way to make it feasible) since there are many times more lexical scopes in a given system than there are types, and a large number of those scopes are transient; we'd be tracking thousands or tens of thousands of parent/child relationships weakly all over the codebase. Even worse, invalidation due to constant updates or hierarchy changes would have to proceed both down the class hierarchy and throughout all lexically-enclosing scopes in the entire system. Ouch!

The current state of the art for Ruby implementations is basically our good old global serial number. Change a constant anywhere in Ruby 1.9.3, Rubinius, or JRuby, and you have just caused all constant access sites to invalidate (or they'll invalidate next time they're encountered). Now this sounds bad, perhaps because I told you it was bad above for method caching. But remember that the majority of Ruby programmers advise and practice the art of keeping constants...constant. Most of the big-name Ruby folks would call it a bug if your code is continually assigning or reassigning constants at runtime; there are other structures you could be using that are better suited to mutation, they might say. And in general, most modern Ruby libraries and frameworks do keep constants constant.

I'll admit we could do better here, especially if the world changed such that mutating constants was considered proper and advisable. But until that happens, we have again managed to eliminate hash lookups by caching values based on a (hopefully rarely modified) global serial number.
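A minimal sketch of such a constant access site, assuming one global counter that every constant assignment bumps (a real implementation would also bump it on relevant hierarchy changes). The site remembers the value and the serial it saw, and reruns the full search only when the serial has moved. `ConstantSerial` and `ConstSite` are hypothetical, and a `Supplier` stands in for the lexical and hierarchical search.

```java
import java.util.function.Supplier;

// Hypothetical global invalidator: bumped whenever any constant is (re)assigned.
final class ConstantSerial {
    static int value = 0;
    static void constantAssigned() { value++; }
}

class ConstSite {
    private final Supplier<Object> slowLookup; // stands in for the full lexical + hierarchical search
    private Object cachedValue;
    private int cachedSerial = -1;
    int searches = 0;

    ConstSite(Supplier<Object> slowLookup) { this.slowLookup = slowLookup; }

    Object get() {
        if (cachedSerial != ConstantSerial.value) { // any constant change anywhere invalidates
            cachedValue = slowLookup.get();
            cachedSerial = ConstantSerial.value;
            searches++;
        }
        return cachedValue;
    }
}
```

As long as constants stay constant, every access after the first is a single integer comparison and a field read.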

The Others

I did not go into the others because the solutions are either simple or not particularly interesting.

Local variables in any sane language (flame on!) are statically determinable at parse/compile time (rather than being dynamically scoped or determined at runtime). In JRuby, Ruby 1.9.3, and Rubinius, local variables are in all cases a simple tuple of offset into an execution frame and some depth at which to find the appropriate frame in the case of closures.
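A sketch of that (depth, offset) addressing, assuming each frame stores locals in an array and a block's frame links to the frame it was created in. The compiler would compute both numbers at parse time, so a runtime access is a few pointer hops plus an array index, with no hashing. `Frame` and `LocalRef` are hypothetical names.

```java
// Hypothetical frame layout: locals live in an array; closures link to their defining frame.
final class Frame {
    final Object[] locals;
    final Frame parent; // enclosing scope for a block; null for a method body
    Frame(int size, Frame parent) {
        this.locals = new Object[size];
        this.parent = parent;
    }
}

// A local variable reference resolved at compile time to (depth, offset).
final class LocalRef {
    final int depth, offset;
    LocalRef(int depth, int offset) { this.depth = depth; this.offset = offset; }

    private Frame frameFor(Frame current) {
        Frame f = current;
        for (int i = 0; i < depth; i++) f = f.parent; // hop out to the defining scope
        return f;
    }

    Object get(Frame current) { return frameFor(current).locals[offset]; }
    void set(Frame current, Object value) { frameFor(current).locals[offset] = value; }
}
```

For `x = 1; [41].each { |y| x += y }`, `x` would be (0, 0) in the method body and (1, 0) inside the block, while `y` is (0, 0) in the block.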

Global variables are largely discouraged, and usually only accessed at boot time to prepare more locally-defined values (e.g. configuration or environment variable access). In JRuby, we have experimented with mechanisms to cache global variable accessor logic in a way similar to instance variable accessors, but it turned out to be so rarely useful that we never shipped it.

Ruby also has another type of variable called a "class variable", which follows lookup rules almost identical to methods. We don't currently optimize these in JRuby, but it's on my to-do list.

Final Words

There are of course many other ways to avoid hash lookups, with probably the most robust and ambitious being code generation. Ruby developers, JIT compiler writers, and library authors have all used code generation to take what is a mostly-static lookup table and turn it into actually-static code. But you must be careful here to not fall into the trap of simply stuffing your hash logic into a switch table; you're still doing a calculation and some kind of indirection (memory dereference or code jump) to get to your target. Analyze the situation and figure out what immutable truths you can exploit, and you too can avoid the evils of hashes.