Another Year, Another Interpreter

It's been approximately a year since I launched into my first redesign of the JRuby interpreter engine, and a lot has happened since then. We've gotten Rails working and improved compatibility to the point where most other pure Ruby apps just work. We've filled out the base set of extensions and have active projects to complete what's left. We've built up a very active and intelligent community, all of whom are helping by contributing bug reports and patches. And perhaps most importantly for the project, Tom and I have been given the privilege of working on JRuby full time.

But none of that was because of the interpreter.

JRuby Interpreter History 101

The original JRuby interpreter, written by the original authors (Jan Arne Petersen and others), followed a pretty standard Visitor pattern. The AST nodes were visited in turn, and on each callback appropriate actions were performed. It was a perfect demonstration of Visitor in action, but the additional calls required to visit each node in turn badly impacted the Java call stack. At its lowest, before I began my work, JRuby on Windows under Java 1.4.2 could recurse no deeper than 300 levels in a simple single-recursive fib algorithm. And this while Ruby itself could go into the thousands.
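
For readers who want to picture the shape of that design, here is a minimal sketch of a Visitor-style evaluator. The node and visitor names are hypothetical and cover only integer literals and addition; this is not the original JRuby code. The point to notice is that evaluating each child costs two Java frames, one for accept and one for the visit callback, which is where the stack pressure comes from.

```java
// Hypothetical miniature AST; these names are illustrative, not JRuby's classes.
interface Node {
    <T> T accept(NodeVisitor<T> visitor);
}

interface NodeVisitor<T> {
    T visitLiteral(LiteralNode node);
    T visitAdd(AddNode node);
}

final class LiteralNode implements Node {
    final int value;
    LiteralNode(int value) { this.value = value; }
    public <T> T accept(NodeVisitor<T> visitor) { return visitor.visitLiteral(this); }
}

final class AddNode implements Node {
    final Node left, right;
    AddNode(Node left, Node right) { this.left = left; this.right = right; }
    public <T> T accept(NodeVisitor<T> visitor) { return visitor.visitAdd(this); }
}

// Each child evaluation goes through accept() and then a visit callback,
// so the Java stack grows by two frames per level of AST nesting.
final class EvalVisitor implements NodeVisitor<Integer> {
    public Integer visitLiteral(LiteralNode node) { return node.value; }
    public Integer visitAdd(AddNode node) {
        return node.left.accept(this) + node.right.accept(this);
    }
}

class VisitorDemo {
    static int eval(Node root) { return root.accept(new EvalVisitor()); }

    public static void main(String[] args) {
        Node tree = new AddNode(new LiteralNode(1),
                new AddNode(new LiteralNode(2), new LiteralNode(3)));
        System.out.println(eval(tree));
    }
}
```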

However, stack depth alone wasn't a good enough reason to work on a new interpreter. There were various ways we could clean up the stack size without a complete rework. Unfortunately, there were two Ruby features that at the time meant a new interpreter would be necessary: green threads and continuations.

Green Rubies

Ruby currently supports only green threading. The threading model is very simple; one native thread, the main process, pumps the entire interpreter. Once threads come into play, a signal or timer is set up using OS-specific calls. A new thread means a new set of global variables; variables that point into the thread's call stack, variable scopes, and current execution point. The signal/timer, once initiated, triggers a potential thread switch every 10ms by setting a global switch. On certain boundaries during execution, that switch is checked, and if it is set a call is made into the thread scheduler to set up the next thread for execution.

In short, a 10ms timeslicing green thread scheduler. And I believe that thread context switching is *voluntary*...it would be possible in a C extension or through specific Ruby scripts to prevent context switching at all.

But I digress. Green threading does carry with it some benefits. Because you are not dependent on the operating system for initialization and cleanup of thread data, you can manipulate threads much more easily. Ruby supports operations that native threading libraries forbid, like killing existing threads or stopping them indefinitely. Now it would not be such a problem if these operations were used sparingly, but during our testing we found that many libraries depend on them. So then we had a dilemma: how to support unsafe threading operations with safe native threads?

The initial solution, and the one we currently have in place, works reasonably well but could certainly be improved. On the same boundaries that the C implementation checks for a context switch, we check a set of runtime-global variables and locks. Various "unsafe" thread events trigger those locks and variables to change state. Kill sets a kill flag; upon encountering it the native JRuby thread will throw a ThreadKill exception. Entering a critical section sets a critical flag on all threads; when they reach it, they'll freeze in their tracks until the critical thread has completed. For the most part, the Ruby code behaves as it does in MRI, though backed by native threads. It's not foolproof, but it's sneaky and good enough for now. Emulation and simulation. Subterfuge and legerdemain.
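
The kill-flag half of that scheme can be sketched roughly as follows. Everything here is an assumption made for illustration: the class RubyThreadSketch, the methods kill and pollEvents, and the way ThreadKill is caught are invented names, not JRuby's actual API. What the sketch does show is the core trick: the killer only sets a flag, and the victim raises the exception itself the next time it reaches a check boundary.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical exception type, named after the one mentioned in the text.
final class ThreadKill extends RuntimeException {
    ThreadKill() { super("killed at a safe point"); }
}

// Hypothetical wrapper around a native thread; not JRuby's real thread class.
final class RubyThreadSketch {
    private final AtomicBoolean killRequested = new AtomicBoolean(false);
    private volatile boolean unwoundByKill = false;
    private final Thread nativeThread;

    RubyThreadSketch(Consumer<RubyThreadSketch> body) {
        nativeThread = new Thread(() -> {
            try {
                body.accept(this);
            } catch (ThreadKill kill) {
                unwoundByKill = true; // the thread unwound itself; nothing was stopped abruptly
            }
        });
    }

    void start() { nativeThread.start(); }

    // Called from other threads: only records the request.
    void kill() { killRequested.set(true); }

    // Called by the running thread at its check boundaries.
    void pollEvents() {
        if (killRequested.get()) throw new ThreadKill();
    }

    // Waits up to the given time and reports whether the thread ended via ThreadKill.
    boolean awaitKilled(long millis) {
        try {
            nativeThread.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return unwoundByKill && !nativeThread.isAlive();
    }
}

class KillFlagDemo {
    public static void main(String[] args) {
        RubyThreadSketch t = new RubyThreadSketch(self -> {
            while (true) {
                self.pollEvents(); // boundary check between units of work
                Thread.yield();
            }
        });
        t.start();
        t.kill();
        System.out.println(t.awaitKilled(5000));
    }
}
```

A thread that never reaches a boundary never sees the flag, which is one reason the approach is "good enough" rather than foolproof.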

A New Hope

When I launched into my redesign a year ago, I hoped to enable a different solution. I wanted to build a purely stackless Ruby interpreter that could support continuations and green threads. Because it would still run on top of a native-threaded VM, I planned to also support m:n threading, so JRuby could scale through a full range of threading options. And even though it was a very large task, the basic building blocks were already necessary: continuations would not be possible while deepening the Java stack (without really nasty tricks), and so green threading came along for the ride.

The first step was to make the interpreter stackless. This I mostly achieved by hacking the existing visitor into individual pieces. Each method that had previously been called during visitation was encapsulated in an Instruction implementation. The state associated with the interpreter was put into an EvaluationState object. Each instruction, when called, would receive an instance of EvaluationState and an InstructionContext that was usually just the AST node being traversed. The visitor now became a factory for Instructions; as it visited the AST nodes, instructions were pushed down into an instruction stack in the EvaluationState object; the previous deepening of the Java stack being emulated via a soft stack. As traversal went deeper, both the visitor and the instructions themselves would push more instructions down onto the stack. Execution meant popping the top instruction, running it, and returning back to the interpreter loop. An instruction could affect the state of the runtime or change the flow of execution by conditionally pushing further instructions on the stack. It was a fairly clever way to directly translate the old code into a stackless design.
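
A toy version of that soft-stack design might look like the sketch below. The names Instruction and EvaluationState come from the description above, but their bodies, the tiny SLit/SAdd node types, and the separate value stack are assumptions made to keep the example small. The interpreter loop pops one instruction, runs it, and returns to the loop, so the Java stack stays flat however deep the tree is.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical miniature AST: integer literals and addition only.
abstract class SNode {}
final class SLit extends SNode {
    final int value;
    SLit(int value) { this.value = value; }
}
final class SAdd extends SNode {
    final SNode left, right;
    SAdd(SNode left, SNode right) { this.left = left; this.right = right; }
}

interface Instruction {
    void execute(EvaluationState state, Object context);
}

final class EvaluationState {
    private final Deque<Instruction> instructions = new ArrayDeque<>();
    private final Deque<Object> contexts = new ArrayDeque<>();
    final Deque<Integer> values = new ArrayDeque<>();

    void push(Instruction instruction, Object context) {
        instructions.push(instruction);
        contexts.push(context);
    }

    // The interpreter loop: pop, run, come back. No recursion on the Java stack.
    int run(SNode root) {
        push(Instructions.VISIT, root);
        while (!instructions.isEmpty()) {
            Instruction next = instructions.pop();
            next.execute(this, contexts.pop());
        }
        return values.pop();
    }
}

final class Instructions {
    // Combines the two operand values left behind by earlier instructions.
    static final Instruction ADD = (state, ctx) -> {
        int right = state.values.pop();
        int left = state.values.pop();
        state.values.push(left + right);
    };

    // Schedules work on the soft stack instead of recursing into children.
    static final Instruction VISIT = (state, ctx) -> {
        if (ctx instanceof SLit) {
            state.values.push(((SLit) ctx).value);
        } else {
            SAdd add = (SAdd) ctx;
            state.push(Instructions.ADD, add);         // runs last
            state.push(Instructions.VISIT, add.right); // runs second
            state.push(Instructions.VISIT, add.left);  // runs first
        }
    };
}

class StacklessDemo {
    // Builds (((0 + 1) + 1) + ...) nested `depth` levels deep, iteratively.
    static SNode deepChain(int depth) {
        SNode node = new SLit(0);
        for (int i = 0; i < depth; i++) {
            node = new SAdd(node, new SLit(1));
        }
        return node;
    }

    public static void main(String[] args) {
        System.out.println(new EvaluationState().run(deepChain(100_000)));
    }
}
```

A recursive evaluator would be at real risk of a StackOverflowError on a tree nested 100,000 levels deep with default JVM settings; here the depth lives in heap-allocated deques instead, at the cost of the extra bookkeeping.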

As of this past spring, perhaps 80% of the AST nodes were fully represented with what I called "collapsed" instructions. This basically meant that for much of the AST, execution could proceed without the Java stack deepening. The tradeoff was in the additional overhead of maintaining a soft stack in the EvaluationState and a considerably more complicated interpreter design. But it worked, and after several months of tweaking it was faster than the original. Even better, it greatly improved our maximum stack depth.

The next step, had we taken it, was to begin the painful task of making the rest of the interpreter stackless. The design I had in mind would have called for all Java-implemented methods to trampoline; in other words, calling a method would mean that whether it made an additional call or returned, control came back to the caller. The caller, in that design, would have been the overarching ThreadContext. Execution of a thread would begin by pointing a new ThreadContext at a starting node in the AST. As the thread executed, the stackless interpreter would maintain the Java stack depth at the ThreadContext level. Calls into Java-based methods would dispatch back to ThreadContext for deeper calls, so that any deepening of the Ruby stack did not actually deepen the Java stack. At any given time, ThreadContext could be frozen; the native thread pumping it could then be passed off to another ThreadContext, or used for other purposes. And of course, continuations would have been trivial to implement with this model, since the current running state of a ThreadContext could be saved at any time.
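
Trampolining in general can be illustrated with a small sketch like this one. It is a generic version of the technique, not the design that was planned for JRuby: Bounce and TrampolineSketch are hypothetical names, the run loop merely stands in for the role ThreadContext would have played, and only tail calls are handled (a full version would also need an explicit stack of pending callers).

```java
import java.util.function.Supplier;

// A step either carries a final result or describes the next call to make.
final class Bounce<T> {
    final T result;
    final Supplier<Bounce<T>> next;

    private Bounce(T result, Supplier<Bounce<T>> next) {
        this.result = result;
        this.next = next;
    }

    static <T> Bounce<T> done(T result) { return new Bounce<>(result, null); }
    static <T> Bounce<T> call(Supplier<Bounce<T>> next) { return new Bounce<>(null, next); }
}

final class TrampolineSketch {
    // The driver loop: every "call" returns here before the next one starts,
    // so the Java stack never grows with the logical call depth.
    static <T> T run(Bounce<T> step) {
        while (step.next != null) {
            step = step.next.get();
        }
        return step.result;
    }

    // Sums 1..n. Instead of calling itself directly, it hands the next call
    // back to the driver.
    static Bounce<Long> sumTo(long n, long acc) {
        if (n == 0) return Bounce.done(acc);
        return Bounce.call(() -> sumTo(n - 1, acc + n));
    }

    public static void main(String[] args) {
        System.out.println(run(sumTo(1_000_000L, 0L)));
    }
}
```

Because the pending work is always an ordinary object held by the driver, the loop can stop between any two bounces, which is the property that would have made freezing a thread or capturing a continuation straightforward.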

Whew! It was an ambitious goal, but with the mostly-stackless interpreter in place we were well on the way. I even had designs for how to compile Ruby code into stackless Java methods...so stackless that you could even pull off a continuation in the middle of a call. Like magic!

A Fork in the Road

So what happened? Sometime around JavaOne we heard about the Ruby KaiGi in Japan, a Ruby conference or get-together of some sort. If RubyConf is the big conference for us Westerners, this at least provided a mid-year update for English-speaking Rubyists. Matz was there, Koichi was there, and I believe other Ruby dignitaries made the trip as well.

And then Matz and Koichi dropped the bomb: Ruby 2.0 would support neither continuations nor green threads.

As many of you know, Koichi's next-generation Ruby interpreter engine, YARV, has been coming along in fits and starts. Because of the desire to keep existing extensions working and because of a lack of resources to work on Ruby internals, YARV has had many difficult problems to overcome. How do you write a next-generation interpreter without actually changing how the runtime works? Is it possible to do it without a new threading implementation, a new memory manager, a new garbage collector? Can you keep all your internal APIs exposed to C extensions and successfully migrate to a new interpreter design? I think the answer is that yes, you can, but it's really, really hard. And along with the announcement about continuations and green threads came the YARV/Rite (Ruby 2.0 + YARV) beta timeline: no earlier than Christmas 2007.

So we were faced with a decision. Do we continue along the much more complicated path toward green threads and continuations, knowing that they will eventually be unsupported by the language? Is it worth the effort to support them now if they'll go away?

We stewed over this decision for a week or two. We had been making all the right moves toward a stackless design, and were on the cusp of launching into the next set of battles. But it was a painful process.

Pragmatism

Eventually we decided that under the circumstances, the Ruby and JRuby communities would be better served by having a solid, native-threaded, continuation-free implementation on the JVM. We were unable to find any use of continuations in the most popular applications, and it seemed that very few people had a good reason to use them. In addition, there was growing concern over Ruby's lack of support for native threads; so there too was an opportunity for JRuby to shine.

Very little of the interpreter changed over the summer. We worked furiously in our spare hours getting Rails running, and were able to present it in a primitive form at JavaOne. I started work on a number of traditional compiler designs for JRuby that showed potential gains of 50-75% over the existing code. And most importantly, our community, compatibility, and test library all began to grow rapidly.

Toward the end of the summer, it became likely that we would join Sun Microsystems as full-time JRuby developers. Sun had wisely taken an interest in Ruby along with other scripting languages, and because we had shown some of the potential for Ruby on the JVM, they asked us to join their team. And just four weeks ago, I became a Sun employee.

JRuby Now and Into the Future

So where has all this led? JRuby has been getting more and more attention from folks within Sun, Rubyists around the world, and especially from Java developers anxious to escape from their Java-only prisons. Our compatibility is increasing faster than before; we've had over a hundred new bugs reported in the past few weeks...almost all of them with community-contributed patches. We have added our first non-Sun team member, Ola Bini, a star of the JRuby community who has proven his dedication to making Ruby on the JVM succeed. And we have started to solidify our short-term goals for the project.

The primary goal remains the same: JRuby should be as close to 100% compatible with Ruby 1.8 as possible. Today, we are doing extremely well in this department. Rails generally works without issues, and most pure Ruby applications run without modification. There are still plenty of edge cases to iron out, but we're moving very rapidly now. Getting Rails to work as well as it does is already a major achievement.

We have also started to iron out what a JRuby 1.0 release should look like. A few major points come up again and again:
  • Compatibility should be such that we can safely claim "Rails is supported"
  • Java integration should look like we want it to look for the future, and should be performant, lightweight, and seamless
  • All major codebase refactorings should be complete; this includes a solid design for wiring up Java-based method implementations, external extensions, and IO channels
  • Unicode should be supported out-of-the-box, giving Ruby code access to everything Java is capable of
  • Threading should work perfectly, both for JRuby-launched threads and for "adopted" threads from outside the runtime
  • Performance should be markedly improved

This last bullet may sound a little less ambitious than the others. Performance has been a major concern of mine, but it's also the most difficult goal to quantify. If JRuby runs Rails almost as fast as Ruby, is that enough? Well, perhaps, but Rails is a very IO-intensive application, so it's not a particularly good measure of core JRuby performance. If JRuby is many times slower than Ruby for certain command-line tools, should a release be delayed? Perhaps, but it's not so cut-and-dried; a JRuby that's slow on the command-line may perform exceedingly well once hosted on a long-running server.

There is one piece of the puzzle, however, that performance neatly excludes: vastly complicated stackless green-threaded continuable interpreter engines.

Late-Night Hacking

We had basically made the decision not to continue down the green threading path this summer, but we had made no effort to reverse the work I had already done. We always suspected that a more traditional engine would perform better, but attempting another redesign was a very large task nobody wanted to tackle.

Last night I got bored of hunting around for juicy performance tidbits. I wanted to tackle something bigger.

About 9:00PM on Thursday, I started rewriting the interpreter engine to be a more straightforward switch-based affair. Instead of Instruction stacks and EvaluationState, or even Visitors, it would be a fast, concise switch statement, recursing as the AST deepened and affecting state changes along the way. I felt a certain sadness ripping up the old interpreter; I had spent weeks on that code last year, and was proud of what I'd accomplished. It's no small feat to turn a recursive, visitor-based interpreter into a stackless, instruction-based machine. Unfortunately, I knew that design would never serve JRuby the way we needed it to. It was designed for another purpose, and its complexity and cost gained us very little in the long term. And as of 9:00PM Friday, after working furiously for 24 hours, it was gone.

The new interpreter is, as I stated, a large switch statement. Each AST node is assigned an int, so the main "eval" call can quickly determine the correct code to execute for each. The new design does result in faster deepening of the Java stack (about a 15-20% hit to fib max-depth), but it still performs far better than the original visitor-based implementation. And although we'd only theorized up to this point, it does perform a good bit better than the Instruction-based engine.
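
In miniature, a switch-based evaluator looks something like the sketch below. The node layout (SwNode with an int type tag and up to three children) and the three node kinds are assumptions for illustration, not the real JRuby AST. Compared with the Visitor form, each level of nesting costs one Java frame rather than two, and dispatch is a single switch on an int.

```java
// Hypothetical node layout: each node carries an int tag that eval switches on.
final class SwNode {
    static final int LITERAL = 0, ADD = 1, IF = 2;

    final int type;
    final int value;      // used by LITERAL only
    final SwNode a, b, c; // children; unused slots are null

    private SwNode(int type, int value, SwNode a, SwNode b, SwNode c) {
        this.type = type;
        this.value = value;
        this.a = a;
        this.b = b;
        this.c = c;
    }

    static SwNode literal(int value) { return new SwNode(LITERAL, value, null, null, null); }
    static SwNode add(SwNode left, SwNode right) { return new SwNode(ADD, 0, left, right, null); }
    static SwNode ifNonZero(SwNode cond, SwNode thenBranch, SwNode elseBranch) {
        return new SwNode(IF, 0, cond, thenBranch, elseBranch);
    }
}

final class SwitchInterpreter {
    // One recursive method; the switch picks the code for each node type.
    static int eval(SwNode node) {
        switch (node.type) {
            case SwNode.LITERAL:
                return node.value;
            case SwNode.ADD:
                return eval(node.a) + eval(node.b);
            case SwNode.IF:
                return eval(node.a) != 0 ? eval(node.b) : eval(node.c);
            default:
                throw new IllegalArgumentException("unknown node type " + node.type);
        }
    }

    public static void main(String[] args) {
        SwNode program = SwNode.add(
                SwNode.literal(2),
                SwNode.ifNonZero(SwNode.literal(0), SwNode.literal(10), SwNode.literal(5)));
        System.out.println(eval(program));
    }
}
```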

Rake Installation Performance

Pretty much the first thing anyone does with Ruby is install some application or tool they need. And nine times out of ten, they install it with RubyGems.

One of our favorite benchmarks is a local RubyGems install of Rake, Ruby's answer to make. For reasons which are not yet clear, the documentation-generator RDoc--which is called as part of a RubyGem installation--performs extremely poorly under JRuby. RDoc is basically implemented as a series of source-code parsers that generate documentation based on special comment tags embedded in the code. And it's written in pure Ruby, so it's a good benchmark for JRuby.

We've been steadily improving performance. The following sets of numbers show our progress, all the way up through the new interpreter:

JRuby 0.9.0, stackless interpreter, Java 6:
real 1m 58.465s
user 2m 1.671s
sys 0m 2.625s

JRuby 0.9.1 current trunk code, stackless interpreter, Java 6:
real 1m 10.488s
user 1m 13.075s
sys 0m 2.013s

JRuby headius branch, new interpreter, Java 6:
real 1m 0.489s
user 1m 5.220s
sys 0m 1.849s
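
To put the wall-clock ("real") figures in proportion, a quick calculation like the following turns them into speedup ratios. The class and method names are just for illustration; the only inputs are the three "real" times listed above.

```java
final class RakeTimings {
    // Converts a time given as minutes plus seconds into seconds.
    static double seconds(int minutes, double secs) {
        return minutes * 60 + secs;
    }

    // How many times faster the "after" run is than the "before" run.
    static double speedup(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        double release090 = seconds(1, 58.465); // JRuby 0.9.0
        double trunk = seconds(1, 10.488);      // 0.9.1 trunk, stackless interpreter
        double branch = seconds(1, 0.489);      // headius branch, new interpreter

        System.out.printf("0.9.0 -> trunk:  %.2fx%n", speedup(release090, trunk));
        System.out.printf("trunk -> branch: %.2fx%n", speedup(trunk, branch));
        System.out.printf("0.9.0 -> branch: %.2fx%n", speedup(release090, branch));
    }
}
```

The ratios come out at roughly 1.68x from 0.9.0 to trunk, 1.17x from trunk to the new interpreter, and 1.96x overall.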


What's Next for Performance

The numbers above spell it out pretty clearly; we've managed to nearly double performance since the 0.9.0 release at the beginning of the summer (1m58s down to about 1m0s for the same install). We've also managed to do that without writing a compiler and with a vast number of optimizations still on the table. Things are looking very good. Among future performance-enhancing changes:
  • Reducing the cost of method calls. Currently a given method call generates an absurd number of transient objects, like arrays copied into ArrayLists unwrapped into arrays, again and again. We must clean up the call path to minimize object churn.
  • Building more pre-optimization into static structures in the system. Many areas of the system are cloned or regenerated again and again without ever changing. Blocks, for example, have only 5 mutable fields...but we clone the entire block for every invocation. By saving off the static bits of the system once, we lower the cost of all operations.
  • ThreadContext is still alive and well; it has always represented the Ruby thread within the JRuby runtime, holding stacks for framing calls, scoping variables, and managing blocks. Unfortunately, the only way to access it is through a moderately expensive threadlocal call, and believe me we hit that call hard. Part of the new interpreter design helps limit those calls; further work in the rest of the runtime will help eliminate them.
  • The bytecode compiler *will* happen. It's been on hold primarily because the runtime is still evolving. Because even compiled Ruby code will likely still have ties to the JRuby runtime, we need to first iron out how the runtime should look and act long-term.
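
The thread-local point in the list above can be made concrete with a small sketch. ContextSketch is a hypothetical stand-in for ThreadContext, and the fib methods exist only to generate a lot of calls. The first version looks the context up on every call; the second fetches it once at the entry point and passes it down as a parameter, which is the general direction described for the new interpreter.

```java
// Hypothetical stand-in for the per-thread runtime state described above.
final class ContextSketch {
    int callDepth;
}

final class ContextAccess {
    private static final ThreadLocal<ContextSketch> CURRENT =
            ThreadLocal.withInitial(ContextSketch::new);

    // Looks the context up on every call: simple, but each call pays for
    // the thread-local access.
    static int fibLookup(int n) {
        ContextSketch ctx = CURRENT.get();
        ctx.callDepth++;
        int result = n < 2 ? n : fibLookup(n - 1) + fibLookup(n - 2);
        ctx.callDepth--;
        return result;
    }

    // Receives the context as an argument, so no lookup happens on the hot path.
    static int fibPassed(ContextSketch ctx, int n) {
        ctx.callDepth++;
        int result = n < 2 ? n : fibPassed(ctx, n - 1) + fibPassed(ctx, n - 2);
        ctx.callDepth--;
        return result;
    }

    // Entry point: one thread-local access, then the context is threaded through.
    static int fib(int n) {
        return fibPassed(CURRENT.get(), n);
    }

    public static void main(String[] args) {
        System.out.println(fibLookup(20));
        System.out.println(fib(20));
    }
}
```

Both versions compute the same result; the difference is only in how often the thread-local is consulted.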

I've had a great time this past year working on JRuby, and things are really starting to pick up speed. I'm sure the next year's going to be even crazier...especially once JRuby 1.0 comes around the corner.

Thanks all for your support. I'm having a blast.
Written on October 7, 2006