Saturday, May 11, 2013

On Languages, VMs, Optimization, and the Way of the World

I shouldn't be up this late, but I've been doing lots of thinking and exploring tonight.

In studying various VMs over the past few years, I've come up with a list of dos and don'ts that make things optimize right. These apply to languages, the structures that back them, and the VMs that optimize those languages, and from what I've seen there are a lot of immutable truths here given current optimization technology.

Let's dive in.

#1: Types don't have to be static


JVM and other dynamic-optimizing runtimes have proven this out. At runtime, it's possible to gather the same information static types would provide you at compile time, leading to optimizations at least as good as fully statically-typed, statically-optimized code. In some cases, it may be possible to do a better job, since runtime profiling is based on real execution, real branch percentages, real behavior, rather than a guess at what a program might do. You could probably make the claim that static optimization is a halting problem, and dynamic optimization eventually can beat it by definition since it can optimize what the program is actually doing.
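As a toy illustration of what "gathering the same information at runtime" means (not how any real VM is implemented; `CallSiteProfile` is a made-up name), a runtime can simply count the receiver classes it observes at a call site and notice when the site has proven monomorphic, at which point it can specialize as aggressively as a static compiler would:

```ruby
# Toy sketch of runtime type profiling: tally the receiver classes observed
# at a "call site". Once only one class has ever been seen, a JIT could
# compile that site as tightly as statically-typed code, with a guard.
class CallSiteProfile
  def initialize
    @seen = Hash.new(0)
  end

  def record(receiver)
    @seen[receiver.class] += 1
  end

  def monomorphic?
    @seen.size == 1
  end
end

profile = CallSiteProfile.new
1000.times { profile.record("a string") }
profile.monomorphic?  # => true: only String ever showed up here
```

Real profiles also capture branch frequencies and actual hot paths, which is where the "better than static" potential comes from.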

However, this requires one key thing to really work well.

#2: Types need to be predictable


In order for runtime optimization to happen, objects need to have predictable types and those types need to have a predictable structure. This isn't to say that types must be statically declared...they just need to look the same on repeat visits. If objects can change type (Smalltalk's become, Perl's and C's weak typing) you're forced to include more guards against those changes, or you're forced to invalidate more code whenever something changes (or in the case of C, you just completely shit the bed when things aren't as expected). If change is possible and exposed at a language level, there may be nothing you can do to cope with all those different type shapes, and optimization can only go so far.
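Ruby itself is a handy example of the problem. A type's method table can change at any moment, so any compiled assumption about a method is only valid until someone reopens the class (`shout` here is a hypothetical method, just for illustration):

```ruby
# A Ruby class's method table is mutable at any time, so a VM that inlined
# String#shout must guard against (or be invalidated by) redefinition.
class String
  def shout; upcase + "!"; end
end

"hi".shout  # => "HI!"  -- a VM might happily inline this...

class String
  def shout; reverse; end  # ...until the method is silently replaced
end

"hi".shout  # => "ih"  -- any code compiled against the old definition is now wrong
```

A VM coping with this either checks a "class serial number" guard on every optimized call or deoptimizes all affected code when a redefinition happens.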

This applies both to the shape of a type's method table (methods remaining consistent once encountered) and the shape of the type's instances (predictable object layout). Many dynamically-typed languages impose dynamic type shape and object shape on the VMs that run them, preventing those VMs from making useful predictions about how to optimize code. Optimistic predictions (generating synthetic types for known type shapes or preemptively allocating objects based on previously-seen shapes) still have to include fallback logic to maintain the mutable behavior, should it ever be needed. Again, optimization potential is limited, because the shape of the world can change on a whim and the VM has to be vigilant.
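The object-shape half of the problem is just as easy to trip over. In Ruby, instance variables exist only once assigned, so two instances of the same class can have different layouts (`Point` is a made-up example class):

```ruby
# Two instances of the same Ruby class can have different object shapes:
# instance variables only come into existence when assigned, so a VM can't
# lay out a Point's fields ahead of time the way it could for a Java class.
class Point
  def initialize(x, y, label = nil)
    @x = x
    @y = y
    @label = label if label  # only some Points ever grow an @label slot
  end
end

a = Point.new(1, 2)
b = Point.new(3, 4, "labeled")
a.instance_variables  # => [:@x, :@y]
b.instance_variables  # => [:@x, :@y, :@label]
```

This is exactly the situation that pushes VMs toward hidden-class/shape-tracking schemes: predict the common layout, and fall back (with a guard) when an instance diverges.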

The alternative summation of #1 and #2 is that types don't have to be statically declared, but they need to be statically defined. Most popular dynamic languages do neither, but all they really need to do is the latter.

#3: You can't cheat the CPU


Regardless of how clever you'd like to be in your code or language or VM or JIT, the limiting factor is how modern CPUs actually run your code. There's a long list of expectations you must meet to squeeze every last drop of speed out of a system, and diverging from those guidelines will always impose a penalty. This is the end...the bottom turtle...the unifying theory. It is, at the end of the day, the CPU you must appease to get the best performance. All other considerations fall out of that, and anywhere performance does not live up to expectations you are guaranteed to discover that someone tried to cheat the CPU.

Traditionally, static typing was the best way to guarantee we produced good CPU instructions. It gave us a clear picture of the world we could ponder and meditate over, eventually boiling out the secrets of the universe and producing the fastest possible code. But that always assumed a narrow vision of a world with unlimited resources. It assumed we could make all the right decisions for a program ahead of time and that no limitations outside our target instruction set would ever affect us. In the real world, however, CPUs have limited cache sizes, multiple threads, bottlenecked memory pipelines, and basic physics to contend with (you can only push so many electrons through a given piece of matter without blowing it up). Language and VM authors ignore the expectations of their target systems only at great peril.

Let's look at a few languages and where they fit.

Language Scorecard


Java is statically typed and types are of a fixed shape. This is the ideal situation mostly because of the type structure being predictable. Once encountered, a rose is just a rose. Given appropriate dynamic optimizations, there's no reason Java code can't compete with or surpass statically-typed and statically-compiled C/++, and in theory there's nothing preventing Java code from becoming optimal CPU instructions.

Dart is dynamically typed (or at least, types are optional and the VM doesn't care about them), but types are of a fixed shape. If programmers can tolerate fixed-shape types, Dart provides a very nice dynamic language that still can achieve the same optimizations as statically-typed Java or statically-compiled C/++.

Groovy is dynamically typed with some inference and optimization if you specify static types, but most (all?) types defined in Groovy are not guaranteed to be a fixed shape. As a result, even when specifying static types, guards must be inserted to check that those types' shapes have not changed. Groovy does, however, guarantee object shape is consistent over time, which avoids overhead from being able to reshape objects at runtime.

Ruby and JavaScript are dynamically typed and types and objects can change shape at runtime. This is a confluence of all the hardest-to-optimize language characteristics. In both cases, the best we can do is to attempt to predict common type and object shapes and insert guards for when we're wrong, but it's not possible to achieve the performance of a system with fully-predictable type and object shapes. Prove me wrong.
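To make the "predict and guard" strategy concrete, here's a minimal sketch of what an inline cache does at an unpredictable call site. Real VMs emit this in machine code, and `InlineCache` is an invented name, but the shape of the logic is the same: a cheap guard, a fast path, and a slow re-resolution path:

```ruby
# Minimal sketch of a monomorphic inline cache: cache the last-seen receiver
# class and its resolved method. If the guard passes, take the fast path;
# otherwise fall back to full dynamic lookup and re-cache.
class InlineCache
  def initialize
    @cached_class = nil
    @cached_method = nil
  end

  def call(receiver, name, *args)
    if receiver.class.equal?(@cached_class)        # the guard
      @cached_method.bind(receiver).call(*args)    # fast path: no lookup
    else
      @cached_class = receiver.class               # miss: resolve and re-cache
      @cached_method = @cached_class.instance_method(name)
      @cached_method.bind(receiver).call(*args)
    end
  end
end

cache = InlineCache.new
cache.call("abc", :upcase)  # miss: resolves String#upcase, caches it
cache.call("def", :upcase)  # hit: guard passes, cached method used directly
```

The guard is the tax that unpredictable shapes impose: it's cheap, but it's on every call, and it's the part a fully type-static system never has to pay.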

Now of course when I say it's not possible, I mean it's not possible for the general case. Specific cases of a known closed-world application can indeed be optimized as though the types and objects involved had static shapes. I do something along these lines in my RubyFlux compiler, which statically analyzes incoming Ruby code and assumes the methods it sees defined and the fields it sees accessed will be the only methods and fields it ever needs to worry about. But that requires omitting features that can mutate type and object structure, or else you have to have a way to know which types and objects those features will affect. Sufficiently smart compiler indeed.

Python has similar structural complexities to Ruby and adds in the additional complexity of an introspectable call stack. Under those circumstances, even on-stack execution state is not safe; a VM can't even make guarantees about the values it has in hand or the shape of a given call's activation. PyPy does an admirable job of attacking this problem by rewriting currently-running code and lifting on-stack state to the heap when it is accessed, but this approach prevents dropping unused local state (since you can't predict who might want to see it) and also fails to work under parallel execution (since you can't rewrite code another thread might be executing). Again, the dynamicity of a "cool" feature brings with it intrinsic penalties that are reducible but not removable.
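Ruby has a milder cousin of this problem that shows why introspectable frames pin down on-stack state. A `Binding` makes a frame's locals reachable after the method returns, so a VM can't just discard "dead" locals (the `make_frame`/`secret` names are invented for the example):

```ruby
# Kernel#binding captures a frame's local state. Because the Binding can be
# read later, the VM must keep even apparently-dead locals alive.
def make_frame
  secret = 42  # looks dead after this line, but...
  binding      # ...the returned Binding keeps it reachable
end

b = make_frame
b.local_variable_get(:secret)  # => 42, long after the frame "exited"
```

Python's introspectable call stack generalizes this to every frame of every caller, which is what forces PyPy into its heap-lifting contortions.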

Get to the Damn Point, Already


So what am I trying to say in all this? I started the evening by exploring a benchmark post comparing Dart's VM with JVM on the same benchmark. The numbers were not actually very exciting...with a line-by-line port from Dart to Java, Java came out slightly behind Dart. With a few modifications to the Java code, Java pulled slightly ahead. With additional modifications to the Dart code, it might leapfrog Java again. But this isn't interesting because Dart and Java can both rely on type and object shapes remaining consistent, and as a result the optimizations they perform can basically accomplish the same thing. Where it matters, they're similar enough that VMs don't care about the differences.

Where does this put languages I love, like Ruby? It's probably fair to concede that Ruby can't ever achieve the raw, straight-line performance of type-static (not statically-typed) languages like Dart or Java, regardless of the VM technologies involved. We'll be able to get close; JRuby can, with the help of invokedynamic, make method calls *nearly* as fast as Java calls, and by generating type shapes we can make object state *nearly* as predictable as Java types, but we can't go all the way. Regardless of how great the underlying VM is, if you can't hold to its immutable truths, you're walking against the wind. Ruby on Dart would probably not be any faster than Ruby on JVM, because you'd still have to implement mutable types and growable objects in pretty much the same way. Ruby on PyPy might be able to go farther, since the VM is designed for mutable types and growable objects, but you might have to sacrifice parallelism or accept that straight-line object-manipulating performance won't go all the way to a Java or Dart. Conversely, languages that make those type-static guarantees might be able to beat dynamic languages when running on dynamic language VMs (e.g. dart2js) for exactly the same reasons that they excel on their own VMs: they provide a more consistent view of the world, and offer no surprises to the VM that would hinder optimization. You trade dynamicity at the language level for predictability at the VM level.

The Actual Lesson


I guess the bottom line for me is realizing that there's always going to be a conflict between what programmers want out of programming languages and what's actually possible to give them. There's no magical fairy world where every language can be as fast as every other language, because there's no way to predict how every program is going to execute (or in truth, how a given program is going to execute given a general strategy). And that's ok; most of these languages can still get very close to each other in performance, and over time the dynamic type/object-shaped languages may offer ways to ratchet down some of that dynamism...or they might not care and just accept what limitations result. The important thing is for language users to recognize that nothing is free, and to understand the implications of language features and design decisions they make in their own programs.

9 comments:

  1. what are your thoughts on luajit?

    ReplyDelete
  2. Perhaps it could be useful for a collection of objects in a dynamic language to be able to signal to the runtime "OK, I promise that we're done being unpredictable. You can start making assumptions about our shape now."

    Because, it seems many of the benefits of monkeying with a dynamically typed language come early in the run, at the point you set everything up; they enable nicer embedded DSLs, more expressive configuration. Then there comes a point where you could say "freeze".

    Are there examples (mainstream preferred, obscure still interesting) of this out there?

    ReplyDelete
  3. @anthonybailey.net This would certainly count as "obscure", but I have a WIP project which does something similar. See http://whitequark.org/blog/2012/12/06/a-language-for-embedded-developers/ and (for a historic perspective) http://whitequark.org/blog/2011/12/21/statically-compiled-ruby/.

    ReplyDelete
  4. How does Rust compare, in your view? www.rust-lang.org

    ReplyDelete
  5. Ruby and JavaScript are dynamically typed and types and objects can change shape at runtime. This is a confluence of all the hardest-to-optimize language characteristics. In both cases, the best we can do is to attempt to predict common type and object shapes and insert guards for when we're wrong, but it's not possible to achieve the performance of a system with fully-predictable type and object shapes. Prove me wrong.

    It's not possible to achieve the MAXIMUM ACHIEVABLE performance of a system with fully-predictable type and object shapes.

    Because V8 Javascript sure beats the crap out of lots and lots of systems with "fully-predictable type and object shapes" -- just because those systems didn't have the engineering talent and optimizations that it had.

    ReplyDelete
  6. Does the JVM perform those optimizations?:

    1. polymorphic inlining - https://codereview.chromium.org/14740005 - for the deltablue benchmark http://www.dartlang.org/performance/

    This is the probable cause of the recent improvements in the Dart VM that made it leap ahead of the JVM on the deltablue benchmark.

    2. allocation sinking - https://codereview.chromium.org/14935005/ - http://wiki.luajit.org/Allocation-Sinking-Optimization - for the tracer benchmark http://www.dartlang.org/performance/

    It would be interesting to compare the JVM vs the Dart VM on tracer...

    ReplyDelete
  7. Anonymous: I'm not sure what point you're trying to make.

    Is it that V8 performs better than some systems with predictable type and object structures that had fewer or less capable engineering resources on hand? That hardly seems like a realization.

    Is it that systems without predictable type and object structures can still be optimized well? I think I said as much during my post.

    Unknown: First off, I wouldn't say "leap ahead" is true at all. It was a bit faster on the stock version of this benchmark...in the neighborhood of 10-15%.

    WRT your two feature questions...it depends on the JVM. I know Hotspot's (OpenJDK's) capabilities reasonably well, so I'll answer based on that.

    1. Hotspot will inline up to 2 targets. Past that, the site is considered polymorphic and Hotspot will use other mechanisms to optimize the dispatch...but not inlining (I'm not sure if it still inlines the hot two targets).

    2. Hotspot has a fairly limited escape analyzer that works only when all paths through code can be followed. In other words, if there's any branch that's not followed during profiling or any method that doesn't get inlined, objects that *might* flow along those paths cannot be escape-analyzed away. This is probably the biggest issue for Hotspot right now.

    ReplyDelete
  8. Great post, thanks for sharing. One quick question: I am not sure what you mean when you say that PyPy's approach to dealing with stack frames hurts parallelism - under what circumstances (other than debugging) would one thread really need to access another thread's stack state (that has not been moved to the heap yet)?

    ReplyDelete
  9. Nikolay: The problem is not that one thread would need access to another's stack state. The problem is that both threads are running the same code at the same time. You can't just arbitrarily rewrite that code to do something new if there are other threads running it, but that's the cornerstone of how PyPy does its deoptimization/reoptimization. A couple safe ways to do it would be to either have the current code branch into *new* code, which would generally require fixing up the stack and registers, or have all threads periodically check whether they should escape the current code in case it gets deoptimized or modified. At the moment, I'm not sure what the PyPy folks are planning to do.

    ReplyDelete