Thursday, April 3, 2008

Shared Data Considered Harmful

One of the big reasons people have been interested in JRuby lately is because it is currently the only Ruby implementation that can both host multiple instances in a single process and run threads *actually* in parallel. Ruby 1.8 is entirely green-threaded. Rubinius has support for multiple parallel in-process VMs, but each one is still green-threaded. Ruby 1.9 uses native threads, but they're almost always blocking on a giant lock. IronRuby uses native CLR threads, but can't host multiple instances in a single process. JRuby FTW.

Of course true parallel execution is not without its pitfalls, and we stumbled across one this week. No, it wasn't a lock contention issue or data corruption or anything you're thinking of. It was smaller, sneakier, and much lower-level.

Have a look at this code:
public class Trouble {
public static int i = 1;
public static int firedCount = 0;

public static int totalSize = 1000000000;
public static int maxLoops = 5;

public static void main(String[] args) {
final int threadCount = Integer.parseInt(args[0]);
Thread threads[] = new Thread[threadCount];

try {
for (int loops = maxLoops; loops > 0; loops--) {
firedCount = 0;
long start = System.currentTimeMillis();

for (int j = 0; j < threadCount; j++) threads[j] = new Thread() {
public void run() {
int total = totalSize / threadCount;
for (int k = 0; k < total; k++) {
if (((i += 1) & 0xFF) == 0) doNothing();
}
}
};

for (int j = 0; j < threadCount; j++) {
threads[j].start();
}

for (int j = 0; j < threadCount; j++) {
threads[j].join();
}

System.out.println("time: " + (System.currentTimeMillis() - start));
System.out.println("fired: " + firedCount);
}
} catch (Exception e) {
e.printStackTrace();
System.exit(1);
}
}

public static void doNothing() {
firedCount++;
}
}
It looks pretty straightforward, right? Given a number on the command line, it fires up that many threads, divides the overall loop count by the number of threads, and starts them all running. Each thread increments the same static field 'i', and if ever i % 256 == 0, they call doNothing which increments a hit counter. Simple!

This is obviously a pretty dumb microbenchmark, but it mirrors a similar piece of code in JRuby. Because Ruby 1.8 supports various "unsafe" threading operations, we must also support them. But since we can't kill a target thread or raise an exception arbitrarily in a target thread, we must periodically checkpoint. Every thread checks whether one of those events has fired at various points during execution. The full list is describe on the RubySpec wiki page on Ruby threading, but the bit we're interested in is the check that happens every 256 method calls. In JRuby, this was implemented with a static counter, and every time a call was made the counter was incremented and checked.

So because JRuby has the ability to *actually* run in parallel, a group at Sun has been working on some microbenchmarks to test our threaded scalability. The benchmark is fairly simple: given a array of strings, reverse each string and replace it in the array. When running with n threads, divide the array into n segments to parallelize the work. We'd naturally expect that performance would improve as threads were added up to the number of cores in the system. But we discovered a problem we never expected to see: threaded scaling completely tanked.

A Surprising Effect

Here's the result of running the Java benchmark on Java 5 with one and then two threads:
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 3924
fired: 3906250
time: 3945
fired: 3906250
time: 1841
fired: 3906250
time: 1882
fired: 3906250
time: 1896
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3243
fired: 4090645
time: 3245
fired: 4100505
time: 1173
fired: 3906049
time: 1233
fired: 3906188
time: 1173
fired: 3906134
This is on my Core Duo MacBook Pro. With one thread, overall CPU usage is at around 60-65%. Spinning up the second thread consumes the remaining 35-40%, and so we see an improvement in the overall execution time. Yay parallel execution.

But now have a look at the Java 6 numbers on the same machine:
~/NetBeansProjects/jruby ➔ java -server Trouble 1
time: 1772
fired: 3906250
time: 1973
fired: 3906250
time: 2748
fired: 3906250
time: 2114
fired: 3906250
time: 2294
fired: 3906250
~/NetBeansProjects/jruby ➔ java -server Trouble 2
time: 3402
fired: 3848648
time: 3805
fired: 3885471
time: 4145
fired: 3866850
time: 4140
fired: 3839130
time: 3658
fired: 3880202
Woah, what the hell happened here? Not only is overall performance worse, the two-thread run is significantly slower than the single thread. And we saw the exact same effect in JRuby: two threads running a given benchmark performed markedly worse than a single thread...exactly the opposite effect we'd like to see.

Caching 101

Source: CPU Cache article on Wikipedia

Accessing main memory during execution is slow...so slow that all modern processors have multiple levels of cache in which to cache memory locations that are used repeatedly. There's cache on the processor itself, cache in fast memory chips located near the processor, and so on. The cache, in whatever form, is a smaller, faster memory store. Within the cache, data is organized into cache lines, segments of data that map to specific locations in main memory. So if your processor uses cache lines of 16 bytes, the cache will hold some number of 16-byte entries for the most recently or most frequently-used data from memory.

Writes to memory also benefit from caching. Whether a cache is write-through or not, most writes do not immediately go to main memory for the same performance reasons: main memory is slow. So if you have a single processor accessing a small range of memory, rapidly reading and writing data, it may operate almost entirely from the cache. To help this process along, most compilers will try to organize memory locations likely to be accessed temporally near each other such that they'll be loaded into nearby memory locations; the more you can jam into each cache line, the more likely you'll make good use of available cache memory.

Of course at some point, the cache becomes stale, and writes must flushed and reads must be refreshed. This can happen (for example) when there's a cache miss, and the program accesses a memory location not already cached. In this case, one of the other cache entries must be evicted to make room for the new one. It will also happen naturally as the program progresses or as new programs and threads are scheduled to execute...eventually the cache must be flushed simply to ensure that main memory has been updated and all lines of execution see the change. But there's another point at which the cache or (more likely) an individual cache line might be flushed

Cache Line Ping-Pong

In a multiprocessor/multicore machine, you will usually have two or more threads actually running in parallel. Depending on the architecture, each core usually has its own cache, allowing the two threads to run independently without influencing each others' work. Of course things started to get interesting when you have multiple threads accessing the *same* memory locations, as we have in this case.

This brings us to the reason for Java 6's poor performance on the benchmark above. John Rose helped me out by disassembling the resulting native code for this benchmark on Java 5 and Java 6. It turned out that while Java 5 located the two static fields in different quadwords (16 bytes), Java 6 located both fields in the same quadword. This optimization allows better cache utilization...since the two fields are immediately next to each other in memory, they're almost certain to share a cache line. But it also has a different effect.

Cache line ping-pong happens when two threads executing against different caches need to read and write the same memory locations. Because the two threads need to see each others' changes, the cache line ends up bouncing back and forth between the two caches, generally causing a flush or refresh from main memory in the process.

Main memory is slow, remember? Because of the cache line ping-pong, adding threads for this benchmark actually caused things to slow down. Every new thread increased the amount of back-and forth between caches and main memory up to the number of cores...at which point, performance remained about constant for additional threads (because the amount of ping-pong was no worse).

It's interesting to note that single-thread performance was also considerably worse on Java 6. I don't yet have a good explanation for this.

The Fix

For the Java benchmark above, given what we now know about cache lines and such, a simple fix that Vladimir Sizikov found was to move one of the two statics to a different class. Java 6 only aligned within a single quadword the two fields when on the same class; moving one elsewhere presumably results in their living on different cache lines and the amount of ping-ponging is reduced.

In JRuby, where we had only a single counter but a lot of other stuff happening around it, the answer was much clearer: we split the single global counter into several per-thread counters and the effect was entirely erased. So the string-reversal benchmark started to scale nearly-linearly with additional threads, while Ruby 1.8, Ruby 1.9, and Rubinius showed a slight performance degradation as threads were added. JRuby FTW.

The Moral

I suppose there's a few things we can learn from this:
  1. Don't use statics - Gilad Bracha probably said it best in his article "Cutting out Static":
    Static variables are bad for for concurrency. Of course, any shared state is bad for concurrency, but static state is one more subtle time bomb that can catch you by surprise.
    Touché, Gilad.
  2. Don't share state unless you absolutely have to - This is perhaps a lesson we all believe we know, but until you're bitten by something sneaky like this you may not realize the what bad habits you have.
  3. One man's optimization is another man's deoptimization - Java 6 is significantly faster than Java 5, presumably because of optimizations like this. But every optimization has a tradeoff, and this one caught me by surprise.

12 comments:

  1. @krukow In this case, the accuracy of the counter was not important enough to warrant the overhead of synchronization on the fields. It was a quick and dirty "every-256-calls" counter intended to fire roughly on that schedule. If it missed firing once or fired twice in a shorter timeframe, it wasn't important.

    AtomicInteger would certainly be more thread-safe where accuracy is a requirement, but it would introduce the overhead of synchronization. So I'd expect the bottleneck to move from cache ping-ponging to lock contention. In this case, that overhead was totally unnecessary since the accuracy requirements of the counter are very low.

    ReplyDelete
  2. Christian SeilerApril 3, 2008 at 5:02 AM

    Avoiding statics may be a good thing, but IMO this sounds like an "enhancement opportunity" for JDK6, too.

    And a side note: When I read the subject of this post first, I thought the post would be about sharing data between JRuby instances, which I find to be something extremely useful when doing JRoR (like EhCache and stuff like that)

    ReplyDelete
  3. Apropos of nothing, this sounds like an excellent argument for a functional (immutable) approach to development: if you can only share read-only data, you aren't going to get into this kind of trouble. The significance of that system simplification never ceases to amaze me.

    ReplyDelete
  4. Timely - I just committed very similar changes to a Perl program with the same problem (always fun when your stats code is 10x slower than the actual workload).

    I'm not familiar with the JRuby internals but how hard would it be to trigger a runtime error if you write to anything which isn't explicitly thread-local without synchronization? It'd be too slow for normal use but a --see-if-I-was-as-smart-as-I-thought flag would be great for testing.

    ReplyDelete
  5. @headius

    We call that middle ground "Ocaml". :)

    ReplyDelete
  6. christian seiler: As would I! Unfortunately I don't have much time day to day to monkey around with Rails and caching and other stuff...so it won't be coming from me.

    robert fischer: That's certainly true. And there are several Java APIs that have withstood the test of time better because they produced only immutable data structures. I think there has to be a middle ground between functional programming's fully immutable objects and OO's frequent need for mutable state that doesn't harm programmers' ability to get shit done.

    chris adams: It would certainly be possible, but I'm not sure how useful it would be given how little people use synchronization primitives in Ruby. If that flag brings up a million warnings in libraries you're calling, is it going to be useful?

    ReplyDelete
  7. I think better visibility would be useful - and it would be convenient to start encouraging more thought to thread-safety - similar to the way gcc's -Wall hasn't prevented a lot of bad/risky code from being widely used but it has allowed people who care to clean things up.

    ReplyDelete
  8. I agree with krukow. You should at least add a disclaimer to this entry to alert those less familiar with concurrency that the code is incorrect. You say it is ok if a few updates are missed. The problem is there are no guarantees at all about what is going to happen. In theory you could get expected total/number of threads.

    ReplyDelete
  9. Re: Joe Armstrong

    Erlang actor's actually have piles of shared state (much more than a simple counter), so don't think that they would be faster.

    ReplyDelete
  10. Does making this variable volatile have any effect on the generated native code in this case?

    ReplyDelete
  11. The right "fix" for this code is to synchronize around all the references to the static variables.

    I hate it when people say synchronization is slow and try to come up with ways to beat the system - that's a recipie for disasterously unmaintainable code.

    The synchronize keyword in Java is not just about preventing two threads from running the same code, it is also about making sure that memory is synchronized between multiple processors, as this code demonstrates.

    My philosophy is, I don't care how fast it is if it doesn't work right.

    ReplyDelete
  12. Charlie, AtomicInteger uses non-blocking techniques when it is possible to do so on the platform. Lock contention is not necessarily an issue.

    Also, as flarosa says, If you are going to have a shared counter at all (I think it is better to avoid them altogether as you ultimately have done), please use synchronization of some kind. It isn't just an issue of the counter getting a little behind; the door is open to all sorts of problems at that point, like the counter never resetting properly, so you end up thrashing because it keeps immediately hitting 256.

    Once you leave the bounds of what the memory model promises, any illusion that your code "works" is just that -- an illusion.

    -mental

    ReplyDelete