To Multibyte, Or Not To Multibyte
We've been wrestling with parser speed this past week on the JRuby project, tweaking the lexer, fiddling with the grammar and parser generator, and micro-optimizing all the various support classes. None of those experiments have helped much; performance in each case improved by only a few percentage points.
We've also been wrestling with the issue of Unicode support, since Java supports it well and Ruby does not. We're caught between worlds here, not wanting to create an incompatible Ruby but realizing the absurdity of our lacking Unicode support under Java.
It seems that solutions for the two issues may be mutually exclusive.
Character Pain
After a recent speed comparison between Java, Ruby, C, and a few other languages erupted on the ruby-talk mailing list, it quickly became apparent how expensive writing UTF-16 character sequences to a single-byte encoding can be. The best-optimized version of the program on that thread pre-encoded and cached all strings to be written (none were dynamically generated) as byte[], avoiding the cost of encoding them later during a stream write. Because of that version's success, I started to wonder what the cost of reading and writing char versus byte in Java really is. The results surprised me.
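To make that trick concrete, here's a minimal sketch of the two write paths; it is not the program from the thread, and the class and method names are mine:

    import java.io.*;

    // Sketch only: contrasts a Writer that encodes chars on every call with
    // pre-encoding the string to byte[] once and writing raw bytes afterward.
    public class PreEncodeDemo {
        static void writeViaWriter(OutputStream out, String line, int times) throws IOException {
            Writer writer = new OutputStreamWriter(out, "ISO-8859-1");
            for (int i = 0; i < times; i++) {
                writer.write(line); // every call down-encodes the UTF-16 chars
            }
            writer.flush();
        }

        static void writePreEncoded(OutputStream out, String line, int times) throws IOException {
            byte[] encoded = line.getBytes("ISO-8859-1"); // pay the encoding cost once
            for (int i = 0; i < times; i++) {
                out.write(encoded); // raw byte copies from here on
            }
            out.flush();
        }
    }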
When I first suggested this comparison on the JRuby dev list, Ola Bini quickly tested out a version of the lexer that used only streams and byte[], rather than readers and Strings. With no other optimizations, that change improved the overall parse performance by almost 20% (Java 5 on Windows x86). Shocking, to say the least.
Given that surprising speed boost, I thought I'd run a few microbenchmarks on reading and writing bytes and characters.
The Test
The source files come in two flavors:
- yes.txt, an ISO-8859-1-encoded file filled with "y" characters
- yes2.txt, the same file encoded in UTF-16

Each test run then exercises eight scenarios:
- reading bytes straight out of the file
- reading characters straight out of the file
- buffered reading of bytes straight out of the file
- buffered reading of characters straight out of the file
- reading bytes from a byte array
- reading characters from a byte array
- writing bytes to a byte array
- writing characters to a byte array
Each scenario was run for 1000 cycles, reconstructing the streams and readers on every cycle. Each cycle read (or wrote) the equivalent of 10 million characters, which in the case of yes2.txt meant reading 20 million bytes.
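The timing loop looks roughly like the sketch below; the real source is linked at the end of the post, so treat this as illustrative only (class name, buffer size, and file name are assumptions, and only two of the eight scenarios are shown):

    import java.io.*;

    public class ReadBench {
        private static final int CYCLES = 1000;

        public static void main(String[] args) throws IOException {
            // Direct byte reads: raw bytes off the stream, no decoding at all.
            long start = System.currentTimeMillis();
            for (int cycle = 0; cycle < CYCLES; cycle++) {
                InputStream in = new FileInputStream("yes.txt");
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) { /* discard */ }
                in.close();
            }
            System.out.println(CYCLES + " direct byte reads from file: "
                    + (System.currentTimeMillis() - start));

            // Direct char reads: every byte is decoded up to UTF-16 first.
            start = System.currentTimeMillis();
            for (int cycle = 0; cycle < CYCLES; cycle++) {
                Reader in = new InputStreamReader(new FileInputStream("yes.txt"), "ISO-8859-1");
                char[] buf = new char[8192];
                while (in.read(buf) != -1) { /* discard */ }
                in.close();
            }
            System.out.println(CYCLES + " direct char reads from file: "
                    + (System.currentTimeMillis() - start));
        }
    }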
All tests were run on an Opteron 150, 2.6GHz, running 64-bit Linux and 64-bit Java 5.
Results: yes.txt, ISO-8859-1, 10 million characters (10 million bytes)
I first ran against the single-byte version (all times are in milliseconds):
1000 direct byte reads from file: 22435
1000 direct char reads from file: 112372
1000 buffered byte reads from file: 22625
1000 buffered char reads from file: 112594
1000 buffered byte reads from array: 9477
1000 buffered char reads from byte array: 107975
1000 buffered byte writes to array: 8556
1000 buffered char writes to byte array: 16198
Ouch. In both the buffered and unbuffered reads from a file, characters fare rather poorly, taking about five times as long as bytes. Note that buffering didn't really help here, since filesystem IO is apparently not a limiting factor on this machine.
Notice also how little the character reads improve when the source is an in-memory byte array. In this case, I had the code read from an InputStreamReader wrapped around a ByteArrayInputStream. It's certainly possible this number would improve by passing the byte[] directly to a String constructor, but the current code is far slower than I expected.
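For the curious, the two approaches I'm contrasting look roughly like this sketch (my own method names, assuming the single-byte file, so one char per byte):

    import java.io.*;

    public class InMemoryDecode {
        // What the test does: decode incrementally through the Reader chain.
        static char[] viaReader(byte[] data) throws IOException {
            Reader r = new InputStreamReader(new ByteArrayInputStream(data), "ISO-8859-1");
            char[] out = new char[data.length]; // one char per byte for ISO-8859-1
            int off = 0, n;
            while ((n = r.read(out, off, out.length - off)) > 0) {
                off += n;
            }
            return out;
        }

        // The possible alternative: let the String constructor decode the whole
        // byte[] in one shot.
        static char[] viaStringConstructor(byte[] data) throws IOException {
            return new String(data, "ISO-8859-1").toCharArray();
        }
    }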
Not terribly surprising is how much better character writes to a byte array performed. Down-encoding from UTF-16 to a single-byte encoding--especially when all the characters are ASCII--is pretty cheap. Still, it took twice as long as writing bytes.
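The character-write path here is essentially an encoder sandwiched between the char[] and the byte array, along these lines (again a sketch, with my own names):

    import java.io.*;

    public class DownEncodeWrite {
        // Char writes: every char runs through the UTF-16 -> ISO-8859-1 encoder.
        static byte[] charWrites(char[] chars) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream(chars.length);
            Writer w = new OutputStreamWriter(bytes, "ISO-8859-1");
            w.write(chars);
            w.flush();
            return bytes.toByteArray();
        }

        // Byte writes: no encoder involved, essentially just an array copy.
        static byte[] byteWrites(byte[] src) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream(src.length);
            bytes.write(src, 0, src.length);
            return bytes.toByteArray();
        }
    }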
Results: yes2.txt, UTF-16, 10 million characters (20 million bytes)
1000 direct byte reads from file: 44717
1000 direct char reads from file: 126998
1000 buffered byte reads from file: 44595
1000 buffered char reads from file: 126401
1000 buffered byte reads from array: 17423
1000 buffered char reads from byte array: 122082
1000 buffered byte writes to array: 17893
1000 buffered char writes to byte array: 57915
Here the character reads fare better, but not by much. While the byte reads took twice as long (duh, we're reading twice as many bytes), the character reads increased by only about 10%. Since the work done for a character read is a superset of the work done for a byte read, decoding UTF-16 bytes into UTF-16 chars is clearly faster than decoding single-byte data up to UTF-16. Unfortunately, any such gain is wiped out by having to read twice as much data.
The write numbers are confusing, and could indicate an error in my test. Where the byte writes took twice as long, the character writes took almost four times as long. Either I'm doing something wrong or someone else is. If anything, I would have expected character write performance to degrade no more than byte write performance, since no down-encoding is necessary this time. And if I had wired the test wrong and down-encoding were actually occurring, the numbers should have matched the single-byte file.
How This Affects JRuby
MRI (Matz's Ruby Interpreter) currently has poor support for Unicode, mostly cobbled together from various community projects. Ruby 2.0 promises support for every string encoding possible, Unicode encodings among many others, but we're unlikely to see it for well over a year. Because JRuby runs on Java, toeing that line and avoiding Unicode support ourselves simply doesn't make sense. As much as we'd like to avoid diverging from MRI, many Javaists simply can't use JRuby effectively without Unicode.
A number of different schemes have been discussed for supporting Unicode in JRuby. Some are based on the Ruby 2.0 plans, or at least as much of them as we can adopt without breaking compatibility with Ruby 1.8, its libraries, or applications written for it. Some leverage the fact that our Ruby String implementation already uses Java's UTF-16 String, simply allowing incoming files to be in any encoding and letting the parser work with full UTF-16 characters rather than our present 0xFF-masked byte-in-a-char. Still others propose we support multibyte encodings, but only in literal strings...which matches the Ruby 2.0 plan to allow only single-byte-encoded identifiers in code but any encoding for embedded string literals.
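For context, the "0xFF-masked byte-in-a-char" approach amounts to widening each raw byte into a char slot without ever running a real decoder; a minimal sketch (not the actual lexer code):

    import java.io.*;

    public class ByteInChar {
        // Each byte becomes one "character"; multibyte sequences are never
        // assembled, so anything beyond single-byte encodings is mangled.
        static char[] readMasked(InputStream in) throws IOException {
            ByteArrayOutputStream raw = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                raw.write(buf, 0, n);
            }
            byte[] bytes = raw.toByteArray();
            char[] chars = new char[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                chars[i] = (char) (bytes[i] & 0xFF); // mask, don't decode
            }
            return chars;
        }
    }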
The simplest of these schemes to support, obviously, is to just let Java handle decoding the incoming stream, possibly allowing a pragma line (Ruby 2.0-style) to specify the encoding. While reading, we would handle the pragma and decode the remainder of the file from the given encoding into full UTF-16 Strings. This achieves the primary goal of Unicode string literals, but has the side effect of also allowing Unicode identifiers, something which is so far not planned for Ruby 2.0.
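A rough sketch of that scheme follows; the pragma syntax ("coding:" on the first line) is borrowed from the Ruby 2.0 discussion, and the helper is hypothetical rather than anything currently in JRuby:

    import java.io.*;

    public class PragmaDecode {
        // Peek at the first line as single-byte text, look for a coding pragma,
        // then hand the whole stream to Java's decoder for that charset.
        static Reader openScript(InputStream in) throws IOException {
            PushbackInputStream push = new PushbackInputStream(in, 512);
            byte[] head = new byte[512];
            int n = push.read(head);
            String peek = n > 0 ? new String(head, 0, n, "ISO-8859-1") : "";
            int eol = peek.indexOf('\n');
            String firstLine = eol >= 0 ? peek.substring(0, eol) : peek;

            String encoding = "ISO-8859-1"; // default when no pragma is present
            int marker = firstLine.indexOf("coding:");
            if (marker >= 0) {
                // take the token following "coding:" as the charset name
                String rest = firstLine.substring(marker + "coding:".length()).trim();
                int stop = 0;
                while (stop < rest.length() && !Character.isWhitespace(rest.charAt(stop))) stop++;
                if (stop > 0) encoding = rest.substring(0, stop);
            }
            if (n > 0) push.unread(head, 0, n); // rewind so the parser sees every character
            return new InputStreamReader(push, encoding);
        }
    }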
The Ruby 2.0-ish way to handle encodings would be to read the file in as a single-byte encoding first, only using specialized encodings when encountering string literals. Say what you want about that method; I won't comment on its quality, but I will say it would be considerably more difficult for us to implement, and I'm not sure how you would embed non-ASCII-compatible string literals into an ASCII-compatible script file.
I am leaning toward full Unicode support, where incoming files can be in any encoding Java supports and all text can use the full complement of UTF-16 Unicode characters. The compatibility story for existing Ruby code is clear: almost everything out there right now is in an ASCII-compatible format, which we'd be able to support without any work at all. However, JRuby scripts that use Unicode characters would almost certainly be incompatible with MRI if any of those characters require multiple bytes; it would be impossible, for example, to take a UTF-16-encoded JRuby file and run it under MRI without modification.
So What?
There are two conflicting goals here: performance and Unicode support.
On the performance front, we would like to always read, parse, and store simple bytes, rather than paying the thunk cost for every character. Perhaps more seriously and drastically, we'd like to use a byte[]-based UTF-8 String implementation internally, since Ruby uses String as a general-purpose byte buffer (for which we currently pay the thunk cost on every read or write operation). The cost of using characters internally, when everything else arrives as bytes, is apparent from the benchmark numbers.
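In outline, that internal representation is nothing more than a growable byte[] that defers decoding until someone actually asks for characters; a hypothetical sketch, not an existing JRuby class:

    import java.io.UnsupportedEncodingException;

    public class ByteBackedString {
        private byte[] bytes = new byte[16];
        private int length = 0;

        // IO appends raw bytes with no per-character thunk cost.
        public void append(byte[] src, int off, int len) {
            if (length + len > bytes.length) {
                byte[] grown = new byte[Math.max(bytes.length * 2, length + len)];
                System.arraycopy(bytes, 0, grown, 0, length);
                bytes = grown;
            }
            System.arraycopy(src, off, bytes, length, len);
            length += len;
        }

        // The byte-to-char cost is paid only here, and only when asked for.
        public String toJavaString() {
            try {
                return new String(bytes, 0, length, "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e); // UTF-8 is always available
            }
        }
    }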
On the Unicode front, we'd like seamless, Java-style Unicode support without quirks or gotchas. We'd like to continue using Java's String internally, and do all our parsing through readers. We would have to suck up the (sometimes large) thunk cost, but we'd have arguably the best Unicode support of any Ruby implementation currently available. We would unfortunately also then support writing scripts that are incompatible with MRI.
What To Do?
All these numbers and all these ideas boil down to a few key questions:
- Is the ability to create incompatible scripts for JRuby a showstopper? Is it enough to warn people that we support Unicode more fully than MRI, but that support comes at a price? Is full Unicode support more important than backward-compatibility for JRuby scripts under Ruby 1.8 (or even forward-compatibility for JRuby scripts under Ruby 2.0 as currently specified)?
- Is there anything that can be done about the dismal performance of byte-to-char thunking? It worries me for parsing, but worries me even more for our String implementation, which uses the char-based StringBuffer internally as a byte buffer for all Ruby's IO operations. Are parse and Ruby IO performance more important than full Unicode support? Should we hobble JRuby for (perhaps large) performance gains?
Update: The source code for the test, as it was run, is available here.
Written on August 18, 2006