Thursday, August 31, 2006

Performance: Inlining Strings

A while ago, we decided to inline all appropriate symbolic strings as they entered the AST. This appeared to help performance a measurable amount, presumably for two reasons:
  1. The AST would take up less space in the long term
  2. Since Strings cache their hashcodes, having each identical string in the AST be the same object would reduce the number of hash calculations.
And the numbers seem to bear it out. Here's a local rake install:

Without interning AST strings:
real 1m19.894s
user 1m18.649s
sys 0m0.920s

With interning AST strings:
real 1m15.021s
user 1m13.785s
sys 0m0.948s

So it's very this case 4-5 seconds out of 80 seconds, or about a 6% gain with interning.

However this week I realized the fallacy of the second point above. In order to intern each incoming string, Java must hash them. This causes all strings entering the AST to be hashed once anyway in order to get the interned version. In fact, it could reduce performance, since we're now forced to hash strings we might never encounter during execution.

I can't confirm that this is 100% true, but it's a reasonable theory. String.intern() is native code, but logic dictates that the fastest way to intern strings would be to use an internal hash/symbol table. So I proceeded this evening to try removing interning to see how it affected performance. I hoped to get a small gain during execution due to a large gain during parsing. The numbers above show that I lose a measurable amount overall by interning. However, I do see substantial parse performance gains:

With interning AST strings ( - t method of measuring 100 parses):

Without interning AST strings:

About 1.7 seconds out of 11, or a whopping 15% gain in parse speed without interning. AARGH.

At this point I'm leaning toward removing interning and swallowing the 6% performance hit temporarily until we can figure out where it's coming from. Logically, interning everything should only add overhead, or so my brain tells me. Where's the flaw in my logic?


  1. Isn't a bigger issue that those interned strings won't be garbage collected?

    1. I agree, I just analyzed a dump of our app and the ruby string "" alone is taking 65mb of memory. If all our ruby strings are interned, then our apps memory usage after startup will go from 500mb to 350mb.