Java Regex Broken for Alternation over Large Strings

This is a serious bummer.

I've been working through some of Rails's test cases this evening. Since I've got Depot running reasonably well now and I know I can easily wire together a simple Rails app that calls Hibernate or EJBs, I figured I'd try to get a better handle on the true state of Rails support.

As I'm running test cases--the majority of which appear to be passing, btw--I came across a couple goofy regex issues. The first is one we'd seen before, where dot refuses to match \r. In this case, adding DOT_ALL to the Pattern's flags seems to resolve it. I think we just hadn't noticed that flag before, so many other similar bugs might be resolved the same way. Not an easy issue to narrow down, but I found it and moved on.

So then I started seeing SystemStackErrors popping up within the CGI library. When I see stack overflows in JRuby, I usually figure it's one of two things:
  1. Something's wrong with JRuby that's breaking a termination condition in Ruby code, so a recursion is never ending
  2. JRuby itself is missing a termination and recursing endlessly
Pretty simple issues to track down...find where the recursion is happening, and figure out why it's not terminating. Except in this case, the recursion wasn't in either Ruby or JRuby code. It was in Java's regex library.

Blowing The Stack

O Java, how I love Thee. Every library I could want, all in one place. You care for me so well. It pains me to find your faults.

So it appears that the Java regex library's compiled patterns execute recursively. This alone wouldn't be particularly unusual, especially if the recursion was limited by the complexity of the regex itself. However in this case it seems the recursion is a factor of the string to match, rather than of the expression complexity. With a large enough input string, I can get Java's regex to blow the stack every time.
regex = /\A((?:.|\n)*?)?([\r\n]{1,2}|--)/n
long_string = ""
76.times { long_string += "xxxxxxxxxxxxxxxxxx" }

long_string.sub(r) {""}

This bit of Ruby code will cause java.util.regex.Pattern to blow the stack during matching. I'm sure this regex could be reduced to a simpler case, but I don't really need to go through the exercise of doing so; others have reported the same issue:
So the issue stands, unfortunately. I'm not sure I'd characterize it as an RFE...I'm no regex expert, but I would hope my regex engine wouldn't overflow based on large input, especially if that input was completely bogus as in my example above. I don't think that's unreasonable, especially considering that Ruby's regex engine appears to handle this case (and considerably larger cases) just fine. Correct me if I'm expecting too much.

So what's the damage? Well, since we currently use java.util.regex exclusively, and since Ruby libraries and apps very frequently use regex to match against very large strings, we'll have to migrate to a different regex library. Until that happens, however, Rails under JRuby will not support large multipart posts, at a minimum. There may be other places this has an effect, but multipart posting was the area I investigated.

So for now...I guess it's just busted. We'll begin evaluating alternative regex libraries posthaste. Any help from you folks would be appreciated...just take that regex or one from the bugs and try to match against extremely large input.

Rewrite the Regex?

Some may say that the regex should be rewritten to avoid the use of alternation. Perhaps so, perhaps there's a better way. However Ruby handles it fine, so requesting that the Rails folks or any other Ruby devs start tailoring their regex so they will work under JRuby is quite out of the question. If we can't run the same regex against the same input, we need a different regex library...that's all there is to it.

Besides, the bugs posted have additional regex that are even simpler. It just seems to be a limitation of the Java regex library that certain regex will blow the stack on large input.

For the record, here's the relevant bit of the stack. I get pages and pages of this, under both Java 5 and Java 6 beta 2...both 64-bit under Linux:
...
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4241)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4111)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:3962)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3314)
at java.util.regex.Pattern$Branch.match(Pattern.java:3998)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4052)
...

What a downer.

Update (2011-3-1): Someone stumbled across this post recently and I realized I'd never updated it with a solution for this problem. I checked out several alternative Java regex engines. Some had the same problem, being implemented the same way, but JRegex both resolved the issue and performed very well. I'd recommend it to anyone that needs to work around the problems in Java's built-in regex library.
Written on September 4, 2006