Thursday, January 29, 2009

My Favorite Hotspot JVM Flags

I probably start up a JVM a thousand times a day. Test runs, benchmark runs, bug confirmation, API exploration, or running actual apps. And in many of these runs, I use various JVM switches to tweak performance or investigate runtime metrics. Here's a short list of my favorite JVM switches (note these are Hotspot/OpenJDK/SunJDK switches, and may or may not work on your JVM. Apple's JVM is basically the same, so these work there).

The Basics

Most runs will want to tweak a few simple flags:
  • -server turns on the optimizing JIT along with a few other "server-class" settings. Generally you get the best performance out of this setting. The default VM is -client, unless you're on 64-bit (it only has -server).
  • -Xms and -Xmx set the minimum and maximum sizes for the heap. Hotspot puts a cap on heap size (touted as a feature) to prevent it from blowing out your system. So once you figure out the max memory your app needs, you cap it to keep rogue code from impacting other apps. Use these flags like -Xmx512M, where the M stands for MB. If you don't include a suffix, you're specifying bytes. Several flags use this format. You can also get a minor startup perf boost by setting the minimum higher, since the JVM doesn't have to grow the heap right away.
  • -Xshare:dump can help improve startup performance on some installations. When run as root (or whatever user you have the JVM installed as) it will dump a shared-memory file to disk containing all of the core class data. This file is much faster to load than re-verifying and re-loading all the individual classes, and once in memory it's shared by all JVMs on the system. Note that -Xshare:off, -Xshare:on, and -Xshare:auto set whether "Class Data Sharing" is enabled, and it's not available on the -server VM or on 64-bit systems. Mac users: you're already using Apple's version of this feature, upon which Hotspot's version is based.
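To see what -Xms and -Xmx actually did to a run, a small Java program can ask the JVM for its heap limits. This is only a minimal sketch: the HeapSizes class and its parseSize helper are hypothetical names made up for illustration, and the parser merely mimics the 512M-style suffix format described above rather than reproducing Hotspot's own parsing.

```java
import java.util.Locale;

/** Sketch: interpret -Xmx style size strings and print the heap limits of the running JVM. */
class HeapSizes {

    /** Parses a size such as "512M", "64k", "2g" or "1048576" (no suffix means bytes). */
    static long parseSize(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        long multiplier = 1L;
        char last = s.charAt(s.length() - 1);
        if (last == 'k') {
            multiplier = 1024L;
        } else if (last == 'm') {
            multiplier = 1024L * 1024L;
        } else if (last == 'g') {
            multiplier = 1024L * 1024L * 1024L;
        }
        if (multiplier != 1L) {
            s = s.substring(0, s.length() - 1); // drop the suffix character
        }
        return Long.parseLong(s) * multiplier;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most heap the JVM will try to use; it may differ somewhat from -Xmx.
        System.out.println("max heap:     " + rt.maxMemory() + " bytes");
        // totalMemory() is the heap currently reserved; it starts near -Xms and can grow.
        System.out.println("current heap: " + rt.totalMemory() + " bytes");
        System.out.println("-Xmx512M asks for " + parseSize("512M") + " bytes");
    }
}
```

Running it as java -Xms64M -Xmx512M HeapSizes and again with different values should show the reported limits moving with the flags.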
There are also some basic flags for logging runtime information:
  • -verbose:gc logs garbage collector runs and how long they're taking. I generally use this as my first tool to investigate if GC is a bottleneck for a given application.
  • -Xprof turns on a low-impact sampling profiler. I've had Hotspot engineers recommend I "don't use this" but I still think it's a decent (albeit very blunt) tool for finding bottlenecks. Just don't use the results as anything more than a guide.
  • -Xrunhprof turns on a higher-impact instrumenting profiler. The default invocation with no extra parameters records object allocations and high-allocation sites, which is useful for finding excess object creation. -Xrunhprof:cpu=times instruments all Java code in the JVM and records the actual CPU time calls take. I generally only use this to profile JRuby internals because it's extremely slow, but it's also much more accurate than -Xprof.
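A throwaway allocation loop is a handy target for trying the logging flags above. The sketch below is an assumption-laden toy (the GcChurn class and churn method are hypothetical names): it creates short-lived garbage and then reads the collector counters through the standard java.lang.management API, which reports roughly the same information -verbose:gc logs as it happens. The JIT is free to optimize some of these allocations away, so treat the numbers as a guide only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: generate garbage, then report how often each collector ran. */
class GcChurn {

    /** Allocates and immediately drops 'rounds' small arrays; returns the total bytes requested. */
    static long churn(int rounds) {
        long total = 0;
        for (int i = 0; i < rounds; i++) {
            byte[] junk = new byte[1024];
            junk[0] = (byte) i;   // write to the array so it is actually used
            total += junk.length;
        }
        return total;
    }

    public static void main(String[] args) {
        long bytes = churn(2000000);
        System.out.println("requested about " + bytes + " bytes");
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms total");
        }
    }
}
```

Try it with java -verbose:gc GcChurn and compare the logged collections against the counters printed at the end.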

Deeper Magic

Eventually you may want to tweak deeper details of the JVM:
  • -XX:+UseParallelGC turns on the parallel young-generation garbage collector. This is a stop-the-world collector that uses several threads to reduce pause times. There's also -XX:+UseParallelOldGC to use a parallel collector for the old generation, but it's generally only useful if you often have large numbers of old objects getting collected.
  • -XX:+UseConcMarkSweepGC turns on the concurrent mark-sweep collector. This one runs most GC operations concurrently with your application's execution, reducing pauses significantly. It still stops the world for a couple of short phases (initial mark and remark), but that's usually quicker than pausing for the whole set of GC operations. This is useful if you need to reduce the impact GC has on an application run and don't mind that it's a little slower than the full stop-the-world versions. Also, you obviously would need multiple processors to see full effect. (Incidentally, if you're interested in GC tuning, you should look at Java SE 6 HotSpot Virtual Machine Garbage Collection Tuning. There's a lot more there.)
  • -XX:NewRatio=# sets the desired ratio between the "new" and "old" generations in the heap, expressed as the size of old relative to new (so NewRatio=8 means new:old is 1:8). The defaults are 1:12 in the -client VM and 1:8 in the -server VM. You often want a larger "new" generation if you have a lot more transient data flowing through your application than long-lived data. For example, Ruby's high object churn often means a lower NewRatio (i.e. larger "new" versus "old") helps performance, since it helps keep transient objects from getting promoted to the old generation.
  • -XX:MaxPermSize=###M sets the maximum "permanent generation" size. Hotspot is unusual in that several types of data get stored in the "permanent generation", a separate area of the heap that is only rarely (or never) garbage-collected. The list of perm-gen hosted data is a little fuzzy, but it generally contains things like class metadata, bytecode, interned strings, and so on (and this certainly varies across Hotspot versions). Because this generation is rarely or never collected, you may need to increase its size (or turn on perm-gen sweeping with a couple other flags). In JRuby especially we generate a lot of adapter bytecode, which usually demands more perm gen space.
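The NewRatio arithmetic is easy to get backwards, so here is a minimal sketch of it, assuming NewRatio=N means old:new is N:1 and therefore the new generation gets about 1/(N+1) of the heap. The GenerationSizes class and youngBytes method are hypothetical names for illustration. The main method also lists the memory pools the running JVM reports, whose names vary by collector and Hotspot version.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Sketch: estimate the new generation size implied by NewRatio, then list the real pools. */
class GenerationSizes {

    /** New generation share of the heap when old:new is newRatio:1. */
    static long youngBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    public static void main(String[] args) {
        long heap = 512L * 1024 * 1024;
        System.out.println("NewRatio=8 gives a new generation of about " + youngBytes(heap, 8) + " bytes");
        System.out.println("NewRatio=2 gives a new generation of about " + youngBytes(heap, 2) + " bytes");
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();   // may be null if the pool is no longer valid
            if (usage != null) {
                // getMax() returns -1 when the pool has no defined maximum
                System.out.println(pool.getName() + " (" + pool.getType() + "): used="
                        + usage.getUsed() + ", max=" + usage.getMax());
            }
        }
    }
}
```

Running it under different collector and NewRatio settings shows how the pool names and limits shift.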
And there are a few more advanced logging and profiling options as well:
  • -XX:+PrintCompilation prints out the name of each Java method Hotspot decides to JIT compile. The list will usually show a bunch of core Java class methods initially, and then turn to methods in your application. In JRuby, it eventually starts to show Ruby methods as well.
  • -XX:+PrintGCDetails includes the data from -verbose:gc but also adds information about the size of the new generation and more accurate timings.
  • -XX:+TraceClassLoading and -XX:+TraceClassUnloading print information about class loads and unloads. Useful for investigating if you have a class leak or if old classes (like JITed Ruby methods in JRuby) are getting collected or not.
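To watch -XX:+PrintCompilation do something recognizable, it helps to have a tiny program with one obviously hot method. The following is a sketch only (HotLoop, mix, and run are hypothetical names): it calls a small method millions of times, which should be enough for Hotspot to compile it, and then prints class loading totals from the standard ClassLoadingMXBean as a rough cross-check on the class tracing flags.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: a hot little method for the JIT to notice, plus class loading totals. */
class HotLoop {

    /** Small method, called often enough that the JIT should pick it up. */
    static int mix(int x) {
        return (x * 31) ^ (x >>> 3);
    }

    /** Calls mix() for 0..iterations-1 and sums the results (int overflow is fine here). */
    static int run(int iterations) {
        int acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += mix(i);
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println("result: " + run(5000000));
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        System.out.println("classes loaded: " + cl.getTotalLoadedClassCount()
                + ", unloaded: " + cl.getUnloadedClassCount());
    }
}
```

With java -XX:+PrintCompilation HotLoop, the method names from this class should turn up somewhere after the initial burst of core class methods.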

Into The Belly

Finally here's a list of the deepest options we use to investigate performance. Some of these require a debug build of the JVM, which you can download from java.net.

Also, some of these may require you also pass -XX:+UnlockDiagnosticVMOptions to enable them.
  • -XX:MaxInlineSize=# sets the maximum size method Hotspot will consider for inlining. By default it's set at 35 *bytes* of bytecode (i.e. pretty small). This is largely why Hotspot really likes lots of small methods; it can then decide the best way to inline them based on runtime profiling. You can bump it up, and sometimes it will produce better performance, but at some point the compilation units get large enough that many of Hotspot's optimizations are skipped. Fun to play with though.
  • -XX:CompileThreshold=# sets the number of method invocations before Hotspot will compile a method to native code. The -server VM defaults to 10000 and -client defaults to 1500. Large numbers allow Hotspot to gather more profile data and make better decisions about inlining and optimizations. Smaller numbers reduce "warm up" time.
  • -XX:+LogCompilation is like -XX:+PrintCompilation on steroids. It not only prints out methods that are being JITed, it also prints out why methods may be deoptimized (like if new code is loaded or a new call target is discovered) and information about which methods are being inlined. There's a caveat though: the output is seriously nasty XML without any real structure to it. I use a Sun-internal tool for rendering it in a nicer format, which I'm trying to get open-sourced. Hopefully that will happen soon. Note, this option requires -XX:+UnlockDiagnosticVMOptions.
And finally, my current absolute favorite option, which requires a debug build of the JVM:
  • -XX:+PrintOptoAssembly dumps to the console a log of all assembly being generated for JITed methods. The instructions are basically x86 assembly with a few Hotspot-specific instruction names that get replaced with hardware-specific instructions during the final assembly phase. In addition to the JITed assembly, this flag also shows how registers are being allocated, the probability of various branches being followed (along with multiple assembly blocks for the different paths), and information about calls back into the JVM. Outside the logging options for the final generated assembly (which requires a separate plugin) this is the best tool for discovering what optimizations are actually happening. I use this at least a couple times a week to investigate JRuby performance enhancements.
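Before changing tunables like MaxInlineSize or CompileThreshold, it can be useful to check what the running JVM is actually using. The sketch below assumes a Hotspot-based JVM on Java 7 or later, where the com.sun.management.HotSpotDiagnosticMXBean interface is available; FlagValues and flagValue are hypothetical names, and the helper simply returns null when a flag is unknown or the bean is not there.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Sketch: read current -XX flag values from a Hotspot-based JVM. */
class FlagValues {

    /** Returns the current value of a -XX flag, or null if this JVM does not expose it. */
    static String flagValue(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption(name).getValue();
        } catch (RuntimeException e) {
            return null; // unknown flag, or not a Hotspot-style JVM
        }
    }

    public static void main(String[] args) {
        String[] names = {"MaxInlineSize", "CompileThreshold", "NewRatio"};
        for (String name : names) {
            System.out.println(name + " = " + flagValue(name));
        }
    }
}
```

Running it once plainly and once with something like -XX:MaxInlineSize=100 shows whether the override took effect.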

And So Much More

Hotspot has literally hundreds of different flags (and here's another list specific to Java 6), and dozens of them that might be useful to you. I may add a few more to this post as I remember them, but this list includes all those I use on a regular basis. If you're using JRuby, you can use the -J flag to pass any of these flags through to the JVM, as in -J-XX:+PrintCompilation.
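When flags are passed through a launcher (such as JRuby's -J prefix), it is worth confirming they reached the JVM. Here is a minimal sketch using the standard RuntimeMXBean; the ShowJvmFlags class and its tuningFlags helper are hypothetical names, and the "-X" prefix filter is just an assumption about which arguments are interesting.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Sketch: print the -X and -XX switches the running JVM was started with. */
class ShowJvmFlags {

    /** Keeps only the arguments that start with "-X" (this covers -XX flags as well). */
    static List<String> tuningFlags(List<String> jvmArgs) {
        List<String> flags = new ArrayList<String>();
        for (String arg : jvmArgs) {
            if (arg.startsWith("-X")) {
                flags.add(arg);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        List<String> all = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("tuning flags seen by this JVM: " + tuningFlags(all));
    }
}
```

Started as java -Xmx512M -XX:+PrintCompilation ShowJvmFlags, it should list both switches.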

What are some of your favorite Hotspot JVM flags?

Update: Another couple that commenters added or reminded me of:
  • Marcus Kohler commented on -XX:+HeapDumpOnOutOfMemoryError, useful if you have a slow-leaking application you can't pin down. It will dump heap information to disk whenever there's an OutOfMemoryError, allowing you to do offline analysis.
  • j6wbs mentioned that you can send SIGQUIT (or hit Ctrl+Backslash or Ctrl+Break in the console) to dump the current execution stack of all running threads. This is especially nice if you have a runaway app or if an app appears to have frozen.
  • karld offers up -XX:OnOutOfMemoryError="mail -s 'OOM on `hostname` at `date`' whoever@example.com <<< ''" as a way to send out email when there's an OutOfMemoryError. Poor-man's monitoring!
  • I also remembered a very important option for JRuby: -Xbootclasspath specifies classpath entries you want loaded without verification. The JVM verifies all classes it loads to ensure they don't try to dereference an object with an int, pop extra entries off the stack or push too many, and so on. This verification is part of the reason why the JVM is very stable, but it's also rather costly, and responsible for a large part of startup delay. Putting classes on the bootclasspath skips this cost, but should only be used when you know the classes have been verified many times before. In JRuby, this reduced startup time by half or more for a simple script. Use -Xbootclasspath/a: and -Xbootclasspath/p: to append and prepend to the default bootclasspath or -Xbootclasspath: to completely set your own.
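The SIGQUIT thread dump mentioned above also has a rough in-process cousin, which can help when you cannot reach the console. This is a sketch under that assumption, not a replacement for the real dump: ThreadDumpSketch and dump are hypothetical names, and the output format is made up rather than matching what the JVM prints on SIGQUIT.

```java
import java.util.Map;

/** Sketch: build a simple text dump of all live threads from inside the application. */
class ThreadDumpSketch {

    /** Returns one header line per live thread followed by its current stack frames. */
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            out.append('"').append(t.getName()).append("\" state=")
               .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Calling dump() from a watchdog thread or an admin hook gives a quick look at where a seemingly frozen app is stuck.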

17 comments:

  1. Marcus: Good one. And also related to that the 'jmap' tool for remotely examining or dumping the heap, and the 'jhat' tool for analyzing that heap and serving up (via localhost http) a set of pages for browsing heap information.

    And of course jconsole, which gives you a remote management console for the JVM with threading, memory, GC, and other management information and tools.

  2. Great post, except the entry for -XX:+UseConcMarkSweepGC should have read:

    DO NOT USE.

    Really. It's completely broken for a real app.

  3. Taylor: Can you elaborate? I know several folks running with CMS in production without problems.

  4. Agreed
    -XX:+UseConcMarkSweepGC

    works pretty well at least on the SAP JVM Hotspot:)

  5. My favorite one is: -XX:++ForGodsSakeStartupThisAppAsFastAsNativeApss!!!111!!

    ;-)

  6. Have you talked to any of your colleagues about dtrace? Maybe you're not using Solaris by default, but it might be worth it.

  7. Behrang: Actually there's a magic flag that's almost as good as that. Once you're using shared class data (-Xshare) if you have a large application that can run out of the bootstrap classloader, you can use -Xbootclasspath to load the app without costly classloader verification. In JRuby, this cut our startup time in half. I think I'll add it to the main article.

    Bill: We'd love to use DTrace, but I think we really just need someone to sit down and show us how. Or perhaps a book/tutorial recommendation?

  8. @Charles,

    We've seen that in production high load usage, ConcMarkSweepGC works fine for about two weeks as advertised (give or take considering load and actual use case), and then goes into a pathological stop-the-world GC that can last anywhere from 30 minutes to 2 hours depending on heap size.

    We suspect this is due to accumulated heap fragmentation that ultimately results in the need for the stop the world collection to clean everything out.

    Also although we cannot confirm it, anecdotally we seem to have run into more JVM kernel crashes with it on than off.

  9. "Or perhaps a book/tutorial recommendation?"

    You work for Sun right? I would suggest that you offer beers to the guys who wrote it for some of their time. ;-)

  10. Charles, I used to use -Xverify:none to skip the verification and I am not still satisfied with that. Does -Xbootclasspath reduce the startup time even more?

  11. @Taylor,

    Does the 2 hour pause occur for a 4gig heap size or does it occur for larger (smaller?) heap sizes?

  12. @taylor .. Did you open a case on the cms behavior and the other crashes? Let me know.

  13. Peter: Thanks for that, I had not tried tiered compilation yet myself. This is only in recent OpenJDK 7, yes?

  14. Note that there's -noverify to disable verifying the bytecodes.
    Always use this for Eclipse

  15. Charles: Tiered compiler is in the later JDK6 (at least since u10) as well as OpenJDK7.

  16. How to do data cache misses for a Java program? Any idea?
    Thanks a lot

  17. So...what is your typical "go to" command line for running jruby apps as fast as possible using hotspot?
