A great writeup and I feel it's a welcome reminder of some important points:
- your programming language and runtime aren't equally good at everything and never will be
- you have to choose your environment based on your anticipated workload
- you have to understand what your anticipated workload actually is
- you can't just treat the garbage collector as a magic box that does the memory thing for you
Unless, of course, you're writing software which is nowhere near the performance boundaries of the system and never will be.
Can't say I share the author's surprise about the JVM's performance, but I never did like Java anyway.
We were surprised by Java's poor performance because the HotSpot JVM also uses a concurrent mark-and-sweep collector, and is famed for the amount of engineering effort that has gone into it. We suspect our benchmark could be improved by someone with more high-performance JVM experience.
The Java GC does not optimise for low latency exclusively; it tries to strike a balance between many different factors which your analysis entirely ignores! In particular it is compacting (Go's GC is not), which is very useful for long-term stability because you can't get deaths due to heap fragmentation; it tries to collect large amounts of garbage quickly, which Go's GC doesn't try to do (it's pure mark/sweep, not generational); and it tries to be configurable, which Go's GC doesn't care about. For many apps you do tend to want these things, they aren't pointless.
This is especially true because for most servers a 100msec pause time is just fine. 8msec is the kind of thing you need if you're doing a 60fps video game but for most servers it's overkill and they'd prefer to bank the performance.
To put this in perspective, Google is having problems migrating to G1 because even though it gives lower and more predictable pause latencies, it slows their servers down by 10%. A 10% throughput loss is unacceptable at their scale (it translates to a 10% increase in Java server costs), and they want G1 to become even more configurable to let them pick faster execution at the cost of longer pauses.
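For what it's worth, part of that knob already exists: G1 accepts a pause-time goal, and raising it lets the collector trade pause length for throughput. A minimal illustration (the 100 ms target and the jar name are placeholder values, not Google's settings):

    java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar server.jar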
> This is especially true because for most servers a 100msec pause time is just fine.
I disagree (a bit). A 100 msec pause is perceptible by a human, and can lead a user to leave your website or your app, especially if the pauses are compounded across several serialized remote procedure calls.
The only thing you need to do nowadays is set "-Xmx__g -Xms__g" to configure the max and min heap size (replace __ with a number).
There is only one rule to follow: The min and max heap must be the EXACT SAME number. (Basically, the heap must be fixed size).
If they're not, the GC will resize the heap during program execution, which is a very expensive process that will freeze the application and ruin performance.
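For example, assuming you've measured that your service needs a 4 GB heap (the size and jar name below are placeholders, not a recommendation), the launch line would look something like:

    java -Xms4g -Xmx4g -jar your-server.jar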
That comment is terrible advice. "-Xmx800m" then "-Xmx400m". He sets the max heap without setting the min, plus he runs the test for only a very short time, not enough to stabilize the heap or the optimizer. Bad bad bad.
> There is only one rule to follow: The min and max heap must be the EXACT SAME number. (Basically, the heap must be fixed size).
> If they're not, the GC will resize the heap during program execution, which is a very expensive process that will freeze the application and ruin performance.
I doubt this is true of the JVM GC. Adaptive heap size tuning is also in V8 and it doesn't require an expensive operation like a full GC to accomplish (e.g. the whole heap is not copied during this operation). The GC can simply add more pages of memory after a minor GC and promote some of the objects from the young gen to the old gen. So it doesn't have to be more expensive than a young gen (minor) GC.
It is true. One of the reddit comments says they divided the latency by 5 after they reconfigured the min and max heap properly and let the program run for longer.
The GC is in a constant battle to keep the program running within memory bounds.
If you make it think that the program can run in 20MB by not giving it a setting, it will try to fit the program in 20MB and clean the heap all the time (till it gives up and grows the heap).
The GC needs to know its bounds so it doesn't bother running when not necessary AND so it can decide how much space is available to keep short/long-lived objects.
The HotSpot GC does compaction, while Go's doesn't IIRC. Compaction is the most difficult part to make concurrent. If you're just measuring pause times, then HotSpot will be at a disadvantage, as it's doing more.
Here is what Ian Lance Taylor has to say about fragmentation[1]:
"Heap fragmentation is not normally a concern with the current Go runtime. The heap is implemented to that large pages are divided up into blocks that are all the same size. This means that the usual definition of fragmentation--a small block prevents coalescing of large blocks--can not occur. "
That's not any different than most C malloc implementations, it certainly helps with fragmentation but doesn't eliminate it as a concern. You can still have a single allocation in a block that'll hold the whole block hostage.
I'm used to systems that vary from 8-512MB of total system RAM with no virtual memory. Techniques that work well on the desktop/server explode spectacularly when brought to our platforms.
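A toy Go sketch of the "held hostage" case (sizes and the survival ratio are arbitrary, and the exact numbers depend on the runtime version): most objects are dropped, but the spans that still contain a scattered survivor stay in use, so the gap between live bytes and in-use spans is the space being held hostage.

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        const n = 1 << 20
        objs := make([]*[64]byte, n)
        for i := range objs {
            objs[i] = new([64]byte)
        }
        // Keep only a scattered 1/128th of the objects alive.
        for i := range objs {
            if i%128 != 0 {
                objs[i] = nil
            }
        }
        runtime.GC()

        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("live (HeapAlloc): %d KiB, spans in use (HeapInuse): %d KiB\n",
            m.HeapAlloc/1024, m.HeapInuse/1024)

        runtime.KeepAlive(objs)
    }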
I believe that fragmentation isn't such a problem in Go programs because they make less use of the heap than Java programs. I can't remember where I read this though.
I don't know much about the internals of Java, but I know that Go will stack allocate any kind of object if the compiler can prove it doesn't escape, and that in Go lots of things (not just native types) are passed by value on the stack instead of by reference to a heap value.
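A tiny sketch of what I mean (names are made up; `go build -gcflags=-m` prints the compiler's escape-analysis decisions):

    // sum's Point argument is passed by value and can stay on the stack;
    // leak returns a pointer to a local, so that value escapes to the heap.
    package main

    import "fmt"

    type Point struct{ X, Y int }

    func sum(p Point) int { // by value: no heap allocation needed
        return p.X + p.Y
    }

    func leak() *Point { // the local escapes via the returned pointer
        p := Point{X: 1, Y: 2}
        return &p
    }

    func main() {
        fmt.Println(sum(Point{X: 3, Y: 4}))
        fmt.Println(leak().X)
    }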
Sure, I guess I'm just floored that you'd have a system that doesn't compact and has no manual mechanisms for allocation.
I guess it's just a different class of problems that I'm used to dealing with. Fragmentation is something I've seen many times and it usually likes to rear its head at the worst possible moment.
Isn't GC parameter tuning equivalent to tuning model parameters and hence susceptible to overfitting? As soon as you patch your service, the parameters may need to change.
Not only GC parameter tuning, _all_ parameter tuning. In many cases, technologies have matured enough so that you can build a mental model of them that fits reality both now and in a couple of years time, but that is never guaranteed. For example, in the small:
- on PowerPC, for at least some CPUs, it could be worthwhile to use floating-point variables to iterate over an integer range because that kept the integer pipeline free for the actual work. Change the CPU, and you have to change the type of your loop variable.
In the large:
- the optimal code for lots of data-crunching tasks is hugely dependent on the sizes and relative speeds of the caches, main memory, and disk.
- if you swap in another C library, performance (for example of scanf or transcendental functions) can be hugely different.
- upgrading the OS may significantly change the relative timings of thread and process creation, changing the best way to solve your problem from multi-process to multi-threaded or vice versa.
So yes, GC parameter tuning is a black art that, ideally, should eventually go away, but the difference with other technologies is only gradual, and, at least, you can tune it without recompiling your code.
I also think GC is getting mature enough for fairly reliable mental models to form. The throughput/latency tradeoff that this article mentions is an important aspect; minimizing total memory usage may be another.
He also mentioned that the JVM GC does a lot of online tuning, so the max pause times may drop over a longer run of the program. This is similar to the Racket GC, where the maximum pauses are >100ms at the start of the run, but converge to around 20ms as the program continues to run.
It would be nice to run the benchmarks for a longer period of time, and only measure max pause times once this "ramp up" period is over.
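A rough sketch of how that could look on the Go side (the allocation pattern and the 30 s / 60 s windows are assumptions, not the article's actual benchmark; runtime/debug only keeps the most recent pauses, so a real harness would sample more often):

    package main

    import (
        "fmt"
        "runtime/debug"
        "time"
    )

    // churn keeps a sliding window of allocations alive to generate steady garbage.
    func churn(stop <-chan struct{}) {
        var live [][]byte
        for {
            select {
            case <-stop:
                return
            default:
                live = append(live, make([]byte, 1<<10))
                if len(live) > 1<<16 {
                    live = live[1:]
                }
            }
        }
    }

    // maxPauseSince reports the longest GC pause that ended after start.
    func maxPauseSince(start time.Time) time.Duration {
        var s debug.GCStats
        debug.ReadGCStats(&s)
        var longest time.Duration
        for i, end := range s.PauseEnd {
            if i < len(s.Pause) && end.After(start) && s.Pause[i] > longest {
                longest = s.Pause[i]
            }
        }
        return longest
    }

    func main() {
        stop := make(chan struct{})
        go churn(stop)

        time.Sleep(30 * time.Second) // warm-up: pauses in this window are ignored
        warmedUp := time.Now()

        time.Sleep(60 * time.Second) // measurement window
        close(stop)

        fmt.Println("max pause after warm-up:", maxPauseSince(warmedUp))
    }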
The need to tweak parameters is a big weakness in Java's GC. If Go's GC now performs as well as or better "out of the box" than Java's does after expert tweaking, that's fantastic.
Yeah it takes a few thousand calls for the JVM to "warm up". It's surprisingly noticeable on speed and latency, but I'm less knowledgeable about the GC.
I think we need to modify the benchmark to give enough time for the RTS to "warm up" before timing max pause times. This will provide a fairer comparison, since the end goal is to evaluate languages for web servers with large heaps and low latency requirements.
You'd probably use load balancing to send some traffic to new images while they warm up. At lower volumes, you shouldn't be hitting GC pauses mid request. Then after a certain point you point everything to the new version. This is good for QA too.