Golang’s Real-Time GC in Theory and Practice (pusher.com)
203 points by signa11 on Dec 2, 2016 | 66 comments



The weak generational hypothesis states that most objects are short-lived. Therefore, concentrating on collecting the youngest objects yields the most garbage collected per CPU cycle spent.

In the benchmark, all objects survive the same amount of time and it is the oldest objects that become garbage first. It is the opposite of the behavior the weak generational hypothesis presumes. So collectors optimized for that scenario are at a disadvantage. :)

For example, I ported the benchmark to Python 3 and got a max pause time of 0.05 ms. That's 40 times faster than OCaml but it doesn't mean Python is faster in general. It just "got lucky" because ref counting is an almost optimal fit for the benchmark. There is no strategy that fits all when it comes to memory management. :)


Reference counting is great. I used to be skeptical but I'm a complete convert after using Objective-C.

The old myth about GC was that it was slow. Nowadays that's usually false, but what is true is that it wastes a lot of memory, or has long pauses, or both.

The old myth about reference counting, on the other hand, is that uncollected cycles are a problem, but in practice they're usually not that hard to find and fix.

Reference counting is a bit slower, but it doesn't waste memory and doesn't pause. So it's great for throughput. And throughput is very often more important than overall performance!

For anything with a UI, you want maximum throughput. For anything on the server side that's realtime, or even indirectly user-facing, you want maximum throughput. The only time you want performance over throughput is batch jobs, which I reckon crop up less often than you expect.

(Edit to add: on memory-constrained devices, you might not want the memory overhead of GC even for batch jobs.)


Reference counting does not scale to multi-threaded programs well. Keeping reference counts up to date across multiple threads is maddeningly and frighteningly hard to do efficiently. By the time you correctly solve the concurrent update problem for essentially every access the program does, you have a very complicated memory management system that incurs overhead on every single operation. Compare that with a GC that imposes overhead (usually) only on writes to objects, not "writes" to the stack or local variables going out of scope. Reference counting also incurs overhead for reading from objects. Empirical studies have shown that reads usually outnumber writes 10 to 1. Why would you consider it to be just a _bit_ slower?

And of course reference counting pauses. Killing one object (its count dropping to zero) may cause a chain reaction of objects whose reference counts drop to zero and need to be cleaned up. The difference is that you now suffer work proportional to the _dead_ data of the program. In comparison, a tracing GC does work proportional to the _live_ objects.


> Reference counting does not scale to multi-threaded programs well. Keeping reference counts up to date across multiple threads is maddeningly and frighteningly hard to do efficiently.

Is this not just a matter of using an atomic reference count?
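
Something like this, roughly? (A Go-ish sketch; the names are made up, but the point is that every pointer copy becomes an atomic add and every drop an atomic subtract.)

    package rc

    import "sync/atomic"

    // Hypothetical refcounted object. Every copy of a pointer to it
    // pays for an atomic add; every drop pays for an atomic subtract.
    type obj struct {
        refs int32
        data []byte
    }

    func retain(o *obj) { atomic.AddInt32(&o.refs, 1) }

    func release(o *obj) {
        if atomic.AddInt32(&o.refs, -1) == 0 {
            o.data = nil // "free" the payload once the last reference is gone
        }
    }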


Yes, but atomics are pretty slow.


I guess, but that doesn't seem that bad.

> maddeningly and frighteningly hard to do efficiently.

It definitely doesn't seem maddening or frightening. I'm not saying a GC isn't strictly better (because I'm unsure either way), but ARC doesn't seem like some awful thing.


> Killing one object (its count dropping to zero) may cause a chain reaction of objects whose reference counts drop to zero and need to be cleaned up.

That chain is trivial to interrupt and resume later. Compare that to interrupting any mark based GC.
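
For example (a hand-wavy Go sketch, not taken from any real collector): when a count hits zero you push the object on a queue, and a drain step frees at most N objects per call, so you can stop and resume wherever you like.

    package rc

    type node struct {
        refs     int
        children []*node
    }

    var toFree []*node

    func release(n *node) {
        n.refs--
        if n.refs == 0 {
            toFree = append(toFree, n) // defer the actual freeing
        }
    }

    // drain frees at most budget objects, then returns; call it again later.
    func drain(budget int) {
        for budget > 0 && len(toFree) > 0 {
            n := toFree[len(toFree)-1]
            toFree = toFree[:len(toFree)-1]
            for _, c := range n.children {
                release(c) // may queue more work, but never recurses deeply
            }
            budget--
        }
    }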


Great explanation! Thanks!


You probably mean minimal latency, not maximum throughput. GC generally has higher throughput at the cost of latency and higher peak memory usage. :)


I've seen it repeated many times that GC has higher latency, but is there a benchmark? Imagine writing a Mac app, once with refcounting, and once with GC, and seeing which one has the longer pause time.

I can think of cases where GC has better latency, like deallocating a huge linked list. That will introduce a higher latency with refcounting and with manual deletion (like C) compared to GC, where the main thread doesn't need to block while traversing the list. Or does such a case not happen much in practice?


You mainly notice it in mobile apps, since you want silky smooth 60fps scrolling there. At 60fps the whole frame budget is only about 16.7ms, so a pause longer than 5 or 10ms can easily cause you to drop a frame.


Hard to attribute that to GC.


Why hard? On Android, the GC will print a log entry telling you how long it took!

In my experience, aside from badly-written algorithms that just do too much work on the wrong thread, the main causes of momentary glitches are GC and I/O. Unfortunately you have limited control over those in application code (just using too much memory can cause some I/O churn). On mobile, the GPU drivers are often badly written and will give you random hiccups that can ultimately be traced back to I/O.

Not to turn this into an Apple vs Android battle, I hope, but I think this is an important point that is often overlooked -- Apple's hardware and software designs are really well-targeted to minimize those kinds of glitches. They don't use GC and they always use very fast (AKA expensive) flash memory. So whatever else their flaws, Apple devices tend to have smoothly-animated UIs.


Oops, you're right! I think I had in mind something like "guaranteed throughput" but "consistent low latency" is definitely a better way to put it.


A parallel GC's throughput is limited by how hard the parallel GC thread is working, as described in the article, which was the point.


But he's comparing to reference counting, where you pay a substantial throughput penalty round-tripping to RAM (especially if atomic) and don't get the benefit of a bump allocator.


This is an insightful comment. Thanks. I wonder how things would change if a second core was sitting idle anyway, like in a mobile app that does everything on the main thread. Would it then make sense to utilise the second core for greater performance? Imagine memory isn't a problem — some phones have plenty of memory, like 6GB on a OnePlus.

Unrelated to all the above, and related to your point about uncollected cycles not being a problem with refcounting, it does impose an overhead, like remembering to use weak pointers to delegates in Objective-C. It's not as seamless to the programmer as GC. You can argue that it is (not) worth it, but there's no question it's more work for the programmer.


Apple's phones tend to have less memory, and fewer (but faster) cores. Along with Objective-C's reference counting, I think those are very smart choices for typical mobile app workloads.

In principle having lots of cores available is great, but in practice it's hard to utilise them effectively without also using tons of memory and/or battery.


There are cycle-collecting reference counting algorithms that do a limited trace within a subset of the object graph to detect and remove cycles.

I am still not clear on why Apple didn't go that direction.

Reference counting will still potentially pause though, when freeing large graphs of objects.


Yeah, but the pause is completely deterministic and under your program's control.

Doing cycle collection basically ends up as hard as full GC right? Annotating with weak is not a big deal.


> Doing cycle collection basically ends up as hard as full GC right?

Not really. Yes you do need to have the ability to trace the object graph. But the actual algorithm isn't that bad.

I implemented this in C++ (for "fun" for a pet language VM i wrote) 15 years ago: http://researcher.watson.ibm.com/researcher/files/us-bacon/B...


That's interesting! I also implemented that algorithm for cycle collection (https://github.com/bjourne/c-examples/blob/master/libraries/...) and I found that, as the GP states, it wasn't worth it. The problem was that there were so many objects that were potentially part of garbage cycles that it required scanning too large a part of the heap.

Cycle detection took about 70% as long as a full mark-and-sweep collection, so I didn't think it was worth it considering how complicated the algorithm was. Is your implementation open source? If so, I'd love to see it and compare results.


I guess it's technically open source but it's old (15 year old C++) and based heavily on the pseudo-code in the paper and also tied heavily into the interpreter I was working on. I didn't do a lot of performance testing either.

The interpreter I was working on was for a language which was image-based and persistent, I guess much like a Smalltalk, etc. Reference counting was handy for aiding in determining when objects should be paged to/from the disk image and what should be in cache.

I think performance would depend a lot on the usage pattern of the object graph. My system also used a lot of immutable data. The only mutable structures were the objects that properties and methods hung off of, and hash tables.


> The pause is completely deterministic and under your program's control.

In theory. In practice, the programmer may not be able to reproduce the pause to diagnose and fix it, or it may not bubble up to the top of the priority list. Whereas with a GC, this problem may not occur at all in the first place.

The question is whether such cases (large object graph or linked list) occur often enough in practice to be worth bothering about.


In practical terms, are there even many cases where cycles are desirable? The only case I can think of is threads (two threads communicating with each other). Most cycles I have seen even in GC'd languages were bad designs.



> The weak generational hypothesis states that most objects are short-lived.

Doesn't this mostly apply to the classical OO style of coding that developed in Smalltalk and early Java? More recently, there have been styles of coding which emphasize pass by value in order to have better cache locality in generated code. The old style of "composition/organization by indirection" resulted in lots of followed pointers, resulting in more cache misses, resulting in slower execution.

> There is no strategy that fits all when it comes to memory management. :)

There is a nice overall strategy being followed by the Go community here. The emphasis in golang seems to be on pass by value when feasible, and keeping an eye out for reducing GC pressure. This is synergistic with GC optimized for concurrency and short pause times and also synergistic with good profiling tools for identifying the source of allocations.

This strategy makes it fairly easy to first make applications that work, then optimize them. The amount of mental effort is far less than that required to manage memory manually. The consequences for making a mistake are far less harsh, and correcting mistakes is fairly easy.
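
For example, a minimal sketch of the profiling side (the port number is arbitrary): expose net/http/pprof and look at the heap/allocation profile to see where the GC pressure is coming from.

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers
    )

    func main() {
        // then, e.g.:  go tool pprof http://localhost:6060/debug/pprof/heap
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }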


Well, it is a hypothesis and not a theorem, so it hasn't been proven. But the tendency of most objects to be short-lived is universal and doesn't vary depending on the programming language.

I believe you are thinking of objects which are allocated on the stack as not being objects. But they are just as much objects as objects allocated on the heap.

The difference is that stack-allocated objects have a very limited lifetime as they go out of scope when the procedure in which they are allocated returns. That makes allocating them extremely cheap. In fact, modern compilers perform an optimization called escape analysis to determine if the object allocated is only being used in the procedure. If so, it doesn't "escape" and can be allocated on the stack instead.
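
A rough Go illustration (the compiler's decision can be inspected with go build -gcflags=-m; the exact messages vary by version):

    package main

    type point struct{ x, y int }

    // The result is copied to the caller; p never escapes,
    // so it can live on the stack.
    func byValue() point {
        p := point{1, 2}
        return p
    }

    // The address of a local is returned, so escape analysis
    // moves p to the heap ("moved to heap: p" or similar).
    func byPointer() *point {
        p := point{1, 2}
        return &p
    }

    func main() {
        _ = byValue()
        _ = byPointer()
    }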

Go's value types mean the burden of choosing whether to allocate on the stack or on the heap falls on the programmer. Often the programmer does a much better job at deciding those things than the compiler, which is good for performance. But it adds one extra thing that the programmer has to think about.

It's a classic engineering trade-off between language complexity and performance. I prefer languages that sacrifice performance for simplicity, but there is nothing wrong with Go's choice either.


> I believe you are thinking of objects which are allocated on the stack as not being objects.

Of course they are still objects. However, those are objects that don't have to be collected, though they are still involved as "roots" or starting places for GC scan.

> The difference is that stack-allocated objects have a very limited lifetime as they go out of scope when the procedure in which they are allocated returns. That makes allocating them extremely cheap.

Which is the whole point of styles that emphasize pass by value and allocating on the stack. Thanks for repeating the basics for 3rd parties that might find it useful.

> Go's value types mean the burden of choosing whether to allocate on the stack or on the heap falls on the programmer. Often the programmer does a much better job at deciding those things than the compiler, which is good for performance. But it adds one extra thing that the programmer has to think about.

Good paraphrase of what I said. Thinking about those things as one codes is kind of optional, in that the program will still be correct and will still run. It's probably best left to a later optimization stage, though there will still be some burden on the programmer to avoid pathological performance. In any case, it's less burden than outright memory management.


From my understanding, the weak generational hypothesis holds for the majority of programs in general. In fact, it is particularly true of languages that encourage immutable data (because new data is created rather than existing data updated). I don't see why it would apply particularly to OO languages; in fact, I would assume that passing pointers around would result in fewer short-lived objects due to less copying.


> From my understanding, the weak generational hypothesis holds for the majority of programs in general.

But in Smalltalk-like languages, the GC has to deal with "all of it," including objects that were allocated "on the stack." Because in most cases, outside of optimized cases like SmallInteger, my understanding is that objects are allocated on the heap no matter what, and only the pointer reference is allocated on the stack.

It applied very well to Smalltalk. But there was a tendency to have objects for everything, and even a stack allocation meant a heap allocation.

> I don't see why it would apply particularly to OO languages; in fact, I would assume that passing pointers around would result in fewer short-lived objects due to less copying.

Composition facilitated by object references can result in styles where the granularity is small, so there are lots of objects. When every new user-defined Object means a new allocation on the heap, this means a lot of throughput for the GC.


That is the crux of the issue. I think the interesting thing about this benchmark is the degree to which different GCs assume the weak generational hypothesis to be true. For example, the Haskell/GHC GC seems to very much assume it, with its STW copying collector for all generations. On the other hand you have languages like OCaml and Go which still have a lot of pointer chasing to do in the old generation, but have optimisations (concurrent/incremental GC) which prevent the pauses getting too long in the case that the hypothesis is broken.

It's interesting to hear about the Python results. Ref counting makes absolute sense in this case!


Good writeup, with a surprise about OCaml. The authors welcome PRs for other languages. I look forward to seeing the results, and wonder if they will end up in a Techempower-style global brouhaha (which the Techempower folks have, I hasten to say, apparently handled with grace and integrity).



Thanks! If you're interested in OCaml's low-latency GC, there's a nice writeup about it by Jane Street: https://blogs.janestreet.com/building-a-lower-latency-gc/


If you have long-lived functionality in your Go program that needs buffers, it may just be best to allocate a pool of buffers and reuse them. Go does offer a basic primitive for this with sync.Pool.

I once wrote a high-throughput UDP listener that maxed out all the cores to deal with receiving some bursty UDP traffic. I realized that interacting with the memory allocator was my bottleneck, so I just moved to a pool of UDP buffers. My ability to deal with load spikes improved substantially. There's no reason not to use Pools in general if you're building a service with a predictable and repeatable object lifetime.
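
Roughly what that looks like (a simplified sketch, not the actual service; the buffer size and port are made up):

    package main

    import (
        "log"
        "net"
        "sync"
    )

    // Reuse fixed-size receive buffers instead of allocating one per packet.
    var bufPool = sync.Pool{
        New: func() interface{} { return make([]byte, 2048) },
    }

    func main() {
        conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 9999})
        if err != nil {
            log.Fatal(err)
        }
        for {
            buf := bufPool.Get().([]byte)
            n, _, err := conn.ReadFromUDP(buf)
            if err != nil {
                bufPool.Put(buf)
                continue
            }
            go func(b []byte, n int) {
                handle(b[:n])  // process the datagram
                bufPool.Put(b) // return the buffer for reuse
            }(buf, n)
        }
    }

    func handle(b []byte) { _ = b }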


I'm a little surprised it didn't seem that they considered a GC-less language like Rust. If your latency requirements are that strict & you're already considering a total rewrite, I would personally have been looking into a solution that would avoid GC entirely.


This is something we get asked a lot. The short version is that we had other requirements as well as a low latency. I described some of these here: https://www.reddit.com/r/programming/comments/5fyhjb/golangs...


Rust has a semantic overhead and isn't quite as mature as Go. Their requirements don't necessitate avoiding GC pauses entirely.


I disagree with your first point, at least from a personal perspective. I find Rust more expressive than Go, and slightly more productive.

You're right about maturity though. Hopefully the Rust library ecosystem continues to make strides.


I agree: I personally find Rust more expressive than Go. I don't think it has any more "semantic overhead" than any other language with a highly featured type system.

To tie it back to the article a bit, Rust also seems like a relatively natural choice for a team considering switching from Haskell if their reason for switching is performance. The niche Rust fills for me personally is that it has a very robust type system (putting it in a similar camp as Haskell) with an explicit focus on fast, performant code (which Haskell does not have).


A great writeup and I feel it's a welcome reminder of some important points:

- your programming language and runtime aren't equally good at everything and never will be
- you have to choose your environment based on your anticipated workload
- you have to understand what your anticipated workload actually is
- you can't just treat the garbage collector as a magic box that does the memory thing for you

Unless, of course, you're writing software which is nowhere near the performance boundaries of the system and never will be.

Can't say I share the author's surprise about the JVM's performance, but I never did like Java anyway.


We were surprised by Java's poor performance because the HotSpot JVM also uses a concurrent mark-and-sweep collector, and is famed for the amount of engineering effort that has gone into it. We suspect our benchmark could be improved by someone with more high-performance JVM experience.

[Edit] jcipar on Reddit made some promising progress by tuning the JVM GC parameters for our benchmark: https://www.reddit.com/r/programming/comments/5fyhjb/golangs...


This was a super insightful comment

The Java GC does not optimise for low latency exclusively; it tries to strike a balance between many different factors which your analysis entirely ignores! In particular it is compacting (Go's GC is not), which is very useful for long-term stability because you can't get deaths due to heap fragmentation. It also tries to collect large amounts of garbage quickly, which Go's GC doesn't try to do (it's pure mark/sweep, not generational), and it tries to be configurable, which Go's GC doesn't care about. For many apps you do tend to want these things; they aren't pointless.

This is especially true because for most servers a 100msec pause time is just fine. 8msec is the kind of thing you need if you're doing a 60fps video game but for most servers it's overkill and they'd prefer to bank the performance.

To put this in perspective, Google is having problems migrating to G1 because even though it gives lower and more predictable pause latencies, it slows their servers down by 10%. A 10% throughput loss is unacceptable at their scale (it translates to a 10% increase in Java server costs), and they want G1 to become even more configurable to let them pick faster execution at the cost of longer pauses.


> This is especially true because for most servers a 100msec pause time is just fine.

I disagree (a bit). A 100 msec pause is perceptible by a human, and can lead a user to leave your website or your app, especially if the pauses are compounded across several serialized remote procedure calls.


Here's a simple practical guide to tune the Java GC properly: http://stackoverflow.com/a/37448602/5994461

The only thing you need to do nowadays is set "-Xmx__g -Xms__g" to configure the min and max heap size (replace __ with a number).

There is only one rule to follow: The min and max heap must be the EXACT SAME number. (Basically, the heap must be fixed size).

If they're not, the GC will resize the heap during the program execution, which is a very expensive process that will freeze the application and ruin performance.
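
For example (the 8g is arbitrary; the point is just that -Xms and -Xmx match):

    java -Xms8g -Xmx8g -jar myservice.jar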

---

https://www.reddit.com/r/programming/comments/5fyhjb/golangs...

That comment is terrible advice. "-Xmx800m" then "-Xmx400m". He sets the max heap without setting the min, plus he runs the test for only a very short time, not enough to stabilize the heap or the optimizer. Bad, bad, bad.


> There is only one rule to follow: The min and max heap must be the EXACT SAME number. (Basically, the heap must be fixed size). If they're not, the GC will resize the heap during the program execution, which is a very expensive process that will freeze the application and ruin performance.

I doubt this is true of the JVM GC. Adaptive heap size tuning is also in V8 and it doesn't require an expensive operation like a full GC to accomplish (e.g. the whole heap is not copied during this operation). The GC can simply add more pages of memory after a minor GC and promote some of the objects from the young gen to the old gen. So it doesn't have to be more expensive than a young gen (minor) GC.


It is true. One of the Reddit comments states that they divided the latency by 5 after they reconfigured the min and max heap properly and let the program run for longer.

The GC is in a constant battle to keep the program running within memory bounds.

If you make it think that the program can run in 20MB by not giving a setting, it will try to fit the program in 20MB and clean the heap all the time (till it gives up and grows the heap).

The GC needs to know its bound so it doesn't bother running when not necessary AND so it can decide how much space is available to keep short/long-lived objects.


The HotSpot GC does compaction, while Go's doesn't IIRC. Compaction is the most difficult part to make concurrent. If you're just measuring pause times, then HotSpot will be at a disadvantage, as it's doing more.


Wait, Go doesn't compact?

Is it safe to assume that they have to deal with fragmentation or does Go solve that another way?


Here is what Ian Lance Taylor has to say about fragmentation[1]:

"Heap fragmentation is not normally a concern with the current Go runtime. The heap is implemented to that large pages are divided up into blocks that are all the same size. This means that the usual definition of fragmentation--a small block prevents coalescing of large blocks--can not occur. "

1. https://groups.google.com/d/msg/golang-nuts/Ahk-HunIqgs/1sOi...


That's not any different than most C malloc implementations, it certainly helps with fragmentation but doesn't eliminate it as a concern. You can still have a single allocation in a block that'll hold the whole block hostage.

I'm used to systems that vary from 8-512MB of total system RAM with no virtual memory. Techniques that work well on the desktop/server explode spectacularly when brought to our platforms.


I believe that fragmentation isn't such a problem in Go programs because they make less use of the heap than Java programs. I can't remember where I read this though.

I don't know much about the internals of Java, but I know that Go will stack allocate any kind of object if the compiler can prove it doesn't escape, and that in Go lots of things (not just native types) are passed by value on the stack instead of by reference to a heap value.


> I don't know much about the internals of Java, but I know that Go will stack allocate any kind of object if the compiler can prove it doesn't escape

This is also the case in HotSpot.


Sure, I guess I'm just floored that you'd have a system that doesn't compact and has no manual mechanisms for allocation.

I guess it's just a different class of problems that I'm used to dealing with. Fragmentation is something I've seen many times and it usually likes to rear its head at the worst possible moment.


> has no manual mechanisms for allocation

Golang supports value types for objects, so arrays (or something higher level like sync.Pool) work great for doing arena allocation.
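
A rough sketch of what that can look like (hypothetical names and sizes, no bounds checks): one slice of structs is a single allocation, and handing out indices into it acts as a crude arena/free list.

    package main

    // One heap allocation for the whole arena. Because particle contains
    // no pointers, the GC never has to scan its elements.
    type particle struct {
        x, y, vx, vy float64
        alive        bool
    }

    type arena struct {
        items []particle
        free  []int // indices of unused slots
    }

    func newArena(n int) *arena {
        a := &arena{items: make([]particle, n), free: make([]int, 0, n)}
        for i := n - 1; i >= 0; i-- {
            a.free = append(a.free, i)
        }
        return a
    }

    func (a *arena) alloc() int {
        i := a.free[len(a.free)-1] // panics if the arena is full; fine for a sketch
        a.free = a.free[:len(a.free)-1]
        a.items[i] = particle{alive: true}
        return i
    }

    func (a *arena) release(i int) {
        a.items[i].alive = false
        a.free = append(a.free, i)
    }

    func main() {
        a := newArena(1024)
        i := a.alloc()
        a.items[i].x = 3.0
        a.release(i)
    }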


Isn't GC parameter tuning equivalent to tuning model parameters and hence susceptible to overfitting? As soon as you patch your service, the parameters may need to change.


Not only GC parameter tuning, _all_ parameter tuning. In many cases, technologies have matured enough so that you can build a mental model of them that fits reality both now and in a couple of years time, but that is never guaranteed. For example, in the small:

- the performance of a bit shift used to be dependent on the number of bits shifted (https://wiki.neogeodev.org/index.php?title=68k_instructions_...). Timing of integer and floating point multiplication was dependent on argument values.

- on PowerPC, for at least some CPUs, it could be worthwhile to use floating point variables to iterate over an integer range because that kept the integer pipeline free for use doing the actual work. Change the CPU, and you have to change the type of your loop variable.

In the large:

- the optimal code for lots of data-crunching tasks is hugely dependent on the sizes and relative speeds of caches, main memory, and disk.

- if you swap in another C library, performance (for example of scanf or transcendental functions) can be hugely different.

- upgrading the OS may significantly change the relative timings of thread and process creation, changing the best way to solve your problem from multi-process to multi-threaded or vice versa.

So yes, GC parameter tuning is a black art that, ideally, should eventually go away, but the difference with other technologies is only gradual, and, at least, you can tune it without recompiling your code.

I also think GC is getting mature enough for fairly reliable mental models to form. The throughput/latency tradeoff that this article mentions is an important aspect; minimizing total memory usage may be another.


Another example: the pile of switches usually available in C and C++ compilers.


jcipar managed to get the max pause down to 22ms by tweaking parameters: https://www.reddit.com/r/programming/comments/5fyhjb/golangs...

He also mentioned that the JVM GC does a lot of online tuning, so the max pause times may drop over a longer run of the program. This is similar to the Racket GC, where the maximum pauses are >100ms at the start of the run, but converge to around 20ms as the program continues to run.

It would be nice to run the benchmarks for a longer period of time, and only measure max pause times once this "ramp up" period is over.


The need to tweak parameters is a big weakness in Java's GC. If Go's GC now performs as well as or better "out of the box" than Java's does after expert tweaking, that's fantastic.


Yeah it takes a few thousand calls for the JVM to "warm up". It's surprisingly noticeable on speed and latency, but I'm less knowledgeable about the GC.


I think we need to modify the benchmark to give enough time for the RTS to "warm up" before timing max pause times. This will provide a fairer comparison, since the end goal is to evaluate languages for web servers with large heaps and low latency requirements.


I mean having a short period of time with long latency still sucks though too, right? That's real customers getting real low-quality service.

Then maybe you're left doing complex things like running the server with fake traffic as a warming up period to get it to be fast enough.

Sure, it'll work, but it's a lot of extra hassle.


You'd probably use load balancing to send some traffic to new images while they warm up. At lower volumes, you shouldn't be hitting GC pauses mid request. Then after a certain point you point everything to the new version. This is good for QA too.


Depending on what you're doing, maybe you consider this complexity no big deal.

But I don't think, for example, a standard setup using AWS ELB and autoscaling would support this.



