A Fast, Minimal Memory, Consistent Hash Algorithm (arxiv.org)
129 points by brandonb on March 15, 2015 | 26 comments



Woah, this jumps out at me immediately since Eric Veach's name is on it. I haven't seen a (relatively) new Eric Veach paper in a long long time.

Some background: Eric Veach is huge in the computer graphics world for laying a ton of the foundations of modern physically based rendering in his PhD thesis [1]. He then went on to work for Pixar and did a ton of work on RenderMan (for which he recently got an Academy Award), and then in the early 2000s left Pixar to go work for Google, where he was the lead on developing AdWords [2]. In short, he's had quite a career, and seeing a new paper from him is always interesting.

[1] https://graphics.stanford.edu/papers/veach_thesis/
[2] http://www.bloomberg.com/bw/stories/2006-03-05/the-secret-to...


Slight correction: his sci-tech award was for MIS and "efficient Monte Carlo path tracing for image synthesis", which have only been in PRMan for the last two releases (PRMan 18 added geo area lights with MIS evaluation, and 19 added pure path tracing). So his award wasn't really for his work at Pixar, as that work happened 12+ years before the techniques the award recognized made it into PRMan.


Smaller standard error, significantly less memory, faster execution, equivalent rebalance cost... what's not to like?

Thanks, Google:

> Google has not applied for patent protection for this algorithm, and, as of this writing, has no plans to. Rather, it wishes to contribute this algorithm to the community.


> Google has not applied for patent protection for this algorithm, and, as of this writing, has no plans to. Rather, it wishes to contribute this algorithm to the community.

What they should do (IANAL) is patent it and license it properly. This way it's protected (patented), but the intent is clear and future-proof instead of leaving it in an ambiguous legal gray area.


> This way it's protected (patented)

Protected from what?


Protected from me patenting it and then suing small companies that use it.


You can't patent something that someone else published.


funny joke ;)


... any more than you can patent something which has already been patented. Both require a lawsuit to challenge.


Except that's useless. Realistically you'd need an explicit patent grant, because right now Google can change its stance on patenting it at any moment and make your code infringe.


Not quite useless. The statement could prevent Google from later enforcing a patent against anyone who'd relied on that statement, via the doctrine of estoppel.

And Google can't quite change their mind "at any moment" – more than one year after the first public offer/description of the technique, under US law, it won't be patentable. So by June 9, 2015, if they haven't patented it by then, they won't be able to.

(It's likely already too late to patent in other jurisdictions, that don't have such a long public-description grace-period.)


I was taught that patents can't be acquired on discoveries that have already been made public. If that's the case, Google's publication of this method before attempting to patent it makes it unpatentable.


According to http://en.wikipedia.org/wiki/Public_disclosure there is a 1 year period before it becomes unpatentable.

However it does seem unlikely that there will be a patent. :-)


There is a 1 year grace period between disclosure and the deadline for patentability.

http://www.uspto.gov/web/offices/pac/mpep/s2133.html


Grace periods are shorter or non-existent outside the US. The publication date is June 9 of last year, so Google has less than 3 months to change its mind, turn evil, and pursue a patent on this technique.


That's a nifty algorithm. Sillysaurus3's wonderful explanation from last year helped me figure out what's happening. Mostly, at least (my fault, not his):

https://news.ycombinator.com/item?id=8140385

It feels like there should be some faster way of doing the jump calculation that doesn't require the floating-point divide. It's a relatively slow, expensive operation, and doing ln(n) of them on each bucket lookup adds up when you have lots of buckets.
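
For context, this is the complete function from the paper (reproduced from memory, so check it against the PDF); the divide in question is the double(1LL << 31) / double((key >> 33) + 1) step:

    int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
        int64_t b = -1, j = 0;
        while (j < num_buckets) {
            b = j;
            // One step of a 64-bit linear congruential generator.
            key = key * 2862933555777941757ULL + 1;
            // The floating-point divide: turns the top bits of the LCG
            // state into a value in (0, 1] used to pick the next jump.
            j = (b + 1) * (double(1LL << 31) / double((key >> 33) + 1));
        }
        return b;
    }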


His explanation has mistakes. For example, in step 1 he claims "After this multiplication, the first 4 bytes and the last 4 bytes of 'key' are nonzero"

Take any 8-byte value b and multiply it by 16133697096952638549ULL (mod 2^64) to get a value for the key. Now key * 2862933555777941757ULL = b. Thus for any target value there is an input key that produces it, so the first and last 4 bytes are certainly not always nonzero.

He then uses "The fact that the first 4 bytes become nonzero is important for the next step" and proceeds. This is wrong.

For a concrete example, suppose key is 16133697096952638549ULL. Then key * 2862933555777941757ULL = 1. The first 7 bytes (actually, the first 63 bits) are all zero. Adding 1 as in the code still leaves the first 62 bits zero.

(the value 16133697096952638549ULL is the multiplicative inverse of the constant in the algorithm mod 2^64).
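
A quick self-contained check of that inverse claim (the constant names are mine):

    #include <cassert>
    #include <cstdint>

    int main() {
        const uint64_t kMult = 2862933555777941757ULL;  // multiplier from the paper
        const uint64_t kInv  = 16133697096952638549ULL; // claimed inverse mod 2^64
        uint64_t key = kInv;
        // Unsigned multiplication wraps mod 2^64, so the product is exactly 1.
        assert(key * kMult == 1);
        // The "+ 1" step from the algorithm yields 2: top 62 bits still zero.
        assert(key * kMult + 1 == 2);
        return 0;
    }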


This has already been discussed here:

https://news.ycombinator.com/item?id=8136408

It's been in Guava for a few years now.


HRW is worth looking into as well if you have a smaller number of buckets: http://en.wikipedia.org/wiki/Rendezvous_hashing
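
A rough sketch of the HRW idea (the mixing hash is my own illustrative choice, not part of HRW itself): score every bucket for each key and pick the highest score, so removing a bucket only moves the keys that bucket was winning.

    #include <cstdint>
    #include <vector>

    // Illustrative 64-bit mixer (splitmix64's finalizer); any decent
    // hash of the (key, bucket) pair works for HRW.
    uint64_t mix64(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    // Return the index of the bucket with the highest score for this key.
    // O(n) per lookup, which is why it suits smaller numbers of buckets.
    size_t rendezvous(uint64_t key, const std::vector<uint64_t>& bucket_ids) {
        size_t best = 0;
        uint64_t best_score = 0;
        for (size_t i = 0; i < bucket_ids.size(); ++i) {
            uint64_t score = mix64(key ^ mix64(bucket_ids[i]));
            if (i == 0 || score > best_score) {
                best_score = score;
                best = i;
            }
        }
        return best;
    }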


How revolutionary is this? Will it replace all preceding hash algos, or does it have narrow usefulness?


It's useful in the field of consistent hashing: mapping items onto shards when the number of shards varies over time. (Only some systems do this, and most hash algorithms are used in other fields.)

This technique also requires that the shards be assigned contiguous integer IDs starting at 0, and it appears to work best if all shard entry/exit is intentional and happens at the end of the ID range. (That is: if you have 15 shards, they're named #0-#14. If you decide you need one less, you turn off shard #14.)

It doesn't appear to work well for systems with more chaotic entry/exit. (Shard #4 just crashed and you don't need it anymore? You'd still want to restart/fill one shard as #4, then shut down/drain shard #14. Other consistent hashes could just remap #4's old duties among the remaining nodes, albeit with their higher space/time costs.)
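
A sketch of that "shrink from the end" property, assuming the paper's JumpConsistentHash function (shown upthread) is in scope: going from 15 buckets to 14 only remaps keys that hashed to #14.

    #include <cassert>
    #include <cstdint>

    int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets);

    void check_shrink() {
        for (uint64_t k = 0; k < 100000; ++k) {
            int32_t before = JumpConsistentHash(k, 15);
            int32_t after  = JumpConsistentHash(k, 14);
            // A key moves only if it was in the removed bucket, #14;
            // everything else stays put.
            assert(after == before || before == 14);
        }
    }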


Would something like a network packet queueing algorithm, which hashes each stream and iterates through them (e.g., Stochastic Fair Queueing), benefit from this hashing algorithm?


Do those algorithms already make use of consistent hashing? If so, yes, but otherwise: probably not.


This is a nice algorithm. I had been thinking about writing something that depends only on numbers rather than on allocating nodes on a ring.

It works well when adding nodes, but how does it work when removing a node in the middle? When allocating nodes on a ring it's easy: just reassign the vnodes. But what about this algorithm? Can anybody explain how it keeps track of which nodes take over which parts of the removed node's range?


The paper was really interesting to read, it motivated me to implement it in Go: https://github.com/beefsack/go-jch


Same here :) I even did a blog post about it: http://www.creapptives.com/post/113786047187/improving-shard...



