Though, to quibble with the article's title: while a straightforward addition of more memcached nodes did not solve their capacity problem, the strategy of replicating nodes and load-balancing read requests between them does appear to solve this particular capacity problem; so more machines = more capacity, so long as they are organized appropriately.
Being a little smarter on the client side can help a lot here.
If you store and retrieve an object only on server # (object.id % numservers), you don't face this issue. (Obviously you could use a better hash than %.)
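A minimal sketch of that client-side sharding, with a made-up server list and a plain modulo for illustration (not anything from the article):

    servers = ["cache1:11211", "cache2:11211", "cache3:11211"]

    def server_for(object_id, servers):
        # Every client computes the same id-to-server mapping, so reads and
        # writes for a given object always hit the same node.
        return servers[object_id % len(servers)]

    print(server_for(42, servers))  # "cache1:11211", since 42 % 3 == 0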
This would help to distribute requests for individual objects, but doesn't help for multi-get requests. The problem is that adding more nodes increases the number of actual requests a multi-get call needs to make (assuming you're asking for enough objects that they're likely to be distributed across all nodes). This decreases the number of keys requested per node but increases the total number of requests to the cluster. Because the bound was on request throughput (CPU-bound), minimizing the number of keys per request to a node doesn't help.
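A rough sketch of why: a sharding client groups the requested keys by the server they hash to and issues one request per server touched, so each multi-get costs as many requests as there are servers, even as each request carries fewer keys. (The modulo scheme here is the same toy example as above.)

    from collections import defaultdict

    def multiget_fanout(object_ids, num_servers):
        # Group requested ids by the server they hash to; the client then
        # issues one network request per server in this mapping.
        by_server = defaultdict(list)
        for oid in object_ids:
            by_server[oid % num_servers].append(oid)
        return by_server

    friend_ids = list(range(100))          # one multi-get for 100 friends
    for n in (2, 3, 10):
        fanout = multiget_fanout(friend_ids, n)
        # With enough keys every server gets touched: the multi-get now costs
        # n requests, each carrying roughly 100 / n keys.
        print(n, len(fanout), [len(keys) for keys in fanout.values()])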
The proposed solution is to instead replicate nodes and load balance read requests. With two replicas this doubles your read capacity, though you must write twice (or N times, depending on your replication level).
It's worth noting that this solution only works if you have far fewer writes than reads, but that's probably true for most people. A really simple way of doing this that all memcached clients should be able to handle is to set up two separate memcached clusters, write to both, and randomly read from either of them. That way you don't replicate nodes, you replicate the entire cluster. :-)
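A minimal sketch of that two-cluster approach, assuming two ordinary memcached client objects; the method names (set, get_multi) stand in for whatever your client library actually exposes:

    import random

    class MirroredCache(object):
        """Write to both clusters, read from a randomly chosen one."""

        def __init__(self, cluster_a, cluster_b):
            self.clusters = [cluster_a, cluster_b]

        def set(self, key, value):
            # Every write is duplicated, so write cost doubles...
            for cluster in self.clusters:
                cluster.set(key, value)

        def get_multi(self, keys):
            # ...but each read lands on only one mirror, so read capacity doubles.
            return random.choice(self.clusters).get_multi(keys)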
A better solution would be to consolidate your items, i.e. make sure that items that often are fetched together with multi-get end up on the same server, however this requires a lot of extra knowledge about your data, and it's not very likely that this knowledge is available.
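If that knowledge were available, a hedged sketch of what consolidation could look like is to shard on some grouping attribute (a hypothetical group_id, e.g. the user whose friends these are) instead of on the item's own id:

    import zlib

    def server_for_group(group_id, servers):
        # Items sharing a group_id land on one server, so a multi-get for that
        # group becomes a single request to a single node. crc32 is just a
        # stable hash for illustration; any consistent hash would do.
        return servers[zlib.crc32(str(group_id).encode()) % len(servers)]

The trade-off is that a hot group concentrates its entire load on one node.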
> It's worth noting that this solution only works if you have far fewer writes than reads, but that's probably true for most people.
Is it true of tweets? How about e-mail? What Facebook content works that way? (IIRC, in the discussion of their new image store, one of the motivations was that most pictures were never viewed.)
Well, this is indeed the underlying logic. What the author describes is seeing 50 requests for 100 friends each, where these friends might be divided across 3 servers. The keys are hashed to servers, i.e., if there is only 1 multi-get request, each server sees only 1 request. However, if there are 50 multi-get requests, each server sees 50 requests irrespective of the number of keys (friends) it caters to.
This is the part of the article I really didn't understand.
The way multiget works is that you send a list of keys to a server, and get back a list of items corresponding to those keys. So if you have 100 items on two servers, you would send 50 keys to Server #1 and 50 keys to Server #2 (assuming they're evenly distributed) and get back 50 items from each.
And then the article suddenly talks about adding another server and having 33 items on each, and for some reason sending 50 requests to the two original servers. This makes no sense; of course 50 requests consume more resources than the one request in the previous example?
However, if we back up a bit, add another server and redistribute the items, then there will be 33 on each server. If we do a multi-get for these items, we will make three requests; one for each server, put 33 keys in each request, and get back 33 items in each response.
The point I think he should have made is that in this case you've not reduced the load on each server, only the amount of data sent in each request. Since your two original servers were CPU-bound, you haven't gained any performance; they're still choked, still seeing the same number of requests.
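Putting the thread's numbers on that (the 50 multi-gets of 100 keys are the figures quoted above, not measurements):

    multigets = 50    # concurrent multi-get requests
    keys_each = 100   # friends fetched per multi-get

    for num_servers in (2, 3):
        requests_per_server = multigets              # each multi-get touches every server
        keys_per_request = keys_each // num_servers  # only the payload shrinks
        print(num_servers, requests_per_server, keys_per_request)
    # 2 servers: 50 requests per server, ~50 keys per request
    # 3 servers: still 50 requests per server, ~33 keys per request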
Then again, you need a crazy number of requests, and a crazy amount of bandwidth, before you'll start seeing memcached servers become CPU-bound, so for the vast majority of us this problem is not something we'll ever see in real life. :-)
No no, I think he speaks of 50 requests from the get-go, i.e., 50 requests for 100 friends each, irrespective of the number of servers. The figure of 50 requests is unrelated to the number of friends; it's just some large number of requests. He claims that no matter how you divide the data, if a certain number of requests creates a CPU bottleneck, it will continue to do so no matter how much memory you add.