fatcache - Memcache on SSD (github.com/twitter)
175 points by buttscicles on Feb 11, 2013 | 54 comments


> False positives from SHA-1 hash collisions are detected after object retrieval from the disk by comparison with the requested key.

I was curious why they would bother, but it seems this isn't quite accurate.

What happens is they first use 32 bits of the SHA-1 hash to find the hash bucket, then they scan the bucket for the full SHA-1 of the key. They do not check for actual SHA-1 collisions (rough sketch below).

edit: Also on the subject of hashes, the readme suggests switching to MD5 as a possible way to reduce entry size. That is unnecessary; SHA-1 can be truncated to whatever size you're comfortable with.
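
To make the lookup flow described above concrete, here's a rough sketch of the idea (hypothetical names and layout, not fatcache's actual code): the first 32 bits of the digest pick a bucket, the full 20-byte digest disambiguates entries within the bucket, and only after the SSD read does the requested key get compared.

    #include <stdint.h>
    #include <string.h>

    #define NBUCKETS (1u << 20)               /* power-of-two bucket count (assumed) */

    struct itemx {                            /* hypothetical in-memory index entry */
        uint8_t  sha1[20];                    /* full digest of the key */
        uint32_t sid;                         /* slab id on the SSD */
        uint32_t offset;                      /* offset within that slab */
        struct itemx *next;                   /* bucket chain */
    };

    struct itemx *index_lookup(struct itemx **buckets, const uint8_t sha1[20])
    {
        uint32_t prefix;
        memcpy(&prefix, sha1, sizeof(prefix));    /* first 32 bits pick the bucket */
        for (struct itemx *x = buckets[prefix & (NBUCKETS - 1)]; x != NULL; x = x->next) {
            if (memcmp(x->sha1, sha1, sizeof(x->sha1)) == 0) {
                return x;                         /* full-digest match */
            }
        }
        return NULL;    /* on a hit, the caller still compares the real key after the SSD read */
    }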


If you find this interesting, you might also find this article in Communications of the ACM interesting:

Michael Cornwell: Anatomy of a Solid-State Drive

Communications of the ACM, Vol. 55 No. 12, Pages 59-63

http://cacm.acm.org/magazines/2012/12/157869-anatomy-of-a-so...

Alternative link in case the first is paywalled for people not on a university campus: http://queue.acm.org/detail.cfm?id=2385276

It goes into quite some detail on how SSD storage works on a system level and how it differs from hard disks.


If you are willing to give up range scans and constrain yourself to fitting all keys in memory there is a lot you can do. See SILT (www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf) and Bitcask (downloads.basho.com/papers/bitcask-intro.pdf).

Since this is a cache I really dig skipping any kind of cleanup/compaction step for deleted/expired keys.

I played around with something similar, except as a K/V store, and the performance and density are pretty amazing. With a 64-byte key and a 1.5k value (compressed from 2k) I was getting 85k inserts/sec and several hundred thousand reads/sec on a quad-core Sandy Bridge i5 with a 128 GB Crucial M4 on SATA II.


Fitting all keys in memory is a pretty harsh constraint though. With your 64-byte keys and 1.5k values you're talking about a factor of roughly 20, so for a 1TB SSD you need about 64GB of RAM, which costs about as much as the SSD.
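
Rough arithmetic behind that factor of 20, assuming something like 90 bytes of in-memory index per entry (the 64-byte key plus pointers/offsets):

    1 TB / 1.5 KB          ≈ 7 x 10^8 entries on the SSD
    7 x 10^8 x ~90 bytes   ≈ 60+ GB of RAM just for the key index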


"so for a 1TB SSD you need 64GB RAM, which costs about as much as the SSD."

In my experience this is not true. A 1 TB SSD, even commodity hardware, is pretty expensive. And that's not even talking about high-performance SSDs like Fusion IO, where you're now talking about $10K+. Where do you see 64 GB of RAM costing that much?


Micron unveiled a 1TB SSD for $600, which is about what 4x16GB of ECC RAM costs...

https://www.computerworld.com/s/article/9235277/Micron_unvei...

Fusion IO is probably as expensive as RAM, so it only makes sense for apps that really need persistence, i.e. databases.


I wonder how much SATA II was the bottleneck; my guess is quite a lot. You should give your test another go with SATA III.


> False positives from SHA-1 hash collisions are detected after object retrieval from the disk by comparison with the requested key

Is that check really necessary?

To have a 1-in-a-trillion chance of an accidental SHA-1 collision they'd have to store about 1.7*10^18 keys, and the key index alone would require some 54,000 petabytes of RAM.
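
Back-of-the-envelope, using the birthday bound for a 160-bit hash and assuming ~32 bytes of index per key:

    p ≈ n^2 / (2 * 2^160)
    n ≈ sqrt(2 * 2^160 * 10^-12) ≈ 1.7 * 10^18 keys
    1.7 * 10^18 keys * 32 bytes ≈ 5.4 * 10^19 bytes ≈ 54,000 PB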


Comparing the sha-1 is much faster than reading the ssd, so it's almost free.

An accidental SHA-1 collision is probably not a problem, but in a few years [1] it may be possible to create SHA-1 collisions and use that as an attack. It looks difficult, but suppose that with the right string an attacker could retrieve the cached information of another user, for example sha1("joedoe:creditcard") = sha1("attacker:hc!?!=u?ee&f%g#jo").

I don't know if they are using hash randomization, because collisions could also be used (in a few years) for a DoS attack [2].

[1] http://www.schneier.com/blog/archives/2012/10/when_will_we_s...

[2] http://www.gossamer-threads.com/lists/python/dev/959026


Your example describes a preimage attack on SHA-1, not a collision attack. Even with a working collision attack you are probably still far away from taking "some.other.input" and creating a sha1("some.other.input") = sha1("johndoe:creditcard").

For instance MD5 collisions are really easy to create but for preimage attacks on MD5 there is still no better approach than just doing brute force.


Smells like a bad idea, I agree. You are more likely to hit a hardware or system error that causes your branch to be incorrectly executed than to see an accidental SHA-1 collision. And if you're worried about SHA-1 collisions, use SHA-2 or even SHA-3.


What are the odds of creating an intentional SHA-1 collision in a chosen plaintext situation?


Still hasn't been done.


Oversights now will ensure that we can't use this system when we have the capability of storing that many petabytes in RAM.

By then, maybe the founders of this system won't even be alive.


Curious, why not just use an mmapped file for the store a la varnish? If the file lived on the SSD you'd get in-memory caching for free from the OS.


They shape their reads and writes to work well with SSD characteristics. Randomly reading and writing to an SSD (whether directly or through mmap) would be slower and wear out the SSD more quickly.


Modern SSDs simply don't work like this. They all internally use some variant of log-structured storage, so that regardless of the user's write pattern a single continuous stream is generated, and only one method is needed to distribute modified pages across the available flash. This means an infinite loop that rewrites the first 128kb of the device with random data will eventually fill (most of) the underlying flash with random data (128kb because that's a common erase block size).


Write patterns still matter until you're talking about like megablock granularity. Mmap will swap out pages at random (relative to disk layout) and page granularity is far smaller than a megablock. It's certainly possible for controllers to handle this properly, and I don't want to tell you that they never will, but even the very expensive PCI-E flash we use at fb demonstrated this "bad behavior".


Are there standard practices for securely erasing any random SSD without having to look up its implementation details? Or is this the sort of thing you just use a shredder for?


Encrypt it and store the key anywhere except on the drive. To erase, simply destroy the key. Many motherboards come with a tamper-proof key storage device you can reset on command (the TPM). There's a SATA secure erase command, but it's been shown that multiple vendors have managed to botch its implementation. So if you can't make the encryption approach work, the shredder is probably still your best bet.


Standard practice in government and large enterprise is still physical destruction, for exactly the reason you mention.

http://www.monomachines.com/shop/intimus-crypto-1000-hard-dr... or you can get a service to come out and do it on site.


Does it make any difference for append-only writes vs. in-place modifications?


Can you cite anything for this? Wear leveling is not the same as log-structured storage.


Indirection in a log-structured form is the best way to increase write IOPS and optimize write amplification. More sophisticated SSDs actually have multiple log heads, for data with different life-cycle properties.

You get a write amp of 1 until the drive is filled the first time. After that, it's a function of:

1) how full the drive is (from the drive's point of view; this is why TRIM was invented)
2) the over-provisioning factor
3) usage patterns, such as how much static data there is
4) how good the SSD's algorithms are
5) other (should be) minor factors, such as wear leveling
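
For anyone unfamiliar with the term, write amplification is just the ratio of what the flash actually has to write to what the host asked it to write, so 1.0 is the ideal:

    write_amp = bytes written to NAND / bytes written by host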

Source: I used to be an SSD architect.


Just to further illustrate your point, the relevant section from the readme file:

    There have been attempts to use an SSD as a swap layer to implement SSD-backed
    memory. This method degrades write performance and SSD lifetime with many small,
    random writes. Similar issues occur when an SSD is simply mmaped.
    
    To minimize the number of small, random writes, fatcache treats the SSD as a
    log-structured object store. All writes are aggregated in memory and written to
    the end of the circular log in batches - usually multiples of 1 MB.
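
A minimal sketch of that batching idea (hypothetical names and sizes, not fatcache's actual code): writes accumulate in a RAM buffer and only hit the SSD in large sequential chunks.

    #include <string.h>
    #include <unistd.h>

    #define BATCH_SIZE (1 << 20)                       /* flush in ~1 MB chunks (assumed) */

    struct wbuf {
        int    fd;                                     /* SSD device/file, written sequentially */
        size_t used;
        char   data[BATCH_SIZE];
    };

    /* Append an item to the in-memory buffer; flush it as one big write when full. */
    static int wbuf_append(struct wbuf *b, const void *item, size_t len)
    {
        if (len > BATCH_SIZE) {
            return -1;                                 /* oversized items handled elsewhere */
        }
        if (b->used + len > BATCH_SIZE) {
            if (write(b->fd, b->data, b->used) < 0) {  /* one large sequential write */
                return -1;
            }
            b->used = 0;
        }
        memcpy(b->data + b->used, item, len);
        b->used += len;
        return 0;
    }

fatcache itself aggregates into slab-sized buffers and appends them to a circular log, per the README, but the batching principle is the same.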


It would be trivial to batch writes to the mmap'ed region. Reads would still benefit from OS caching.


How do you batch writes to an mmap'ed region?


    ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* do stuff with ptr */
    msync(ptr, len, MS_ASYNC);   /* or MS_SYNC to block until writeback finishes */
No sane OS will pay attention to an mmapped region while it isn't under memory pressure, so dirty pages are effectively buffered until you explicitly tell the OS to start writeback.


Linux flushes dirty pages in mmapped regions to disk every 30 seconds by default whether you like it or not (see vm.dirty_expire_centisecs). Unlike FreeBSD, Linux doesn't yet have support for the MAP_NOSYNC flag to mmap(2).
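
For reference, these are the knobs involved, in sysctl.conf syntax with what I believe are the stock defaults:

    vm.dirty_expire_centisecs = 3000       # dirty pages older than ~30 s become eligible for writeback
    vm.dirty_writeback_centisecs = 500     # flusher threads wake up roughly every 5 s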


Good point, though for a machine averaging even 40 MB/sec that 30-second window only amounts to one 'batch' every 1.2 GB (40 MB/s x 30 s) receiving an extra sync. Linux recently changed so that pdflush doesn't exist at all: when the dirty list fills, a per-device thread is created that sleeps for 5 seconds before starting work, so maybe it's a bit less than 1.2 GB.


Does this mean that I have to msync multiple mmap'ed chunks in order to batch write? For example

    msync(ptr1, len1, MS_SYNC);
    msync(ptr2, len2, MS_SYNC);
    msync(ptr3, len3, MS_SYNC);
Where [ptr1, ptr1+len1], [ptr2, ptr2+len2], ... are the chunks within a big mmap'ed region where changes occur and need to be written to disk for persistence.

Or do I just msync the whole region then hope and pray that the OS will do the right thing?


Shouldn't it be possible to make the linux kernel behave like this with the swap partition on SSD, to get this benefit for all programs?


The number of write cycles to any block of flash is limited. Once the write counter for a block has hit the manufacturer's hardcoded limit, the SSD will not trust that block to have data written to it anymore.

The whole point of this piece of software is to be smart about how and when it flushes data so it can minimize impact on the write counter.


They actually use mmap internally.


They use MAP_ANONYMOUS, not a file-backed mmap.

https://github.com/twitter/fatcache/blob/master/src/fc_util....


So apparently inventing your own SSD caching system is all the rage? We already had flashcache from Facebook: https://github.com/facebook/flashcache/ and bcache: http://bcache.evilpiepirate.org/

However, this one isn't kernel-based, so it won't help your NFS server or your PostgreSQL engine. On the other hand, it's much easier to build.


Block devices and memcached are totally different; of course you're going to need different implementations for different APIs.


However a block layer cache can enhance any kind of IO.


These caching systems mostly work at different layers: bcache deals with the block layer, flashcache deals with the device mapper, and so does EnhanceIO.


Cool idea. I believe Aerospike uses SSDs for a key-value store like this. Check out http://www.aerospike.com/performance/architecture/


Yep. Aerospike definitely had the first implementation of this kind of SSD-based architecture I'd heard of: use RAM only for hot caching and metadata, and use SSDs with custom filesystems for persistent storage. It's a good idea.

It's too bad most cloud providers consider SSD disks to be a 'premium' feature. I guess this would work fine on custom-configured hardware at places like softlayer and serverbeach.


The README says that memory is 1000s of times faster than SSD.

With that, how much faster is SSD over disk and then memory over SSD?


Latency Numbers Every Programmer Should Know: https://gist.github.com/jboner/2841832 discussion: https://hackernews.hn/item?id=4047623

DRAM (~100 ns) is very nearly 1000x better latency than SSD (~70 us) which is only 100x faster than disk (~10 ms).


Actually, modern SSDs have lower latency than 70 microseconds:

26 microseconds for OCZ Vertex 4 (reads)

http://thessdreview.com/our-reviews/ocz-vertex-4-128gb-ssd-r...

No more than 20 microseconds for Corsair Neutron GTX and Vertex 4 (writes):

http://www.storagereview.com/samsung_ssd_840_pro_review


While that's a significant improvement over the 70us assumption, it's still "around three orders of magnitude slower than DRAM and two orders of magnitude faster than spinning disks", and doesn't really change any of the conclusions you'd have drawn using the 70us number.


OCZ compromises a lot to get high marks on AnandTech benchmarks and great Farcry 3 FPS reports on newegg.

You should be looking at companies like STEC or higher end Intel SSDs for server applications.


This whitepaper goes into some detail about how/why to use SSDs as cache in a zfs fileserver: http://info.nexenta.com/rs/nexenta/images/white_paper_nexent...


"Slab item chunk sizes" suggests they retained memcache's external fragmentation problem (if you mostly have big objects expiring, those slabs of memory won't ever be reused for storing small objects, or vice versa). On memcached with no persistence, you could recover from this by restarting (if you could withstand the availability hit), but what do you do once you're relying on long-lived state?


This is not an issue for fatcache. Slab item chunk size introduces internal fragmentation (because item sizes usually don't match chunk sizes unless you use a slab profile, the -z option, to match them up). The kind of external fragmentation you describe is due to the eviction strategy memcached uses, and it can be avoided with slab-level eviction strategies (Twemcache and the latest Memcached both support this). fatcache does slab-level eviction based on write timestamps, which is the equivalent of LRC eviction in Twemcache; you can read about the mechanism here: https://github.com/twitter/twemcache/wiki/Eviction-Strategie...
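
A rough sketch of slab-level eviction by write timestamp (hypothetical structures, not fatcache's actual code): the whole slab with the oldest write time gets reclaimed, so space comes back in big contiguous chunks no matter what item sizes it held.

    #include <stddef.h>
    #include <time.h>

    struct slabinfo {                  /* hypothetical per-slab metadata */
        unsigned id;
        time_t   last_write;           /* timestamp of the most recent write into this slab */
        int      full;                 /* only full slabs are eviction candidates */
    };

    /* Pick the full slab with the oldest write timestamp. */
    struct slabinfo *slab_evict_candidate(struct slabinfo *slabs, size_t n)
    {
        struct slabinfo *victim = NULL;
        for (size_t i = 0; i < n; i++) {
            if (!slabs[i].full) {
                continue;
            }
            if (victim == NULL || slabs[i].last_write < victim->last_write) {
                victim = &slabs[i];
            }
        }
        return victim;                 /* caller drops every item in this slab and reuses it whole */
    }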


I believe go-memcached is faster than fatcache - https://github.com/valyala/ybc/tree/master/apps/go/memcached . Will return with performance numbers soon :)


Cool, but what I was envisioning when I came up with the term "fat cache" over a year ago was client-side. This seems to be server side, unless I've misunderstood.

https://hackernews.hn/item?id=3544522


This is insane.

Awesome, useful, and cool, but insane.

I like it.


No, it's pretty sane. I work at a well-known NAS storage-appliance vendor and we're doing something quite similar in-kernel as a caching layer for our filesystem.


When I said "insane", I meant more "this is not something you would expect or that is immediately obvious". My phrasing wasn't the finest.



