> False positives from SHA-1 hash collisions are detected after object retrieval from the disk by comparison with the requested key.
I was curious why they would bother, but it seems this isn't quite accurate.
What happens is they first use 32 bits from the SHA-1 hash to find the hash bucket, then they scan for the full SHA-1 of the key. They do not check for actual SHA-1 collisions.
edit: Also on the subject of hashes, the readme suggests switching to MD5 as a possible way to reduce entry size. That is unnecessary; SHA-1 can be truncated to whatever size you're comfortable with.
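To make that concrete, here's a rough sketch of that style of lookup in C (struct names and layout are invented for illustration, not fatcache's actual code): take 32 bits of the digest to pick a bucket, then compare the full 20-byte digest while walking the bucket.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* invented layout for illustration only */
struct item_index {
    uint8_t  sha1[20];    /* full digest of the key */
    uint32_t offset;      /* where the item lives on the SSD */
};

struct bucket {
    uint32_t           nitems;
    struct item_index *items;
};

struct item_index *index_lookup(struct bucket *table, uint32_t nbuckets,
                                const uint8_t digest[20])
{
    uint32_t b;
    memcpy(&b, digest, sizeof(b));        /* first 32 bits of the digest pick the bucket */
    struct bucket *bkt = &table[b % nbuckets];

    for (uint32_t i = 0; i < bkt->nitems; i++)
        if (memcmp(bkt->items[i].sha1, digest, sizeof(bkt->items[i].sha1)) == 0)
            return &bkt->items[i];        /* matched on the full 20-byte digest */

    return NULL;                          /* no separate collision check beyond this */
}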
If you are willing to give up range scans and constrain yourself to fitting all keys in memory there is a lot you can do. See SILT (www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf) and Bitcask (downloads.basho.com/papers/bitcask-intro.pdf).
Since this is a cache I really dig skipping any kind of cleanup/compaction step for deleted/expired keys.
I played around with a similar thing, except as a K/V store, and the performance and density are pretty amazing. With a 64-byte key and a 1.5k value (compressed from 2k) I was getting 85k inserts/sec and several hundred thousand reads/sec on a quad-core Sandy Bridge i5 with a 128 GB Crucial M4 on SATA II.
Fitting all keys in memory is a pretty harsh constraint though. With your 64-byte keys and 1.5k values you are talking about a factor of roughly 20, so for a 1 TB SSD you need about 64 GB of RAM, which costs about as much as the SSD.
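(Rough arithmetic: 1 TB / ~1.5 kB per value is roughly 7*10^8 entries; at 64 bytes of key each that's already over 40 GB before counting any per-entry index overhead, which is how you end up in 64 GB territory.)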
"so for a 1TB SSD you need 64GB RAM, which costs about as much as the SSD."
In my experience this is not true. A 1 TB SSD, even commodity hardware, is pretty expensive, and that's not even talking about high-performance SSDs like Fusion IO, where you're looking at $10K+. Where do you see 64 GB of RAM costing that much?
> False positives from SHA-1 hash collisions are detected after object retrieval from the disk by comparison with the requested key
Is that check really necessary?
To have a 1-in-a-trillion chance of an accidental SHA-1 collision they'd have to store 1.7*10^18 keys, and a mere key index of that size would require 54,000 petabytes of RAM.
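(Back-of-the-envelope birthday bound: for collision probability p over a 160-bit hash you need about sqrt(2 * 2^160 * p) keys, i.e. sqrt(2^161 * 10^-12) ≈ 1.7*10^18; at a few tens of bytes per index entry that's around 5*10^19 bytes, the 54,000 petabytes above.)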
Comparing the sha-1 is much faster than reading the ssd, so it's almost free.
Accidental SHA-1 collision is probably not a problem, but in a few years [1] it will be possible to create SHA-1 collisions and use that as an attack. It looks difficult, but suppose that with the right string an attacker can retrieve the cached information of another user, for example sha1("joedoe:creditcard") = sha1("atacker:hc!?!=u?ee&f%g#jo").
I don't know if they are using randomization; without it, collisions could be used (in a few years) as a DoS attack [2].
Your example describes a preimage attack on SHA-1, not a collision attack. Even with a working collision attack you are probably still far away from taking "some.other.input" and creating sha1("some.other.input") = sha1("johndoe:creditcard").
For instance, MD5 collisions are really easy to create, but for preimage attacks on MD5 there is still no better approach than brute force.
Smells like a bad idea, I agree. You're more likely to hit a hardware or system error that causes that branch to execute incorrectly than to see an accidental SHA-1 collision. And if you're worried about SHA-1 collisions, use SHA-2 or even SHA-3.
They shape their reads and writes to work well with SSD characteristics. Randomly reading and writing to an SSD (whether directly or through mmap) would be slower and wear out the SSD more quickly.
Modern SSDs simply don't work like this: they all internally use some variant of log-structured storage, so that regardless of the user's write pattern a single continuous stream is generated and only one mechanism is needed to distribute modified pages across the available flash. This means an infinite loop that rewrites the first 128 KB of the device with random data will eventually fill (most of) the underlying flash with random data (128 KB because that's a common erase block size).
Write patterns still matter until you're talking about like megablock granularity. Mmap will swap out pages at random (relative to disk layout) and page granularity is far smaller than a megablock. It's certainly possible for controllers to handle this properly, and I don't want to tell you that they never will, but even the very expensive PCI-E flash we use at fb demonstrated this "bad behavior".
Are there standard practices for securely erasing any random SSD without having to look up its implementation details? Or is this the sort of thing you just use a shredder for?
Encrypt it and store the key anywhere except on the drive. To erase, simply destroy the key. Many motherboards come with a tamper-proof key storage device you can reset on command (the TPM). There's a SATA secure-erase command, but it's been shown that multiple vendors have managed to botch its implementation. So if you can't make the encryption approach work, the shredder is probably still your best bet.
Indirection in a log-structured form is the best way to increase write IOPs and optimize write amplification. More sophisticated SSDs actually have multiple log heads, for data with different life cycle properties.
You get a write amp of 1 until the drive is filled the first time. After that, it's a function of (rough numbers after the list):
1) how full the drive is (from the drive's point of view—this is why TRIM was invented)
2) the over provisioning factor
3) usage patterns, such as how much static data there is
4) how good the SSD's algorithms are
5) other (should be) minor factors, such as wear leveling
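To put numbers on it: write amplification is just bytes physically written to flash divided by bytes written by the host. If the host writes 1 GB and garbage collection has to relocate another 0.5 GB of still-valid pages to free up erase blocks, the flash absorbed 1.5 GB for 1 GB of user data, so WA = 1.5. Points 1-3 above mostly determine how much relocation the GC is forced to do.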
Just to further illustrate your point, the relevant section from the readme file:
There have been attempts to use an SSD as a swap layer to implement SSD-backed
memory. This method degrades write performance and SSD lifetime with many small,
random writes. Similar issues occur when an SSD is simply mmaped.
To minimize the number of small, random writes, fatcache treats the SSD as a
log-structured object store. All writes are aggregated in memory and written to
the end of the circular log in batches - usually multiples of 1 MB.
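For illustration, a toy sketch of that batching idea in C (names, sizes, and error handling are made up; the real code also manages slab metadata and a circular log offset):

#include <string.h>
#include <unistd.h>

#define SLAB_SIZE (1 << 20)             /* flush in ~1 MB batches */

static char   slab[SLAB_SIZE];          /* in-memory write buffer */
static size_t slab_used;                /* bytes currently buffered */

/* buffer one serialized item; issue a single large sequential write when full */
int slab_append(int ssd_fd, const void *item, size_t len)
{
    if (len > SLAB_SIZE)
        return -1;                      /* oversized items handled elsewhere in real code */

    if (slab_used + len > SLAB_SIZE) {
        if (write(ssd_fd, slab, slab_used) < 0)   /* append to the end of the log */
            return -1;
        slab_used = 0;
    }

    memcpy(slab + slab_used, item, len);
    slab_used += len;
    return 0;
}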
ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* do stuff with ptr */
msync(ptr, len, MS_SYNC);
No sane OS will pay attention to an mmapped region while it isn't under memory pressure, so dirty pages are effectively buffered until you explicitly tell the OS to start writeback.
Linux flushes dirty pages in mmapped regions to disk every 30 seconds by default whether you like it or not (see vm.dirty_expire_centisecs). Unlike FreeBSD, Linux doesn't yet have support for the MAP_NOSYNC flag to mmap(2).
Good point, though I think for a machine averaging even 40 MB/sec this only amounts to one 'batch' per 1.2 GB receiving an extra sync. Linux recently changed so that pdflush doesn't exist at all; when the dirty list fills, a per-device thread is created that sleeps for 5 seconds before starting work, so maybe it's a bit less than 1.2 GB.
Where [ptr1, ptr1+len1], [ptr2, ptr2+len2], ... are the chunks within a big mmap'ed region where changes occur and need to be written to disk for persistence, should I msync each chunk separately?
Or do I just msync the whole region and then hope and pray that the OS will do the right thing?
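I.e., for the per-chunk option, something like this for every dirty chunk (rounding each start address down to a page boundary, since msync wants a page-aligned address)?

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

/* msync needs a page-aligned address, so round each chunk's start down to a page boundary */
int sync_chunk(void *chunk, size_t len)
{
    uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)chunk & ~(page - 1);
    size_t    slack = (uintptr_t)chunk - start;

    return msync((void *)start, len + slack, MS_ASYNC);  /* MS_SYNC to block until written */
}

/* sync_chunk(ptr1, len1); sync_chunk(ptr2, len2); ... */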
The number of write cycles to any block of flash is limited. Once the write counter for a block has hit the manufacturer's hardcoded limit, the SSD will no longer trust that block to hold data.
The whole point of this piece of software is to be smart about how and when it flushes data so it can minimize impact on the write counter.
All these caching systems mostly work at different layers: bcache deals with the block layer, flashcache works through the device mapper, and so does EnhanceIO.
Yep. Aerospike definitely had the first implementation of this kind of SSD-based architecture I'd heard of: use RAM only for hot caching and metadata, and use SSDs with custom filesystems for persistent storage. It's a good idea.
It's too bad most cloud providers consider SSD disks to be a 'premium' feature. I guess this would work fine on custom-configured hardware at places like softlayer and serverbeach.
While that's a significant improvement over the 70ms assumption, it's still "around three orders of magnitude slower than dram and two orders of magnitude faster than spinning disks", and doesn't really change any of the conclusions you'd have drawn using the 70ms number.
"Slab item chunk sizes" suggests they retained memcache's external fragmentation problem (if you mostly have big objects expiring, those slabs of memory won't ever be reused for storing small objects, or vice versa). On memcached with no persistence, you could recover from this by restarting (if you could withstand the availability hit), but what do you do once you're relying on long-lived state?
This is not an issue for fatcache. Slab item chunk size introduces internal fragmentation (because item sizes usually don't match chunk sizes unless you use a slab profile, the -z option, to match them up). The kind of external fragmentation you describe is due to the eviction strategy memcached uses, which can be avoided with slab-level eviction strategies (Twemcache and the latest Memcached both support this). fatcache does slab-level eviction based on write timestamps, which is the equivalent of LRC eviction in Twemcache; you can read about the mechanism here: https://github.com/twitter/twemcache/wiki/Eviction-Strategie...
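To sketch what slab-level eviction by write timestamp might look like (toy code with invented structures, not the actual Twemcache/fatcache implementation):

#include <stddef.h>
#include <time.h>

/* toy structures, not the real ones */
struct slab {
    time_t write_ts;      /* when this slab was last written */
    /* ... item storage ... */
};

/* slab-level eviction: reclaim the slab that was written least recently */
struct slab *pick_victim(struct slab *slabs, size_t nslabs)
{
    struct slab *victim = &slabs[0];

    for (size_t i = 1; i < nslabs; i++)
        if (slabs[i].write_ts < victim->write_ts)
            victim = &slabs[i];

    /* every item in the victim is dropped at once, and the freed slab can be
       handed to whatever size class needs it, so no per-size-class fragmentation */
    return victim;
}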
Cool, but what I was envisioning when I came up with the term "fat cache" over a year ago was client-side. This seems to be server side, unless I've misunderstood.
No, it's pretty sane. I work at a well-known NAS storage-appliance vendor and we're doing something quite similar in-kernel as a caching layer for our filesystem.