A File System All Its Own – specialized for SSDs

Sami_Lehtinen · on April 14, 2013

Layering stuff for legacy reasons isn't anything new. It was a smart idea to connect flat digital displays using ADC and VGA cable and display adapter with DAC. Still many did it, and some people even doing it today. It doesn't still make any sense what so ever.

vy8vWJlco · on April 15, 2013

It makes perfect sense when, say, that's the only cable you have on hand and when "mostly working for cheap" beats "not working, but technically correct and optimal." And then there are those to whom "lossy but non-DRM analog" (VGA) typically beats "lossless digital with DRM capability that will sneak up on you when you least want it" (HDMI).

petsos · on April 15, 2013

What the hell am I going to do with a VGA display when the only output I have is a serial cable? And then why should I pay for a VGA cable? I can connect my vt100 anywhere with a cable I can make myself.

vy8vWJlco · on April 15, 2013

I'll bite. I didn't say VGA was cutting egde, or the one true cable, or that real programmers use butterflies - only that it remains practical for a very large number of uses. I believe the longevity of VGA has a lot to do with its compatibility and explicit lack of DRM features. I have often had a choice and gone with "good enough" VGA when digital was an option, simply because I am aware that by buying HDMI I am not only financially supporting and licensing DRM, but am committing to a technology that can be used against me. I may not be the norm, but I am far from a Luddite.

shrughes · on April 15, 2013

There are other digital options besides HDMI...

vy8vWJlco · on April 15, 2013

Yes, DVI was popular there for a while, and there are an assortment of others, but HDMI seems to have outpaced DVI in my encounters (and the others are quite niche: some Mac, Intel, etc...). DVI is still attractive, but things are shipping without it in favor of HDMI (in my experiences), and that only further helps extend VGA's lifespan. People already have VGA cables, and their only "upgrade" path to digital is often HDMI, so VGA remains the lesser of evils based on price and personal lifestyle/ethics.

LukeShu · on April 15, 2013

My experience is similar, but I'll add that DVI->HDMI cables are cheap, and mean that you can use DVI even when your displays are HDMI.

joe_bleau · on April 14, 2013

The MLC and SLC NAND trends in figure 1 are confusing me. Historically, wasn't SLC first? Yet the graph shows pricing for MLC back to 2001, and SLC back to only 2007-ish. It correctly shows that MLC is less expensive than SLC.

Maybe he didn't have old price data for SLC?

mbjorling · on April 14, 2013

It's hard for people to get data on flash chips / prices without signing an NDA. He probably found the data from various places and stitched it together from that.

joe_bleau · on April 14, 2013

Hmm. I read the chart as price for the entire drive, not the 'raw materials' of the drive.

adamleventhal · on April 16, 2013

The data is from the good folks at Objective Analysis.

kalleboo · on April 15, 2013

> Again, this approach today requires a vendor that can assert broad control over the whole system—from the file system to the interface, controller, and flash media.

Apple would be well-positioned here if they still cared about their Macs. HFS is due for a replacement anyway after 30 years. (it could be done on iOS devices too, but flash I/O performance doesn't seem to be a the major bottleneck for those uses)

DanBC · on April 16, 2013

I would love Apple if they created an excellent file systems specifically for SSD use, and if they also opened that file system to other users.

HFS+ has some nice features. I don't know anyone not running OS X using it.

baruch · on April 15, 2013

They actually have all the components under their arm, they bought Anobit which made fabulous SSDs and are snatching engineers around here.

Anobit SSDs were the best I've seen so far in terms of consistent performance.

adamleventhal · on April 16, 2013

Agreed that Apple is very well-positioned for this -- the question is whether they care about the problem. A purpose-built filesystem would improve performance and longevity at a lower cost. Is it worth it for Apple to invest in a brand new file system and data path?

baruch · on April 14, 2013

The problem is that most users (home & enterprise) just want things to work, they don't really care much how to get there and to have the best efficiency.

It won't be too hard to have a good filesystem that works over raw NAND flash but it will not work on older OSes, it will not work in the enterprise storage market and so there will be less buyers and thus it will cost more so no one will buy it and it will not be made.

Even the enterprise storage folks just want the damn flash devices to just work without the storage folks doing anything with them. It's taken to extremes sometimes and the flash vendors just do whatever they are told since there is a lot of market in whatever the software-defined engineers want. Except the engineers mostly want to deal with high level algorithms and to brag how fast their algorithm is without really thinking about the hardware. Hardware is hard. Besides they can do something with the hardware that is already on the market rather than envision something better.

TL;DR unless someone will hold the stick at both ends (software and hardware) no one will make a reduced layer solution.

dman · on April 14, 2013

Hardware makers make this hard for themselves. Having expensive devkits and requiring NDA's for doing the most trivial things creates a dynamic where the only players able to jump through the monetary and bureaucratic hoops are large vendors. Those vendors are unable to deliver true innovation because they have so much vested in the status quo.

lgeek · on April 15, 2013

> The problem is that most users (home & enterprise) just want things to work, they don't really care much how to get there and to have the best efficiency.

When it comes to research, no one cares that much about what home users and enterprise-users-small-enough-not-to-use-custom-software-stacks want now. Case in point: I don't think many IT mangers were that eager to switch to using ZFS in production when it was announced back in 2004 (and ZFS wad been under development for years at that time).

I've considered doing a PhD project pushing and stretching the boundaries of SSD firmware/operating system/filesystems because I think there's a lot of improvement that can be done in this area. The cost of OpenSSD that the sibling comments mention wasn't even that much of a problem. I seriously don't think someone not associated with a research department somewhere would have the time and/or know-how to do original research and implement a working prototype. Hell, I might get a devkit, but I doubt I'll do anything interesting and original at the same time. Which brings us to the actual problem:

Documentation and NDAs. For lots of ICs, microcontrollers, processors you can freely get hardware documentation, programming manuals, etc. For flash controllers and high-density NANDs? Almost nothing at all. Maybe some stuff can be reverse engineered and you get some documentation for OpenSSD. But the NAND manufacturer won't tell you stuff that's really, really important about things like failure patterns, which would allow you to optimize error correction and wear leveling for example.

baruch · on April 15, 2013

Documentation is an issue indeed.

I don't think that you really need the inner information about NAND to do the original and innovative research. It really depends on the area you want to work on, the SSD firmware level might require that but the SSD makers are already on that route (some better than others). The other level is to not pay too much attention to the differences between NAND chips and just implement something at a higher level to push the hardware-agnostic smarts to the OS.

The OpenSSD also lacks documentation, last time I looked at it there was no info on how to do NCQ on the SATA interface, without which there is no talking about a speedy SSD.

baruch · on April 14, 2013

Case in point for the cost: I looked at buying an OpenSSD (referenced in the article) to write my own firmware to a lowly 64GB device. Was quoted $2000. There is nothing else getting close to this openness on the market.

adamleventhal · on April 16, 2013

... and the prognosis for OpenSSD is terminal.

baruch · on April 17, 2013

It seems to be geared only towards research insititutes and a specific one at that, most of the work using OpenSSD is concentrated on that one university.

krschultz · on April 14, 2013

It may happen in the server world. Engineering wins the day a whole lot more often than in the consumer market. ZFS et al are a good case in point.

mav3r1ck · on April 15, 2013

Not quite sure what this article is talking about:

https://en.wikipedia.org/wiki/List_of_file_systems#File_syst...

I personally believe log-based file systems are a perfect match due to never saving the same file repeatedly to the same location (so provides built in wear-leveling) and one can optimize writes by always clearing the head of the log for the next write.

adamleventhal · on April 16, 2013

That list is specious at best.

Take ZFS. I designed the flash integration for ZFS; it's used as a caching tier. ZFS is definitely not optimized for use with flash as its primary backing store. The same is true for some of the other filesystems in the list; offhand: CASL and WAFL.

Most of the rest are designed for embedded use cases, are research toys, or are embedded research toys.

stcredzero · on April 14, 2013

The memory hierarchy needs to be revised to take into account the different performance characteristics of Flash RAM vs. hard drives. There is no disputing that NAND Flash SSD are very different from Dynamic RAM, static RAM, and HD.

jpalomaki · on April 14, 2013

Wouldn't it make sense to use an object storage style interface to SSDs? Instead of managing sectors and cylinders the SSD would provide interface for managing objects, pretty much like cloud storage services like S3.

mbjorling · on April 14, 2013

It's one way to look at it. However, having an object interface toward the SSD does not solve the problem of variability that the author mentions.

The variability is caused by the "incompatible" NAND flash interface (read, write and erase), while the IO interface to the host system is read/write (and occasional trim to let the device know of unused pages). Therefore, another interface, other than a simple read/write is the holy grail. This interface might be one that give various guarantees for the user, e.g. atomic operations, etc. It doesn't need to only be an object / page store.

baruch · on April 15, 2013

For some uses this may be a better option, it will still require handling of the flash chips. Assuming SAS connectivity as the interface this is perfectly possible with the OSD standard from T10 (SCSI committee).

SATA is not going to work since it is a block interface only and not easily extensible in a sane way.

vy8vWJlco · on April 15, 2013

"Layering the file system translation on top of the flash translation is inefficient and impedes performance."

"For many years SSDs were almost exclusively built to seamlessly replace hard drives; they not only supported the same block-device interface"

The point of storage is to be able to put anything you want on it. That contract is the block interface, and includes the ability to change the filesystem. A file with internal structures is also a filesystem. The interfaces are fine. Change for change's sake should be avoided. (Providing a bypass, SSD-optimized interface is fine, but, ahem: "put down the crack pipes"... https://hackernews.hn/item?id=5541063 )

petsos · on April 15, 2013

> That contract is the block interface, and includes the ability to change the filesystem.

The article argues that we should change this contract.

> The interfaces are fine.

The article argues that they are not.

> Change for change's sake should be avoided.

The article argues that we should change them for performance's sake.

vy8vWJlco · on April 15, 2013

>> Change for change's sake should be avoided. > The article argues that we should change them for performance's sake.

I also said providing a bypass was reasonable, but the article gives the impression that the block-level interface is yesterday's jam, and something less than the starting point. It is the starting point and will continue to be because block-level storage is the major use-case. Block-level access accounts for basically all bulk storage in /dev. Extra performance and features (ioctl calls, or a management interface) are gravy, but without the block-level interface, it is not accessible to 99.9% of all software and will not serve for general storage, including pre-existing filesystems, from FAT12 to Btrfs. There are applications for which those are interfaces. There's no reason to make it difficult to apply those layers. Looking forward, if it can't store current and unforseen filesystems out of the box (the block interface), even if that means a 50% reduction in speed, it doesn't deserve the name "mass storage." Noone wants a key-value store even if it is 100% faster. They may be faster, but it doesn't look like storage. And there's no need for it to look different either, because people expect to be able to use it like a block device. Take WD Green drives with their larger allocation sizes. There happens to be an optimal cluster size, but the interface is still that of block IO. If flash works best for a certain allocation size or other tweaks, or even a specific high-level formatting, fine, but it's still going to need to provide the block IO interface as a starting point if it's going to be used as a drive.

It's nice to see at least one punitive downvote though.

kalleboo · on April 15, 2013

> Block-level access accounts for basically all bulk storage in /dev

Unless your home and binary directories are stored on an NFS share.

vy8vWJlco · on April 15, 2013

Sure, but even then, on the server - where the storage medium exists, and the bytes are held - it's stored on the operator's choice of filesystem, made possible by the ubiquitous block IO interface (or the platform-specific equivalent thereof).

Edit: Network block device, iSCSI, AoE, etc... the block interface is the lifeblood of the storage area network.

baruch · on April 15, 2013

I actually agree that using the block interface makes sense for storage but would have loved to have a minimal interference from the SSD. If it just exposed the entire flash to address by the user and reported when a block has problems and maybe some stats about the underlying flash chips I'd be very happy. It would then enable building better things at the OS/Application side.

There is quite a bit of a chicken-and-egg problem here though, all the current filesystems basically assume that the underlying media never has any faults and if it does than the media problems are static and do not develop over time. This is obviously incorrect for flash, but even for rotating media it wasn't true. Since every OS requires a fault-free media the SSD vendors are working hard to make sure they provide a semblance of such fault-free media they make it harder to provide the best possible performance or a different trade-off than what they have taken.

vy8vWJlco · on April 15, 2013

I also want lower-level access, but I would presumably not be using it for files/reliable storage.

If a lossy interface is acceptable, why couldn't SSDs simply expose a faster albeit lossy block device and, if necessary, an extended SMART or custom inspection method, and let the user take responsibility for wear-leveling, ECC, etc? It would be backwards compatible with other things by virtue of presenting the standard block interface. A common ECC+wear-leveling middle layer could evolve allowing use of standard filesystems and a common codebase for all flash storage, relieving the apparent burden on SSD vendors who would love to sell fast unreliable storage rather than reliable storage.

I think the chicken-egg problem is just an egg problem though, because even though I'd love lower level access to the unreliable bits, I have to expect that the market for unreliable storage is very small, not unlike the high-efficiency-but-sometimes-exploding toilet. :)

If reliable storage is the primary use-case, maybe the drives just need to be smarter to keep up. I'd rather an ASIC handle ECC, etc, transparently (for the same reasons I'd rather have a dedicated GPU) than run ECC (or 3D floating point software) on my general processor. If you inevitably want reliable storage and just wind up running ECC, etc, on the CPU, the speed gains disappear and we're back in something similar to a pre-DMA world with the main processor doing something that could be done in parallel by a dedicated chip. If the hard drive is the right place for the offload, I'd rather the economic pressures remain for the SSD vendors to optimize inside that black box, behind the standard reliable interface.

That said, again, I too would love finer-grain control.

baruch · on April 15, 2013

I can see a mix where some parts are done by hardware, ECC comes in there. Other things should be done in software (FTL, error recovery, RAID).

The block interface itself is actually matching the flash, you read/write/erase in blocks, they may not be 512 bytes but rather 4k/8k/256k whatever works for the underlying hardware.

Sami_Lehtinen · on April 15, 2013

Don't you remember MFM and RLL with CHS addressing? Disk controller wasn't mapping bad blocks out, it was the file system.

baruch · on April 15, 2013

Indeed, but life then was different, the CPU was weaker and there was a good intention with pushing everything to the disk. The intention is still good but it is now possible to do things better in a different way.

hobbes78 · on April 14, 2013

Is exFAT no good?

mbell · on April 14, 2013

In short - No its not a very good file system.

Even ignoring the licensing / patent issues with it, its non-journaled and only has a single FAT in most implementations; Its easily corrupted and difficult to repair. It also lacks a number of useful features like pre-allocation, robust meta-data, etc.

frozenport · on April 14, 2013

As a side note, Lustre seems to work fine on SSDs.

adamleventhal · on April 16, 2013

Everything works "fine" on SSDs! They were designed to drop right into place.