If you are the target audience for these drives, you likely don't even consider regular (i.e. including random) IO as a use case for modern HDDs. 10+ TB HDDs really only make sense for sequential IO, even if they technically still support random writes (e.g. the 10 & 12 TB PMR drives): the orders-of-magnitude difference between random and sequential IO performance makes this a no-brainer.
If you look at the design of, for example, Dropbox's Magic Pocket, or Infinidat and Qumulo, you'll notice that they keep HDD access as sequential as possible. And if your storage layer is already optimized for sequential writes, why not take the opportunity to get some capacity "for free" by adopting SMR drives?
Copy-on-write filesystems can probably be optimized for SMR by using TRIM commands to punch holes and then rewriting the live content sequentially into a new zone. AFAIK both zfs and btrfs have plans to do this.
That way they can be useful for more than archival.
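The garbage-collection pass is conceptually simple. A minimal Python sketch of the idea, with zones simulated in memory and reset() standing in for TRIM/zone reset (this is not actual zfs or btrfs code):

    # Simulated SMR-style zones: writes only ever land at the zone's write pointer.
    class Zone:
        def __init__(self, size):
            self.size = size
            self.wp = 0            # write pointer (bytes appended so far)
            self.records = []      # (record_id, payload) pairs, in write order

        def append(self, rec_id, payload):
            if self.wp + len(payload) > self.size:
                raise IOError("zone full")
            self.records.append((rec_id, payload))
            self.wp += len(payload)

        def reset(self):           # stands in for TRIM / zone reset
            self.records, self.wp = [], 0

    def gc_zone(old, new, live_ids):
        """Copy only the still-referenced records sequentially into a fresh
        zone, then discard the old zone in one shot."""
        for rec_id, payload in old.records:
            if rec_id in live_ids:
                new.append(rec_id, payload)
        old.reset()

    old, new = Zone(1 << 20), Zone(1 << 20)
    old.append("a", b"x" * 100); old.append("b", b"y" * 100); old.append("c", b"z" * 100)
    gc_zone(old, new, live_ids={"a", "c"})
    print(new.wp, old.wp)          # 200 0: live data rewritten sequentially, old zone free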
I was thinking that F2FS might be a good filesystem to use as a base under an object storage abstraction layer (like Ceph)...
However, since I first saw the news about these drives a few days ago, Samsung has also axed some Linux devs, which gives me pause and makes me reconsider the long-term viability of that filesystem...
A full-blown filesystem is overkill for an object store. You could use something like libzbc ( https://github.com/hgst/libzbc ) to write directly to the SMR drives at the block level.
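The bookkeeping an object store needs on top of raw zones is pretty small. A rough Python sketch of the idea, using in-memory stand-ins for zones rather than real libzbc calls (the names below are illustrative only, not libzbc's API):

    # Append-only object store over fixed-size zones (in-memory stand-in).
    ZONE_SIZE = 256 << 20                       # assumption: 256 MiB zones

    zones = [bytearray() for _ in range(4)]     # pretend SMR zones
    index = {}                                  # object key -> (zone, offset, length)

    def put(key, blob):
        for z, buf in enumerate(zones):
            if len(buf) + len(blob) <= ZONE_SIZE:   # writes only ever go at the tail
                index[key] = (z, len(buf), len(blob))
                buf.extend(blob)
                return
        raise IOError("no zone with enough free space")

    def get(key):
        z, off, length = index[key]
        return bytes(zones[z][off:off + length])

    put("photo-123", b"\x89PNG...")
    assert get("photo-123").startswith(b"\x89PNG")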
I believe Ceph has now abstracted the drives away through BlueStore, which simply puts a large RocksDB database on the drive, bypassing most of the functionality a filesystem offers. It should be much easier to make an SMR-compatible version of RocksDB's LSM-tree backend than to write a full-blown filesystem.
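The reason an LSM tree is such a natural fit is that its write path is already append-only: random-key updates are buffered in RAM and flushed as large sorted runs. A toy Python sketch of that write pattern (not RocksDB's actual on-disk format):

    # Minimal LSM-style flush: buffer writes in memory, then emit one sequential,
    # sorted, append-only run. This write pattern maps nicely onto SMR zones.
    import json, os, tempfile

    memtable = {}

    def put(key, value):
        memtable[key] = value          # random-access updates stay in RAM

    def flush(path):
        """Write the whole memtable as a single sequential, sorted run."""
        with open(path, "w") as run:
            for key in sorted(memtable):
                run.write(json.dumps({"k": key, "v": memtable[key]}) + "\n")
        memtable.clear()

    put("user:42", "alice"); put("user:7", "bob")
    run_path = os.path.join(tempfile.mkdtemp(), "run-000001.jsonl")
    flush(run_path)                    # one big sequential write, no in-place updates
    print(open(run_path).read())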
Random access still seems like it might be an issue. But I would love to see how e.g. nilfs2 (maybe on top of software RAID) benchmarks against zfs on these big drives.
Depends... my NAS is mostly used for video/audio archives, and generally only one device in the house is using it, playing back a single file or listing a directory. It would probably be fine in that use case. Beyond that, it's used for backups.
Seems like a decent use case. I'm not sure how well they'd work in a NAS with RAID 5/6, though. I'd already been considering a new NAS with 4-6 drives at 8-12TB. Random I/O isn't my primary use, and I'm sure there are others in the same position at these sizes.
This is one reason why RAID 0+1 is a best practice and RAID 5 & 6 are no longer recommended: the array takes too long to rebuild, which leads to multi-disk failure situations.
RAID6 should be fine rebuilding online (in RAID5 mode) even under a moderate write load.
Of course, one should source RAID disks from three different vendors, to ensure that they are from different batches and are not going to fail at approximately the same time.
Good advice, though I once had about half a dozen drives (out of a 12-drive RAID-Z2 with 2 hot spares) fail within a few weeks of each other despite being sourced from separate batches. (Seagate 3TB drives; I think there have been articles on how bad that series was.)
I don't know how I survived those Seagates. They lasted maybe a year and then started dropping like flies. Synology seems to recommend identical drives, as I recall, but works fine with different sizes and makers AFAICT.
CRUSH is an example (and not the first) of a "distributed rebuild" approach: you have an array of N drives (with N large, e.g. 100), and if 1 drive fails, you read in parallel from all (N-1) remaining drives, while distributing the reconstructed data across the remaining available capacity of all (N-1) remaining drives.
In effect, you get the total bandwidth of (N-1) HDDs working in parallel. And the bandwidth of 100 HDDs doing sequential IO in parallel is really massive (~10 GB/s).
Examples of companies claiming to use this approach are Qumulo (rebuild in a couple of hours), Infinidat (a couple of tens of minutes), ClusterStor GridRAID (now part of Seagate, I think), and "Declustered RAID" in GPFS (IBM).
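Back-of-the-envelope, with assumed round numbers (100 MB/s of sustained sequential throughput per drive, a 12 TB failed drive), the difference versus a classic spare-disk rebuild is dramatic:

    # Rough rebuild-time estimate for a declustered / distributed rebuild.
    n_drives       = 100
    per_drive_mb_s = 100          # assumption: sustained sequential MB/s per drive
    lost_tb        = 12           # capacity of the failed drive

    aggregate_mb_s = (n_drives - 1) * per_drive_mb_s        # ~9.9 GB/s in total
    rebuild_hours  = lost_tb * 1e6 / aggregate_mb_s / 3600
    print(f"{aggregate_mb_s / 1000:.1f} GB/s aggregate, ~{rebuild_hours:.1f} h to rebuild")

    # Versus a classic rebuild bottlenecked on a single spare drive:
    print(f"single-drive rebuild: ~{lost_tb * 1e6 / per_drive_mb_s / 3600:.0f} h")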
As opposed to RAID 5, where the array is toast if any two disks fail, RAID 6 raises that threshold to three.
However, both RAID 5 and 6 have two huge problems:
Data in flight at write time (power/hardware failures are more likely to corrupt the array, especially silently, which is the worst outcome).
Parity calculations require you to spin up the whole RAID 5/6 array during a rebuild, massively increasing the chance of a multi-drive failure and a lost array. If one close-to-EOL drive dies, putting its sister drives through what is essentially an all-day, full-tilt stress test is a terrible, terrible idea, and it keeps getting worse (taking longer) as drive sizes grow.
RAID 0+1 mostly sidesteps these issues at a modest increase in drive count; it's a no-brainer for most setups.
> Data in flight at write time (power/hardware failures are more likely to corrupt the array, especially silently, which is the worst outcome).
How is that? RAID doesn't affect data-persistence behavior in any meaningful way. FUA/sync-cache/etc. are supported by RAID controllers the same as by the underlying disks in writeback environments, parity updates included. Put another way, if you FUA or flush the writeback cache, those operations won't complete in a properly implemented RAID environment until the data is persisted somewhere, even if that means passing FUA down to the underlying storage. Granted, there are a number of ways to mess this up, e.g. RMW cycles in a controller that doesn't have some kind of persistent memory and flush-on-power-restore. Anyway, none of this is any worse than what happens in any other writeback-cached storage technology.
Finally, all this fearmongering about loss on rebuild should also be explored more fully in the context of the fact that decent RAID systems run background scrub operations on a regular basis. Those operations by themselves are going to "stress test" the array regularly while it's consistent and not degraded. I actually have a fair amount of experience in this area, and I'm here to tell you that if you think this is a risk, consider what happens to non-RAIDed, unscrubbed drives with a lot of data silently bitrotting on the platters. That latter effect is nearly always the problem in RAID environments when someone starts a rebuild on drives/sectors that have gone unread for extended periods of time. But in the case of RAID, a properly implemented system won't fail a drive for a single read failure during a rebuild; instead it reconstructs from the other drives, leaves the drive online long enough to complete the rebuild, and only then takes it offline.
Basically, RAID 1 setups don't actually fix any of these problems, except through the use of massive additional parity-disk overhead, overhead that can also be applied to other RAID algorithms to much better effect. AKA a mirrored RAID 6 provides far more protection than a mirrored RAID 0. Similar levels can be had with 6+6 in environments where that is possible, with trivial capacity overhead.
RAID 5/6 requires parity calculations before data can be written to disk. That is a significant amount of extra in-flight data, especially at high write speeds, and it is what causes the in-flight-data problem.
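For concreteness, the parity math in question is just an XOR across the stripe; a toy Python sketch (not any particular controller's implementation):

    # RAID5-style parity: the parity strip is the XOR of the data strips.
    # Losing any ONE strip is recoverable by XOR-ing the survivors together.
    def xor_strips(strips):
        out = bytearray(len(strips[0]))
        for strip in strips:
            for i, byte in enumerate(strip):
                out[i] ^= byte
        return bytes(out)

    data   = [b"AAAA", b"BBBB", b"CCCC"]   # three data strips of one stripe
    parity = xor_strips(data)              # must be computed before the stripe hits disk

    # The drive holding strip 1 dies: rebuild it from parity plus the other strips.
    rebuilt = xor_strips([data[0], data[2], parity])
    assert rebuilt == data[1]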
Battery and flash backup on controllers doesn't fix the problem of hardware failure (which is significant, especially on big, hot controllers).
Again, decent controllers have ECC protection and the like, and frequently are available in HA configurations if your worry is controller failure (along with redundant/dual data paths to the media via SAS/NVMe/etc). Plus, there are a long list of technologies that can be enabled at the HBA layer and pushed all the way to the media (T10 DIF/DIX comes to mind).
But much of this micro-level redundancy is overkill, as frequently one uses some kind of application-level HA/redundancy as well. So losing a RAID 5/6 disk in a single machine is the functional equivalent of losing any combination of RAID 0/1 disks in the same machine. You still need the higher-level redundancy as well as a backup plan.
We could start breaking the discussion up into fabric-attached vs direct-attached RAID vs software, but I think it's sufficient to say that RAID 5/6 doesn't _increase_ the failure surface in any meaningful way when you're not using fly-by-night RAID.
Edit: Maybe what you're trying to say is that cache flush/FUA operations for a given piece of data don't cover the parity calculation and buffers? That is false: a controller should not be responding to FUA/etc. until the entire block (including the parity) has been persisted. So if the controller dies during the operation, the host OS is fully aware that the operation didn't complete. The given block is of course left in some unknown state in this case, but that is true of any write operation that fails like this, regardless of WT/WB/RAID/etc.
The biggest problem with RAID 5 is that it is completely unprotected against silent corruption, because there is no way for the RAID layer to know which piece of data is the corrupted one (and as a result it has to decide whether the parity is correct or not; most RAID implementations just ignore silent corruption completely, so the parity is always assumed to be wrong in such cases).
So even if you rebuild an array, a bad drive might have already silently blown away your data. If you compare this with ZFS's RAID-Z1 (same parity, different design), you get detection of and protection against silent data corruption.
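The difference is easy to see with a toy example: store a checksum per block (in the spirit of what ZFS does, not its actual on-disk format) and the corrupt strip identifies itself, so you know which one to reconstruct from parity:

    # Single parity alone tells you the stripe is inconsistent, but not WHICH
    # strip is wrong. Per-block checksums pinpoint the corrupted one.
    import hashlib

    def checksum(block):
        return hashlib.sha256(block).hexdigest()

    blocks    = [b"strip-0", b"strip-1", b"strip-2"]
    checksums = [checksum(b) for b in blocks]       # written alongside the data

    blocks[1] = b"bitrot!"                          # silent corruption on one drive

    bad = [i for i, b in enumerate(blocks) if checksum(b) != checksums[i]]
    print("corrupt strip(s):", bad)                 # -> [1], so we know what to rebuild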
> through what is essentially an all-day, full-tilt stress test is a terrible, terrible idea
The rebuild isn't putting the disks under stress. The sister drive has already failed silently but you only notice this once you start the rebuild. The solution is to check the disks once a week by fully reading every sector.
The normal answer here is to make sure that each side of the RAID10 (RAID01 is something different and much less common) mirror uses drives from a different vendor, thus giving each side a different bathtub curve / failure rate and mitigating the impact of a bad batch. This is a nice advantage over parity-based setups like RAID6 (since replicating this with RAID6 would require finding a unique vendor for each array member, and there are only so many vendors).
For archival purposes, though, you're probably better off with a normal RAID1 + some kind of JBOD setup (like with LVM); striping makes data recovery more difficult should you indeed lose all RAID1 sides of a given member.
You can upgrade a 2-disk RAID 1 to a 3-disk RAID 5, then chain them into RAID 0 as usual. That gives you a better chance of keeping your data intact, hopefully without seriously lowering the write speed.
I already have a lot of places where recovering from a full wipe is almost not worth it: customers with hundreds of TB of data in 'prosumer' NAS devices that are chock-full of regular drives.
Tape sequential throughput is higher than HDD. It's the latency/seek time that gets you. But for a recovery scenario, you can hopefully just do a giant sequential scan.
So does anyone know of a way to take a set of files and write them to an HDD, in NTFS or exFAT format, in a single sequential write? Essentially building the FS on the fly (because we're talking about datasets that are much too large to fit into memory)?
Building the MFT first is pretty much what I want to do. But I'm not aware of any utilities that can handle it, nor where to start with writing it myself...
I have a project where I routinely need to copy large amounts of data (3 to 8 TB) to a hard drive. The problem is, my files are all 512 KB, so this is much slower than it could be...
If I write it as a single tar file I get excellent throughput, but the users who need to be able to work with the drive are unable to handle a tar file. They need to be able to plug the drive into a Windows computer and have it "just work".. which presents some problems.
That's how you copy an existing filesystem. It's not how you take an arbitrary subset of files and put them into a new filesystem with minimal write amplification.
I was about to buy 3x 12TB Toshiba drives for deep learning datasets; now I need to reconsider... Does anyone know the current reliability stats for >10TB drives? My old 6x 4TB HGSTs in my NAS have been running without a single problem for the past 3 years...
Excellent. All recent big disks over 8 TB from all vendors have excellent reliability. HGST/WD helium drives are particularly good; I have configured and installed many hundreds, from 6 to 12 TB, over the past 4 years, and not a single one has failed yet.
If you are doing serious deep/machine learning, you often get massive datasets. You might need to pre-process substantial portions of them for different models you want to try, or train custom models on, e.g., 4k/8k video footage (think DLSS, self-driving cars, or drone footage); space gets exhausted pretty quickly. You might also want multiple independent drives, since you may do all of this in parallel: it would be too slow on a single drive, and even RAID across two drives would pay a large performance penalty on seeks.
Internet speeds are slow in some places, such as France (specifically the Pays de Gex near Geneva, where my parents live). My dad uses iCloud, but he drives to CERN to upload (he just retired).
I have 18 TB: 5 TB Seagate (x2), 4 TB Western Digital, 2 TB Western Digital, and 2 TB internal.
Backups take the most space - I fix laptops for friends from church, and they don't back up but still want their files to be safe. I had to shuffle some files around to free up 650 GB for a recent repair, mostly photos & videos.
Virtual machines use a lot of space too. I made VMWare Fusion images of every Mac OS version 10.5-10.13, Windows 95, 98, 2000, XP, 7, and 10, in several languages ( https://peterburk.github.com/i2018n ), and some Linux distros.
Music, mostly from repaired iPods back in high school, accounts for a lot as well. There's some movies too, though I missed a chance to get 2 TB from a friend because I didn't have enough space at the time. If I upload those, even those that I legally ripped from CDs & DVDs, I'm worried that it'll trigger content filters.
For these, local disks are more useful than cloud services in my opinion.
I'm going to highly recommend looking at drives with WizTree, which does a very fast display of what's taking up space based on parsing the MFT rather than scanning the entire drive.
You may find that there's a massive amount of data where you wouldn't expect it, such as in the Windows Temp directory - if so and it's a bunch of files named "cab_something", you can kill all of those and prevent recurrence with a little housekeeping.
My dad and I created backups locally, mailed each other hard drives, and now just do a weekly rsync. Storage is large, but network traffic is relatively low.
Having a "backup buddy" you can swap drives with once in a while is never a bad idea. Encrypted backup drives can save your bacon if you're ever caught in a bad situation.
Before my current crop of 8TB WD Reds, I ran Toshiba enterprise drives and they were extremely reliable for me. None of them failed in a 24x7 hardware RAID 6 environment after a few years; I only replaced them to upgrade capacity.
I've read that Toshiba is based on old HGST process and those drives were extremely reliable, which is why I wanted to get them for the new Deep Learning workstation.
The 15TB drive packs in 1108 Gbit/inch2. That works out to roughly 582 nm2 per bit, i.e. a square about 24 nm on a side. This is small, but the transistors in flash are smaller [1]. As mind-blowing as the numbers (for both technologies) are in the referenced article, that article is now 2.5 years old. Is anyone aware of more recent numbers?
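For what it's worth, the quick arithmetic behind that per-bit figure:

    # Areal density -> per-bit footprint.
    gbit_per_in2 = 1108
    nm_per_inch  = 25.4e6

    bits_per_in2 = gbit_per_in2 * 1e9
    nm2_per_bit  = nm_per_inch ** 2 / bits_per_in2
    print(f"{nm2_per_bit:.0f} nm^2 per bit, ~{nm2_per_bit ** 0.5:.0f} nm on a side")
    # -> roughly 582 nm^2, i.e. a square about 24 nm on a side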
From my experience in the HDD industry, I remember that a magnetic bit of data is about 13-15nm long and about 40-60nm wide (narrower tracks for SMR). The length of a bit is constrained by the grain size of the magnetic media. However, the width of a bit (prior to SMR) is actually constrained by the size of the write head. I don't remember why, but I think it has to do with the fact that the write current is something like 40 mA, and the magnetic flux density at the write element is something like 1.5 T (no, that's not a typo).
I'm not an expert on transistor pitches, but here's a chart from Wikipedia for the 10nm - https://en.wikipedia.org/wiki/10_nanometer
It's kind of impressive for HDDs considering that it's a 2 inch long mechanical arm that is able to move with that level of precision.
"I'm going for the 15TB instead of the 14 so I have extra space for backups," said nobody ever. We are clearly close to the end of spinning rust, absent some new breakthrough.
HAMR and then HDMR are expected to allow data densities to increase by 5 to 10 times what is currently achievable. HAMR will probably start showing up in a year or two.
Spinning drives are definitely not going away anytime soon unless there is a much more significant drop in the cost of SSDs.
Investment in new spinning-drive technologies is going away, though. Nobody wants to spend R&D money coming up with patents and ideas which will be worthless in 5 years when SSDs overtake.
Science investment requires a new technology to have a prospect of a return for most of the ~20 year patent lifespan for it to look like a good investment, and spinning bits of metal aren't that right now.
In the same way that hard drives didn't kill off tape, SSD won't kill off hard drives. The price differential is too great for many applications and they have different operational strengths and weaknesses.
Tapes have a use case that hard drives do not. Tapes are the lowest cost/GB stored and are more shelf stable than hard drives.
SSDs are higher performance than HDDs and have none of the packaging constraints. Flash storage is going to be put into everything and the economies of scale look quite good.
Storage capacity is scaling, but the r/w speeds of HDDs aren't keeping up. Follow the trend line and we see huge HDDs that are functionally useless due to how long disk operations take.
HDDs only exist above tape because of their performance, and only exist below SSDs because of cost. Tape is the floor and SSDs are the quickly lowering ceiling; HDDs are likely to be crushed in between.
You might be right, but keep in mind that flash has been scaling due to shrinking semiconductor feature sizes (and additional layers, etc.), so a large part of flash's core R&D and production costs is being spread over all the logic being produced. That has been hitting a wall, so while the capacity/price curves for flash look nice, they likely won't continue, which leaves open the possibility that if rust actually gets a 4-5x boost in the near future the current market trends will continue: SSDs for perf/power/size and mechanical hard drives for bulk nearline storage, leaving tape where it's been for the past 30 years, as an archival technology.
Horizontal feature sizes for flash memory stopped shrinking years ago. The continued improvements in density and production cost have been the result of R&D that is very specifically focused on 3D NAND flash memory and has little in common with R&D for logic circuit fabrication.
That said, on the horizon of multiple years, I agree that the future scalability of NAND flash doesn't look quite as promising as HAMR/MAMR for hard drives. How that translates into actual product demand and adoption will probably depend on the relatively unexplored question of how much performance per TB our applications actually need. 40+ TB hard drives might not be fast enough to actually serve as nearline storage for that volume of data without e.g. multi-actuator technology that essentially gives you more than one hard drive sharing a common spindle motor. Meanwhile, there's no question that QLC NAND flash definitely has adequate read latency and throughput.
Multi-actuator tech sounds interesting, and I wouldn't be surprised to see drives with 5, 10, 50 or 100 read heads per platter at some point.
With 100 independently moving heads per platter, random-access performance could improve by up to a factor of 100. That won't let them overtake SSDs, but it would at least let them close the gap.
So far, nobody has announced plans to manufacture hard drives with two read heads per platter, so speculating about 100 heads per platter seems rather unrealistic. The multi-actuator technology that is actually being developed by Seagate still has only one read head per platter, but out of the eight or so platters in a drive, the read heads for four of them will be controlled by one actuator and the read heads for the other four platters will be controlled by the second actuator.
Going all the way to 100 read heads per platter would be insanely expensive and would massively increase drive failure rates, while still leaving them about four times slower for random reads than the slowest $35 SSD on the market. This will never turn into a viable product.
You state that with an unwarranted degree of certainty. You're making the same argument, and mistake, that proponents of 'X is going to kill hard drives' have made for decades.
There have been many 'this will be the death of hard drives' technologies over the decades: zip drives, optical drives, tape drives (there was a time when they were predicted to be everywhere... never happened), CD (then DVD) writers, etc. Not to mention MRAM which has been the hottest tech that hasn't really happened yet for 3 decades. These were all going to be some combination of more durable and/or cheaper per Xb. But they all lacked the one critical advantage that hard drives had: massive economies of scale. Here's my prediction: spinning rust isn't going anywhere anytime soon.
Isn't most cloud storage currently almost entirely on HDDs at this point? At least, that's the impression I get from looking at Backblaze (probably the cloud storage provider with the most public statistics on what they use). So far, the use of SSDs is fairly minimal on the storage side, although I believe Backblaze does use them for things like boot drives and caching. (https://www.backblaze.com/blog/hard-drive-stats-for-q1-2018/)
I would agree that on consumer systems (which often have a single boot drive), SSDs make a lot of sense. These days you tend to only see HDDs at the very low end.
From a personal perspective, most of my PCs are completely SSD. But I also have a media server (the largest share of the storage reserved for MKV copies of my personal DVDs and Blu-rays), composed of RAID arrays of 6TB HDDs. Right now, the largest "common" SSD capacity is 4TB, roughly in the $800-$1200 range. 4TB HDDs by comparison are quite cheap, as low as $89 for a certain Seagate model (I used Western Digital Reds, which at the moment are a little higher: $115 for the 4TB, $178 for the 6TB model).
While I definitely wouldn't say "HDDs will always be around", the price difference right now is too high for the superior properties of SSDs to justify the switch. Backblaze seems to be in agreement here (https://www.backblaze.com/blog/ssd-vs-hdd-future-of-storage/). I guess the question is how long it will take for large-capacity SSDs to come down in price enough to be competitive. Until that happens, I imagine HDDs have a decent future left, if not in consumer devices then at least as drives for cloud / data center storage.
Backblaze is primarily a backup storage company. Their bread and butter is relatively cold data storage.
Lots of VPS and similar services offer SSD storage now, and I expect that to grow. The one I use didn't even offer a non-SSD option, and it was a pretty cheap provider.
SSDs offer orders of magnitude better random IO; I imagine that allows the hosting company to put more clients on each storage unit, lowering the effective cost of SSDs.
Seagate and WD are spending mountains of cash to develop technologies like HAMR and MAMR which they expect to take them up to 40TB drives. These technologies require entirely new fabrication processes, etc. Very capital intensive.
Not in the next 10 years; further out than that, I don't know. SSDs won't be cheaper than HDDs per GB in 5 years' time; the cost of building flash and scaling it down (multi-layer, 2.5D / 3D / 4D flash, whatever the manufacturers want to call it) is also going up. Return on investment is taking longer (even with the previously high NAND prices). Moore's Law is dead, and not only for CPUs.
Theoretically speaking, there should be a point where the TCO of NAND flash crosses that of HDDs. But as NAND scales down, its write endurance also goes down.
I was never a believer in HAMR, the technology the parent mentioned, which Seagate announced back in 2012; the idea just seems too unrealistic. The MAMR approach WD is taking, using a microwave-generated magnetic field instead of heat, is much better. BPM is also far off, though it has been in research for nearly a decade, and MAMR should be here in 2019. All of this R&D will come to fruition in the next few years, and we can expect HDDs to scale to 100TB within the next 10 years.
Multi-actuator hard drives will not close any performance gap. They will just help slow the decline in IOPS/TB that higher capacity drives bring.
They accomplish this by essentially being multiple hard drives sharing a common spindle and helium-filled enclosure. As Seagate is currently implementing the idea, you still have only one head per platter, and at most one independently moving head per platter (but currently the stack of platters is just divided into two groups). Thus, sequential performance does not improve at all (and actually is reduced by the number of independent actuators), and random I/O increases by a small integer factor when the gap between hard drives and the slowest SSDs is already more than two orders of magnitude. However, power consumption shouldn't be much higher for this kind of multi-actuator hard drive over existing hard drive designs.
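To put rough numbers on that gap (ballpark figures, assumed rather than measured):

    # Scale of the random-read gap for a dual-actuator drive vs. a cheap SSD.
    hdd_iops       = 150        # assumption: ~4K random reads/s for a 7200 rpm drive
    actuators      = 2          # Seagate's dual-actuator design
    cheap_ssd_iops = 50_000     # assumption: entry-level SATA SSD random reads/s

    print(f"dual-actuator HDD: ~{hdd_iops * actuators} IOPS")
    print(f"gap to a cheap SSD: ~{cheap_ssd_iops // (hdd_iops * actuators)}x")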
Back when I was an intern at Imprimis (before they got bought by Seagate), I worked with the Manufacturing Engineers for the Wren series of 5.25” drives, including the Wren VII, which was the first consumer SCSI drive with a capacity of 1GB or more.
I looked at the drive actuators at the time, and I was incredulous when the guys told me that all operations were serial. I asked why they didn’t do parallel reads and writes, and I was told that technology was already common for mainframes but too expensive for consumer gear.
So, fast forward from 1989 to now, and I’m sure that idea will come back — sooner or later.
Agree. And for a NAS, the performance of SSDs is unlikely to be required.
In fact, at one point I made the mistake of enabling SSD caching on a NAS. The SSD became the bottleneck because of the limits of SATA, i.e. one SSD on SATA is slower than 8 or 10 HDDs in RAID 5. So unless you really need very high IOPS, HDDs are likely to be good enough.
I'm curious who sold you a NAS with SSD caching that didn't support bypassing the cache for sequential I/O. That's a pretty basic and obvious feature, and it seems like the manufacturer must not have been taking their SSD caching feature seriously if they didn't implement bypassing.
No, it isn't. That's about 6x the price in the US and at least 5x the price in Australia. No combination of VAT and shipping fees can account for that big of a discrepancy. You're probably just looking at a retailer that has inflated the price while they're out of stock. Try getting a quote from someone who has stock in your country ready to ship.
You can't use flash storage for reliable unpowered archival. It degrades (gates leak electrons) over time, unlike magnetic storage. This is also unpredictable, as reliability depends on both operating and power-off temperatures, as well as existing wear. See: https://www.anandtech.com/show/9248/the-truth-about-ssd-data...
Is tape even cheaper than spinning rust? Last time I priced it, the per-GB costs were similar and the tape drives themselves are quite expensive. Tape is a more reliable backup, since the moving parts are not part of the storage, but it's not a cheaper backup.
Depends on your volume. I don't know how much IT departments typically pay, but on Amazon, hard drives start at about $20-25/TB, whereas LTO-7 tapes are about $10-12/TB. That's a significant savings, but you need to be storing at least a few hundred TB before you recoup the cost of the tape drive.
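Rough break-even math, assuming a standalone LTO-7 drive at around $3,500 (an assumption; actual prices vary a lot):

    # Break-even capacity for buying a tape drive instead of more hard drives.
    hdd_cost_per_tb  = 22.5      # ~$20-25/TB bare drives
    tape_cost_per_tb = 11.0      # ~$10-12/TB LTO-7 media
    tape_drive_cost  = 3500.0    # assumed price of a standalone LTO-7 drive

    savings_per_tb = hdd_cost_per_tb - tape_cost_per_tb
    print(f"break-even at ~{tape_drive_cost / savings_per_tb:.0f} TB")   # ~300 TB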
I know Amazon prices aren't exactly perfect for enterprise storage costs, but they imply tape is at least 4x cheaper (assuming you are using enough to amortize the cost of the tape drive):
EDIT: Commenter below points out this 6.25TB tape is actually 2.5TB physical -- still cheaper in terms of $/GB, but not the 4x I mention above -- closer to 2x.
LTO-6 (what you linked) is not 6.25TB, it’s 2.5TB, despite what Amazon says.
Then add the operational costs, which is the hard part, because the operational costs for a tape are very different from the operational costs for a hard disk.
That's an apples-to-oranges comparison. The "6.25TB" number for LTO tapes is assuming a fairly arbitrary 2.5:1 lossless compression ratio. The actual data capacity of the tape is only 2.5TB.
You can feed as many tapes as you wish into a single drive, meaning the cost of the tape drive becomes irrelevant past a certain number of tapes.