Dropping off the bus is the best-case failure, really. It's more annoying when one disk's writes become slower than the others', causing confusing performance profiles for the overall array. Having good metrics for each disk (we use telegraf) helps flag it early. On my zfs pools, monitoring per-disk IO together with smartmon metrics helps tease that out. For SSDs, probably the worst case is a firmware bug that triggers on all disks at the same time, e.g. the infamous HP SSD failure at 32,768 hours of use. Yikes!
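Not our actual telegraf setup, but a rough sketch of the kind of per-disk check that catches the "one disk is quietly degrading" case. The device paths, JSON keys, and outlier slack are assumptions (smartctl --json output varies by drive type and smartmontools version):

    # Sketch: poll each pool member with smartctl --json and flag a disk whose
    # wear/error counters drift away from its siblings. Paths and thresholds
    # are invented for illustration.
    import json
    import subprocess

    DISKS = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]  # hypothetical pool members

    def nvme_health(dev):
        out = subprocess.run(["smartctl", "--json", "-a", dev],
                             capture_output=True, text=True, check=False).stdout
        log = json.loads(out).get("nvme_smart_health_information_log", {})
        return {k: log.get(k) for k in ("percentage_used", "available_spare", "media_errors")}

    def outliers(stats, field, slack=10):
        vals = {d: s[field] for d, s in stats.items() if s.get(field) is not None}
        if len(vals) < 2:
            return []
        best = min(vals.values())
        return [d for d, v in vals.items() if v - best > slack]  # noticeably worse than the pool

    stats = {d: nvme_health(d) for d in DISKS}
    for field in ("percentage_used", "media_errors"):
        for dev in outliers(stats, field):
            print(f"WARN: {dev} is an outlier on {field}: {stats[dev]}")

The same numbers scraped by telegraf and graphed per device do the job in practice; the point is comparing disks against each other rather than against an absolute threshold.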
We had ones that hit that failure mode with about 80% of their rated life left. Zero negative SMART metrics, they just slowed down.
My hunch is that they don't expose anything because that makes it harder to claim a warranty refund.
After getting burned by consumer drives I decided it was time for a zfs array built from used enterprise SSDs. They already have tons of writes on them, but it's a fully mirrored config and zfs is easier to back up, so it should be OK. The really noisy stuff like logging I'm just sticking on Optanes; those are rated 6+ DWPD depending on the model, which may as well be unlimited for personal use.
Do you just source these from eBay? Any guidelines for what's a good used enterprise SSD? I had considered this route after I built my ZFS array based on consumer SSDs. The endurance numbers on the enterprise drives are just so much higher.
Yeah, eBay. In general I've found buying enterprise gear off eBay to be quite safe... all the shysters are in the consumer space.
>Any guidelines for what's a good used enterprise SSD?
Look at the seller's other items. You want them to be selling data-center gear.
Look at how many they have for sale - someone clearing out a server won't have 1 drive, they'll have half a dozen plus.
Look for SMART data in the listing / a guaranteed minimum health (rough sketch of that check below).
I mostly bought S3500/3600/3700-series Intel SSDs. The endurance numbers vary, so you'll need to look up whatever you find.
>The endurance numbers on the enterprise drives are just so much higher.
That plus I'm more confident they'll actually hit them
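For the "look for SMART data" point above, this is roughly the sanity check I'd run on a used SATA enterprise drive before trusting a listing. The attribute names are the usual Intel DC ones and other vendors differ, so treat it as a sketch:

    # Pull power-on hours, reallocated sectors, and the Intel wearout indicator
    # from smartctl's JSON output. Attribute names/IDs vary by vendor.
    import json
    import subprocess
    import sys

    def ata_attrs(dev):
        out = subprocess.run(["smartctl", "--json", "-a", dev],
                             capture_output=True, text=True, check=False).stdout
        table = json.loads(out).get("ata_smart_attributes", {}).get("table", [])
        return {row["name"]: row for row in table}

    a = ata_attrs(sys.argv[1])  # e.g. python3 check_used_ssd.py /dev/sda
    hours = a.get("Power_On_Hours", {}).get("raw", {}).get("value")
    realloc = a.get("Reallocated_Sector_Ct", {}).get("raw", {}).get("value")
    # On Intel DC drives, attribute 233 (Media_Wearout_Indicator) starts at 100
    # and counts down toward 1 as rated endurance is consumed.
    wear = a.get("Media_Wearout_Indicator", {}).get("value")
    print(f"power-on hours={hours}, reallocated={realloc}, wearout remaining={wear}")
    if wear is not None and wear < 80:
        print("more worn than I'd want from a 'lightly used' listing")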
The most common catastrophic failure you’ll see in SSDs: the entire drive simply drops off the bus as though it were no longer there.
Happened to me last week.
I just put the drive in a plastic bag in the freezer for 15 minutes, and it worked again.
I copied the data over to my laptop and then set up a new server.
But it doesn't always work like a charm.
Please always keep a backup of your documents, and a recent snapshot of critical systems.
To be perfectly fair though, this isn't a failure mode that's new with SSDs.
Drive controllers on HDDs also just suddenly go to shit and drop off the bus.
I guess the difference being that people expect the HDD to fail suddenly whereas with a solid state device most people seem to be convinced that the failure will be graceful.
I don't know how true this is, but it seems to me that SSD firmware has to be more complex than HDD firmware and I've seen far more SSDs die due to firmware failure than HDDs. I've seen HDDs with corrupt firmware (junk strings and nonsense values in the SMART data for example), but usually the drive still reads and writes data. In contrast I've had multiple SSDs, often with relatively low power-on hours, just suddenly die with no warning. Some of them even show up as a completely different (and totally useless) device on the bus. Drives with Sandforce controllers used to do this all of the time, which was a problem because Sandforce hardware was apparently quite affordable and many third party drives used their chips.
I have had a few drives go completely read only on me, which is always a surprise to the underlying OS when it happens. What is interesting is you can't predict when a drive might go read-only on you. I've had a system drive that was only a couple of years old and running on a lightly loaded system claim to have exhausted the write endurance and go read only, although to be fair that drive was a throwaway Inland brand one I got almost for free at Microcenter.
If you really want to see this happen try setting up a Raspberry Pi or similar SBC off of a micro-SD card and leave it running for a couple of years. There is a reason people who are actually serious about those kinds of setups go to great lengths to put the logging on a ramdisk and shut off as much stuff as possible that might touch the disk.
> it seems to me that SSD firmware has to be more complex than HDD firmware
I think they're complicated in different ways. A hard disk drive has to power an electromagnet in a motor, swing an arm around, read the magnetic polarity of whatever part of the platter is under the read head, and correlate that to something? Oh, and there are multiple read heads. Seems ridiculously complex!
> I guess the difference being that people expect the HDD to fail suddenly whereas with a solid state device most people seem to be convinced that the failure will be graceful.
This is exactly the opposite of my lived experience. Spinners fail more often than SSDs, but I don't remember any sudden failures with spinners; as far as I can recall, they all had pre-failure indicators, like terrible noises (doesn't help for remote disks), SMART indicators, failed reads/writes on a couple of sectors here and there, etc. If you don't have backups but you notice in a reasonable amount of time, you can salvage most of your data. Certainly, sometimes a drive just won't spin up because of a bearing/motor issue; but sometimes you can rotate the drive manually to get it started and capture some data.
The vast majority of my SSD failures have been the disappear-from-the-bus kind; lots of people say they should fail read-only, but I've not seen it. If you don't have backups, your data is all gone.
Perhaps I missed the pre-failure indicators from SMART, but it's easier when drives fail but remain available for inspection --- look at a healthy drive, look at a failed drive, see what's different, look at all your drives, predict which one fails next. For drives that disappear, you've got to read and collect the stats regularly and then go back and see if there was anything... I couldn't find anything particularly predictive. I feel disappearing from the bus is more in the firmware-error category than a physical storage problem, so there may not be real indications, unless it's a power-on-time-based failure...
For what it is worth the SMART diagnostics and health indicators have rarely been useful for me, either on SSDs or HDDs. I don't think I've ever had a SMART health warning before a drive dies. Although I did have one drive that gave a "This drive is on DEATH'S DOOR! Replace it IMMEDIATELY!" error for 3 years before I finally got around to replacing it, mostly to avoid having my OS freak out every time it booted up.
We have a fleet of a few hundred HDDs that is basically being replaced "on next failure" with SSDs, and sudden death is BY FAR rarer on HDDs; maybe one out of 100 "just dies".
Usually a drive either starts returning media errors or slows down (and if it is not replaced in time, the slowing-down drive usually turns into a media-error one).
SSDs (at least the big fleet of Samsung ones we had) are much worse: they just turn off, not even going read-only. Of course we have redundancy so it's not really a problem, but if the same thing happened on someone's desktop they'd be screwed if they don't have backups.
Always make backups to HDD and cloud (and possibly tape if you are a data nut).
I don't think one should worry as much about which media they are backing up to as about whether they can answer the question "does my data resiliency match my retention needs".
And regularly test that restores actually work; nothing is worse than thinking you had backups and then finding they don't restore right.
The text is wrong about CRCs: everyone uses pretty heavy ECC, so it's not just a re-read. ECC also provides a somewhat graduated measure of the block's actual health, so the housekeeping firmware can decide whether to stop using the block (i.e., move the content elsewhere).
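A purely conceptual sketch of that graduated-health idea (not real controller code; every number here is invented): the decoder reports how many bits it had to correct, and housekeeping retires a block once that count eats too far into the ECC margin, well before the data becomes unreadable.

    ECC_CORRECTABLE_BITS = 120   # pretend the code can fix up to 120 bits per codeword
    RETIRE_FRACTION = 0.75       # retire once a read consumes 75% of that margin

    worst_seen = {}              # block id -> worst corrected-bit count observed

    def after_read(block_id, corrected_bits):
        worst_seen[block_id] = max(worst_seen.get(block_id, 0), corrected_bits)
        if corrected_bits >= RETIRE_FRACTION * ECC_CORRECTABLE_BITS:
            # Still readable today, but the health margin is shrinking:
            # move the contents elsewhere and stop allocating this block.
            print(f"retiring block {block_id} (corrected {corrected_bits} bits)")

    after_read(42, 8)    # healthy block: nothing happens
    after_read(43, 97)   # degraded block: retired proactively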
I'm also not a fan of the "buy bigger storage" concept, or the conspiracy theory about 480 vs 512.
It sure would be nice if, when considering a product, you could just look at some claimed stats from the vendor about time-related degradation, firmware sparing policy, etc. We shouldn't have to guess!
> I'm also not a fan of the "buy bigger storage" concept, or the conspiracy theory about 480 vs 512.
I don't understand why this is being called a "conspiracy theory"; but if you want some very concrete evidence that this is how they work, a paper was recently published that analyzed the behavior and endurance of various SSDs, and the data would be very difficult to explain with any other theory: comparing apples to apples, the drives with better write endurance are merely overprovisioned so that the wear-leveling algorithm doesn't cause as much write amplification while reorganizing.
https://news.ycombinator.com/item?id=44985619
> OP on write-intensive SSD. SSD vendors often offer two versions of SSDs with similar hardware specifications, where the lower-capacity model is typically marketed as “write-optimized” or “mixed-use”. One might expect that such write-optimized SSDs would demonstrate improved WAF characteristics due to specialized internal designs. To investigate this, we compared two Micron SSD models: the Micron 7450 PRO, designed for “read-intensive” workloads with a capacity of 960 GB, and the Micron 7450 MAX, intended for “mixed-use” workloads with a capacity of 800 GB. Both SSDs were tested under identical workloads and dataset sizes, as shown in Figure 7b. The WAF results for both models were identical and closely matched the results from the simulator. This suggests that these Micron SSDs, despite being marketed for different workloads, are essentially identical in performance, with the only difference being a larger OP on the “mixed-use” model. For these SSD models, there appear to be no other hardware or algorithmic improvements. As a result, users can achieve similar performance by manually reserving free space on the “read-intensive” SSD, offering a practical alternative to purchasing the “mixed-use” model.
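To put rough numbers on the quoted Micron example: assuming both SKUs are built from about 1 TiB of raw NAND (an assumption, not something the paper states), the difference is just how much of it is held back as spare area:

    # Effective spare area implied by each marketed capacity, assuming ~1024 GiB
    # of raw NAND behind both SKUs.
    RAW_GIB = 1024
    GIB = 2**30

    def op_percent(usable_gb):
        usable_gib = usable_gb * 1e9 / GIB          # marketed GB -> GiB
        return (RAW_GIB - usable_gib) / usable_gib * 100

    for label, usable_gb in [("7450 PRO, 960 GB", 960), ("7450 MAX, 800 GB", 800)]:
        print(f"{label}: ~{op_percent(usable_gb):.0f}% spare area")
    # ~15% vs ~37%: same flash, the "mixed-use" SKU just keeps more in reserve,
    # which is why leaving ~15-20% of the PRO unused gets you similar behaviour.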
This article misses several important points.
- Consumer drives like the Samsung 980 Pro and WD Black SN850 use TLC as SLC when about 30+% of the drive is erased. In that state you can burst-write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.
- Write amplification factor (WAF) is not discussed. Random small writes and partial block deletions will trigger garbage collection, which ends up rewriting data to reclaim freed space in a NAND block.
- A drive with a lot of erased blocks can endure more TBW than one that has all user blocks with data. This is because garbage collection can be more efficient. Again, enable TRIM on your fs.
- Overprovisioning can be used to increase a drive’s TBW. If, before you write anything to your 0.3 DWPD 1024 GB drive, you partition it so that you only use 960 GB, you now have a 1 DWPD drive (see the sketch after this list).
- Per the NVMe spec, there are indicators of drive health in the SMART log page.
- Almost all current datacenter or enterprise drives support an OCP SMART log page. This allows you to observe things like the write amplification factor (WAF), rereads due to ECC errors, etc.
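A rough sketch of the arithmetic behind the overprovisioning bullet above. The rated TBW is derived from the 0.3 DWPD figure over an assumed 5-year warranty, and the write-amplification numbers are invented, so the exact endurance gain depends on workload and firmware:

    WARRANTY_DAYS = 5 * 365   # assumed warranty period

    def dwpd(host_tbw, usable_tb):
        # DWPD = total host writes over the warranty / (usable capacity * days)
        return host_tbw / (usable_tb * WARRANTY_DAYS)

    rated_tbw = 0.3 * 1.024 * WARRANTY_DAYS      # ~560 TB of host writes at 0.3 DWPD

    # Shrinking the used area to 960 GB helps twice: each "drive write" is smaller,
    # and the extra spare space lowers write amplification, so the same NAND wear
    # budget absorbs more host writes. The WAF values below are made up.
    waf_full, waf_with_op = 3.5, 1.2
    nand_budget = rated_tbw * waf_full           # what the flash itself can absorb
    host_tbw_with_op = nand_budget / waf_with_op

    print(f"as rated: {dwpd(rated_tbw, 1.024):.2f} DWPD")
    print(f"with a 960 GB partition: {dwpd(host_tbw_with_op, 0.960):.2f} DWPD")
    # ~0.3 -> ~0.9 DWPD under these assumptions, roughly the jump described above.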
You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.
> - Consumer drives like the Samsung 980 Pro and WD Black SN850 use TLC as SLC when about 30+% of the drive is erased. In that state you can burst-write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.
This is true, but despite all of the controversy about this feature it’s hard to encounter this in practical consumer use patterns.
With the 980 Pro 1TB you can write 113GB before it slows down. (Source https://www.techpowerup.com/review/samsung-980-pro-1-tb-ssd/... ) So you need to be able to source that much data from another high speed SSD and then fill nearly 1/8th of the drive to encounter the slowdown. Even when it slows down you’re still writing at 1.5GB/sec. Also remember that the drive is factory overprovisioned so there is always some amount of space left to handle some of this burst writing.
For as much as this fact gets brought up, I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations, but even in slow mode you’re filling the entire drive capacity in under 10 minutes.
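Quick arithmetic on that last point, using the numbers above (113 GB of SLC cache at ~5 GB/s, then ~1.5 GB/s for the rest of a 1 TB drive):

    cache_gb, fast_gbps, slow_gbps, capacity_gb = 113, 5.0, 1.5, 1000
    seconds = cache_gb / fast_gbps + (capacity_gb - cache_gb) / slow_gbps
    print(f"~{seconds / 60:.0f} minutes to fill the whole drive")   # ~10 minutes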
> You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.
I think it is safe to say that all drives have this. Refer to the available spare field in the SMART log page (likely via smartctl -a) to see the percentage of factory overprovisioned blocks that are still available.
I hypothesize that as this OP space dwindles writes get slower because they are more likely to get bogged down behind garbage collection.
> I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations
I agree. I agree so much that I question the assertion that drive slowness is a major factor in machines feeling slow. My slow laptop is about 5 years old. Firefox spikes to 100+% CPU for several seconds on most page loads. The drive is idle during that time. I place the vast majority of the blame on software bloat.
That said, I am aware of credible assertions that drive wear has contributed to measurable regression in VM boot time for a certain class of servers I’ve worked on.
> You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.
This has always been the case, which is why even a decade ago the "pro" drives came in odd sizes like 120 GB vs 128 GB.
Products like that still exist today, and the problem tends to show up as drives age and that spare pool shrinks.
DWPD and TBW ratings, like modern consumer drives use, are just different ways of communicating that contract.
FWIW, if you do a drive-wide discard and then partition only 90% of the drive, you can dramatically reduce the garbage-collection slowdown on consumer drives (rough sketch below).
In the world of ML and containers you can hit that slowdown if you, say, have fstrim scheduled once a week to avoid the cost of online discards.
I would rather have visibility into the size of the reserve space through SMART, but I doubt that will happen.
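Roughly what the drive-wide discard plus 90% partition mentioned above looks like. The device path is hypothetical, the commands are destructive, and weekly trimming afterwards is just util-linux's fstrim.timer (or a cron'd fstrim -av), so treat this as notes rather than a tool:

    # Destroys everything on DEVICE: drive-wide discard, then leave ~10%
    # of the drive unpartitioned as extra spare area for the controller.
    import subprocess

    DEVICE = "/dev/nvme0n1"   # hypothetical; double-check before running anything like this

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run("blkdiscard", DEVICE)                              # everything becomes free space again
    run("parted", "-s", DEVICE, "mklabel", "gpt",
        "mkpart", "data", "0%", "90%")                     # last ~10% stays unpartitioned
    run("mkfs.ext4", DEVICE + "p1")                        # NVMe partitions get a 'p' suffix
    run("systemctl", "enable", "--now", "fstrim.timer")    # weekly TRIM of the filesystem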