
How SSDs Actually Work


There is a comforting fiction that programmers carry in their heads about storage. Your filesystem asks the kernel to write some bytes at logical block address 4096. The kernel sends a command to the drive. The drive writes the bytes at LBA 4096. Next time you read LBA 4096, you get the same bytes back. Simple, mechanical, predictable.

For a hard drive this is roughly correct. For an SSD it is nearly a lie. An SSD is a small computer with its own CPU, its own DRAM, its own complex firmware, and a physical storage medium (NAND flash) that does not support the operations the operating system thinks it does. Every write command is rewritten under the hood. Every read goes through a translation layer. Every deletion triggers background work you cannot see. The drive you think you are writing to and the drive that actually exists are only loosely related.

This article explains how SSDs actually work, starting from the physics of a flash cell and building up through the Flash Translation Layer to the observable behaviours that cause so much confusion in practice. The goal is to give you a mental model accurate enough that performance cliffs, endurance limits, and the strange behaviour of consumer drives under pressure all make sense.

What a Flash Cell Is

Every NAND flash cell is a transistor with a twist. It has a normal source, drain, and control gate, plus an extra floating gate sandwiched between the control gate and the channel, insulated on all sides by a thin layer of silicon dioxide. Electrons trapped on the floating gate modify the transistor's threshold voltage: the gate voltage at which current starts flowing from source to drain.

You write a bit by forcing electrons through the insulator and onto the floating gate. This is done using Fowler-Nordheim tunnelling, a quantum-mechanical effect that lets electrons pass through a thin oxide barrier when the electric field across it is strong enough. Applying a high voltage (typically 15 to 20 volts) to the control gate pulls electrons up onto the floating gate. Once there, the insulator traps them, and the threshold voltage of the transistor shifts. No power is needed to retain the charge; the electrons stay for years.

To read the cell, the drive applies a series of gate voltages and watches whether current flows. A cell with no stored charge has a low threshold voltage: even a small gate voltage turns on the channel. A cell with trapped electrons has a higher threshold voltage: it takes more gate voltage to turn the channel on. By sweeping the gate voltage and recording at which point the current switches on, the drive decodes the stored charge back into bits.

The number of bits per cell depends on how finely the drive can discriminate between charge levels.

  • SLC (Single-Level Cell): one bit per cell. Two states (erased, programmed). Fastest, longest-lived, most expensive. Used in industrial and enterprise SSDs.
  • MLC (Multi-Level Cell): two bits per cell. Four states. Good balance of endurance and density. Mostly gone from consumer drives since 2019.
  • TLC (Triple-Level Cell): three bits per cell. Eight states. The dominant consumer flash today. Cheap per gigabyte, respectable endurance.
  • QLC (Quad-Level Cell): four bits per cell. Sixteen states. Cheapest per gigabyte, lowest endurance. Common in budget consumer SSDs since around 2020.
  • PLC (Penta-Level Cell): five bits per cell. Thirty-two states. Demonstrated in research but not shipping at scale. The economics are marginal.

Each added bit per cell halves the voltage margin between adjacent states. SLC has one boundary to detect. TLC has seven. QLC has fifteen. Smaller margins mean more sensitivity to noise, temperature, retention drift, and read disturb. This directly translates to lower endurance: an SLC cell can typically survive around 100,000 program/erase cycles, TLC around 3,000, QLC around 1,000. High-end modern 3D NAND has pushed TLC endurance up with thicker cells and more layers, but the ordering is unchanged.
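The counts above follow directly from the number of bits per cell, and a few lines of Python can tabulate them. This is a minimal sketch: the function name is made up for illustration, and the "relative margin" assumes a fixed voltage window divided evenly between states, which is a simplification rather than a model of any real NAND part.

```python
def cell_levels(bits_per_cell: int) -> dict:
    """Return state count, boundary count, and relative margin for one cell."""
    states = 2 ** bits_per_cell
    boundaries = states - 1
    # Assumption: the usable threshold-voltage window is fixed and split
    # evenly between states, so the spacing scales as 1 / boundaries.
    relative_margin = 1 / boundaries
    return {"states": states, "boundaries": boundaries,
            "relative_margin": relative_margin}


for name, bits in [("SLC", 1), ("MLC", 2), ("TLC", 3), ("QLC", 4), ("PLC", 5)]:
    info = cell_levels(bits)
    print(f"{name}: {info['states']:2d} states, {info['boundaries']:2d} boundaries, "
          f"margin {info['relative_margin']:.3f} of the window")
```

Under the even-spacing assumption the margin falls to somewhat less than half with each added bit, which is where the rule of thumb in the text comes from.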

3D NAND: Going Vertical

The original NAND flash was a flat grid of cells laid out on a two-dimensional plane. Every capacity jump came from shrinking the cell size. By 2013, the industry had run into a physical wall: below about 15 nm feature size, the number of electrons on a single floating gate dropped to a few dozen, and cell-to-cell interference became unmanageable.

The answer was 3D NAND: stacking cells vertically. Instead of packing cells more tightly in the plane, manufacturers etched deep holes through a stack of alternating conductive and insulating layers, then deposited a charge-trap layer inside each hole. Each layer in the stack corresponds to one row of cells. A single vertical string passes through dozens or hundreds of layers, each layer adding more cells without shrinking any individual one.

Samsung shipped the first commercial 3D NAND (called V-NAND) in 2013 with 24 layers. By 2026, production parts from Samsung, Micron, SK Hynix, Kioxia, and YMTC run in the 230 to 320 layer range. Each added layer adds density without shrinking the cells: the same silicon area holds more bits without any finer lithography.

The cells themselves can be larger than the equivalent 2D node, which actually improves endurance and retention. Modern 3D TLC cells are more robust than 2D MLC cells from a decade earlier. This is why a €60 consumer SSD in 2026 has endurance and reliability that would have required a €400 enterprise SSD in 2016.

The trade-off is etching quality. A 300-layer stack has deep, narrow holes that must be uniformly etched. Variation in hole diameter along the depth of the stack creates variation in cell behaviour, so drives have to do per-layer calibration during manufacturing and ongoing compensation during operation. Bit error rates tend to be worst at the top and bottom of the stack and best in the middle.

Pages, Blocks, and the Asymmetry Problem

The central oddity of NAND flash is its access granularity. The unit you can read from is a page. The unit you can write to is also a page. But the unit you must erase before you can write again is a block, which contains many pages. You cannot erase a single page. You cannot rewrite a page in place. To change even one byte in place, the drive would have to read out every unaffected page in the containing block, erase the whole block, and then write back both the saved pages and the modified one.

Typical sizes in modern 3D NAND:

  • Page: 16 KiB (sometimes 8 KiB or 4 KiB on older flash, up to 32 KiB on some QLC).
  • Block: 256 to 1024 pages, so 4 MiB to 16 MiB of data per block with 16 KiB pages.
  • Plane: many blocks operating semi-independently for parallelism.
  • Die: several planes on a single piece of silicon.
  • Package: several dies stacked in one chip.

The asymmetry produces a very specific problem. Say the filesystem wants to update a single 4 KiB filesystem block. The drive cannot rewrite the old physical page. It has to either erase the containing 16 MiB NAND block (too expensive for every write) or write the new data somewhere else and mark the old location invalid.

SSDs take the second option. Every write goes to a new, freshly erased page. The old page becomes "invalid" but stays physically present until garbage collection reclaims its containing block. This is the fundamental behaviour of every modern flash drive, and the entire Flash Translation Layer exists to hide it.

The Flash Translation Layer

The Flash Translation Layer (FTL) is firmware running on the SSD controller. Its job is to make NAND flash look like a conventional block device to the host: a linear array of sectors you can read and write at will. It does this by maintaining a mapping from logical block addresses (LBAs, what the OS asks for) to physical pages on the NAND (where the data actually lives).

A typical FTL maintains a page-level mapping table in DRAM on the controller. Each entry is roughly 4 bytes and maps one 4 KiB LBA to one physical page. For a 1 TB SSD with 4 KiB LBAs, that is about 1 GB of mapping table. This is why consumer SSDs have DRAM cache chips visible on their PCBs and why DRAM-less SSDs (which keep the table in a small SRAM plus host memory via HMB) exist as a cheaper option.
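The table-size estimate is simple arithmetic, sketched below. The function name is hypothetical, and the flat one-entry-per-LBA layout with 4-byte entries is the simplified model described in the text, not the layout of any particular controller.

```python
def mapping_table_bytes(capacity_bytes: int, lba_size: int = 4096,
                        entry_bytes: int = 4) -> int:
    """Size of a flat page-level mapping table with one entry per logical block."""
    entries = capacity_bytes // lba_size
    return entries * entry_bytes


one_tb = 10 ** 12
table = mapping_table_bytes(one_tb)
print(f"{table / 10**9:.2f} GB of mapping table for a 1 TB drive")
```

The result is just under 1 GB, which is the origin of the common rule of thumb of roughly 1 GB of DRAM per 1 TB of flash.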

When the OS writes LBA 12345:

  1. The FTL allocates a fresh physical page from a pool of pre-erased pages.
  2. It writes the new data to that page.
  3. It updates the mapping table: LBA 12345 → new physical page.
  4. It marks the old physical page as invalid.

When the OS reads LBA 12345:

  1. The FTL looks up the mapping: LBA 12345 → physical page X.
  2. It reads physical page X from NAND.
  3. It returns the data to the host.

From the host's perspective, the LBA is stable. Reading the same LBA always returns the last-written data, until you write something else. From the NAND's perspective, data moves around constantly. There is no permanent assignment of an LBA to any specific physical page.

This is the key mental model shift. The LBA the OS uses is a logical identifier, not a physical location. The drive maintains a mapping that changes on every write.
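The write and read paths above can be sketched as a toy FTL. This is an illustration under heavy simplifying assumptions: one LBA per physical page, no garbage collection, no persistence, and a class name (ToyFTL) invented for the example.

```python
class ToyFTL:
    """Page-level mapping: logical block address -> physical page number."""

    def __init__(self, physical_pages: int):
        self.free = list(range(physical_pages))  # pre-erased pages, lowest first
        self.mapping = {}                        # LBA -> physical page
        self.invalid = set()                     # pages holding stale data
        self.nand = {}                           # physical page -> stored bytes

    def write(self, lba: int, data: bytes) -> int:
        if not self.free:
            raise RuntimeError("no erased pages left; a real FTL would run GC here")
        page = self.free.pop(0)                  # 1. allocate a fresh page
        self.nand[page] = data                   # 2. program it
        old = self.mapping.get(lba)
        self.mapping[lba] = page                 # 3. update the mapping
        if old is not None:
            self.invalid.add(old)                # 4. mark the old page invalid
        return page

    def read(self, lba: int) -> bytes:
        page = self.mapping[lba]                 # look up the mapping...
        return self.nand[page]                   # ...then read that page


ftl = ToyFTL(physical_pages=8)
first = ftl.write(12345, b"version 1")
second = ftl.write(12345, b"version 2")
print(first, second, ftl.read(12345), sorted(ftl.invalid))
```

Two writes to the same LBA land on two different physical pages, the read returns the newer data, and the first page is left behind as garbage for later collection.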

DRAM-less and HMB

DRAM-less SSDs have become common at the low end of the consumer market. Instead of a dedicated DRAM chip on the PCB holding the full page mapping table, they use a small amount of on-controller SRAM (a few megabytes) as a cache for recently accessed parts of the table, with the bulk of the table stored in NAND or borrowed from host memory.

The NVMe specification includes a feature called Host Memory Buffer (HMB). The drive asks the host for a small chunk of RAM (typically 16 to 64 MiB) at initialisation time. The host allocates that memory and gives the drive a DMA-accessible pointer to it. The drive uses this memory as an extension of its own mapping cache.

HMB works remarkably well for sequential workloads: most accesses fall into a small working set of mapping entries, and the cached portion stays hot. It works less well for truly random access across a large drive, where every read can miss the cache and force a NAND lookup for the mapping itself. This is why DRAM-less drives often show significantly lower random read IOPS than DRAM-backed drives at the same price point.

DRAM-less drives are cheaper to build (no DRAM chip, simpler PCB) and more power-efficient (DRAM draws constant current even when idle). The performance penalty has shrunk as HMB implementations matured, to the point where most consumer laptops in the 2026 entry segment ship with DRAM-less drives that feel perfectly fine for normal workloads.

Garbage Collection

Over time, invalid pages accumulate. Each time the OS overwrites an LBA, the previous physical page becomes invalid. Eventually blocks are full of mostly-invalid pages with a few still-valid ones scattered throughout. At some point the drive needs to reclaim that space.

Reclamation works by picking a victim block with a high invalid ratio, copying its remaining valid pages to a fresh block, and then erasing the victim. The erase leaves an empty block ready for new writes. This is garbage collection (GC), and it runs continuously in the background on every modern SSD.

The cost of GC is what makes SSD performance complicated. Every valid page that has to be copied is an extra write that the host did not ask for. This is called write amplification, and it is the most important number you have never heard of.

Write Amplification Factor (WAF) is the ratio of actual NAND writes to host writes. A WAF of 1.0 means every byte the host writes causes exactly one byte of NAND wear. A WAF of 3.0 means three bytes of wear per host byte, so the drive's endurance is one third of the raw NAND endurance. Consumer drives typically run at WAF between 1.1 and 5 depending on workload; enterprise drives are tuned for WAF close to 1.

The formula for write amplification under a random-write workload is illuminating:

WAF ≈ 1 / (1 - (1 - OP)^α)

Where OP is the overprovisioning ratio (spare capacity as a fraction of user capacity) and α depends on the specific workload. With 7% overprovisioning (typical consumer), random workloads can produce WAF of 5 or more. With 28% overprovisioning (typical enterprise), the same workload might produce WAF of 2. This is why "enterprise" SSDs are often physically identical to consumer drives but configured with more spare capacity.
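To get a feel for the formula, it can be evaluated for the two overprovisioning levels mentioned. The text does not give a value for α, so the α = 3 below is purely an assumption chosen for illustration because it lands near the quoted figures; it is not a measured constant.

```python
def waf_estimate(op: float, alpha: float) -> float:
    """Evaluate WAF ~= 1 / (1 - (1 - OP)^alpha) from the text."""
    return 1.0 / (1.0 - (1.0 - op) ** alpha)


ALPHA = 3.0  # assumption for illustration only; the article leaves alpha open
for op in (0.07, 0.28):
    print(f"OP = {op:.0%}: WAF ~ {waf_estimate(op, ALPHA):.2f}")
```

Whatever value α takes, the shape is the same: WAF falls steeply as spare capacity grows from a few percent to a few tens of percent.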

GC Foreground vs Background

Garbage collection runs in two modes: background and foreground. Background GC runs when the drive is idle, opportunistically freeing up blocks before the host asks for them. This is why leaving an SSD powered on for an hour after a heavy write burst noticeably improves its subsequent write performance: background GC has had time to prepare fresh blocks.

Foreground GC kicks in when the pool of free blocks drops below a threshold while the host is actively writing. The drive has no choice but to interleave GC work with host writes, which means every host write becomes slower (the controller has to pause and do some GC before the write can proceed). This is the main cause of the write-performance collapse you see on a nearly-full drive that has been hammered with random writes: the drive is not slow because the flash is full, it is slow because every write triggers synchronous GC work.

Smart drives tune their GC aggressiveness based on the recent workload. A drive that sees continuous heavy writes keeps a larger buffer of free blocks ready. A drive that sees mostly reads lets its free block pool shrink to give back SLC cache space. The tuning happens automatically and is invisible to the host.

One consequence: SSDs are not truly idle when the host stops sending commands. They are running GC, monitoring read disturb counts, doing retention scrubbing, and rebalancing wear. Power draw of a modern consumer SSD at idle can be 500 mW to 1 W, noticeably more than the 50 mW of a hard drive in standby. NVMe introduces low-power states (PS3, PS4) that suspend some of this activity at the cost of wakeup latency, but a drive that never sleeps will eat battery life on a laptop.

Overprovisioning and Spare Area

Overprovisioning is the difference between the drive's raw NAND capacity and the capacity it exposes to the host. A 1 TB consumer SSD typically has 1 TiB of actual NAND (about 1,100 billion bytes) but exposes only about 1,024 billion bytes, giving around 7% overprovisioning. An enterprise drive might have the same raw capacity but expose only 800 GB, which vendors conventionally describe as 28% (counting the raw NAND as 1,024 GB; measured in actual bytes the spare fraction is closer to 37%).

The spare capacity is used for:

  • Fresh blocks to write to before GC completes.
  • Replacement blocks for worn-out cells.
  • Metadata storage (mapping tables, wear-levelling state, error correction parity).
  • Write buffering during garbage collection.

Users can add their own overprovisioning by leaving part of the drive unallocated. If you buy a 1 TB consumer SSD and partition it as 900 GB, the unallocated 100 GB acts as extra spare area (once TRIM has been issued for the unused region). This is a common tactic for workloads that hit the consumer drive's GC hard, trading user capacity for write endurance and sustained write performance.
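A rough estimate of what the 900 GB partition buys can be sketched as follows. The 1 TiB raw NAND figure is an assumption, and the calculation ignores flash the firmware reserves for metadata, bad-block replacement, and SLC caching, so treat the result as an upper bound.

```python
TIB = 2 ** 40


def overprovisioning(raw_bytes: int, used_bytes: int) -> float:
    """Spare capacity as a fraction of the capacity actually in use."""
    return (raw_bytes - used_bytes) / used_bytes


raw = 1 * TIB                      # assumption: 1 TiB of raw NAND
partitioned = 900 * 10 ** 9        # the 900 GB partition from the text
print(f"effective spare: {overprovisioning(raw, partitioned):.1%}")
```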

Wear Levelling

NAND cells have a finite number of program/erase cycles before they fail. If the FTL repeatedly programmed and erased the same physical block (for example, because the OS kept updating the same LBA), that block would wear out quickly while the rest of the drive remained pristine. Wear levelling is the set of algorithms that spread wear evenly across the whole drive.

Two kinds of wear levelling exist:

Dynamic wear levelling operates on data that is already being written. When allocating a fresh block for new data, the FTL picks the least-worn free block instead of any random one. This naturally spreads writes across the array.

Static wear levelling moves cold data (data that has not been written for a long time) to high-wear blocks to free up low-wear blocks for hot data. Without static wear levelling, an SSD that stores mostly unchanging data (for example, a Linux root filesystem) would concentrate all its wear on the small set of blocks repeatedly reused for updates, wearing them out while the blocks holding cold data are never erased. Periodic static wear levelling keeps the drive healthy.

Wear levelling is why you cannot tell the drive "store this file on blocks 1000-2000". Every write lands wherever the FTL decides is best for wear distribution. It also means the physical location of any given byte drifts over time, even if the LBA does not change.
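The two policies can be sketched in a few lines. Both function names and the wear-gap threshold are hypothetical; real firmware uses vendor-specific heuristics that are not public.

```python
def pick_free_block(erase_counts: dict, free_blocks: set) -> int:
    """Dynamic wear levelling: choose the least-worn block that is free."""
    return min(free_blocks, key=lambda block: (erase_counts[block], block))


def needs_static_levelling(erase_counts: dict, threshold: int) -> bool:
    """Static wear levelling trigger: the gap between the most-worn and
    least-worn block has grown past a (hypothetical) threshold."""
    return max(erase_counts.values()) - min(erase_counts.values()) > threshold


erase_counts = {0: 120, 1: 15, 2: 118, 3: 14}
free_blocks = {0, 1, 2}             # block 3 holds cold data and is not free
print(pick_free_block(erase_counts, free_blocks))
print(needs_static_levelling(erase_counts, threshold=100))
```

Dynamic levelling alone can never touch block 3 here, because it only chooses among free blocks. The static check notices the growing gap and is the cue to move the cold data off block 3 so that its low erase count can be put to use.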

Write Amplification in Practice

A few concrete examples make write amplification tangible.

Sequential workload. Suppose you write 100 GB of data sequentially to a fresh drive. The FTL packs writes densely into fresh blocks. GC has no work to do because all the pages in any given block were written together and become invalid together (when they are all replaced by some future overwrite). WAF stays close to 1.0. Drive endurance looks great.

Random 4 KiB workload. Suppose you write 100 GB worth of 4 KiB random updates, hitting LBAs uniformly across a 1 TB drive. Every write lands in a fresh page, invalidating an old one. Invalid pages are scattered throughout every block. GC has to keep copying valid pages around to free up whole blocks. On a consumer drive with 7% overprovisioning, the effective WAF can climb to 5 or more. A drive whose NAND could absorb 600 TB of writes in total would then handle only 120 TB of actual host writes before wearing out.

Mixed workload. Real filesystems produce a mix: large sequential writes for media files, random writes for databases, small metadata updates for every operation. WAF depends heavily on the mix. Filesystems that try to be SSD-friendly batch writes, avoid small random updates, and maintain alignment with the drive's internal page boundaries. ext4 with barrier=1 and default mount options is typically fine. Database workloads with fsync-heavy commits are harder, which is why enterprise SSDs aimed at database work have high overprovisioning and fast random write paths.
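The gap between the sequential and random cases can be reproduced with a toy simulator. This is a sketch under stated assumptions, not a model of real firmware: a tiny drive of 128 blocks with 32 pages each, greedy victim selection, one LBA per page, and no wear levelling or SLC cache. The class and function names are invented for the example.

```python
import random


class GreedyGcSim:
    """Toy SSD: greedy garbage collection over a handful of tiny blocks."""

    def __init__(self, blocks=128, pages_per_block=32, op=0.07, reserve=2):
        self.pages_per_block = pages_per_block
        self.logical_pages = int(blocks * pages_per_block / (1 + op))
        self.valid = [set() for _ in range(blocks)]  # valid LBAs held by each block
        self.fill = [0] * blocks                     # pages programmed since last erase
        self.mapping = {}                            # LBA -> block with its current copy
        self.free = list(range(blocks))
        self.active = self.free.pop()
        self.reserve = reserve
        self.host_writes = 0
        self.nand_writes = 0

    def _place(self, lba):
        if self.fill[self.active] == self.pages_per_block:
            self.active = self.free.pop()            # open a fresh erased block
        self.valid[self.active].add(lba)
        self.fill[self.active] += 1
        self.mapping[lba] = self.active
        self.nand_writes += 1

    def _collect(self):
        full = [b for b in range(len(self.fill))
                if b != self.active and self.fill[b] == self.pages_per_block]
        victim = min(full, key=lambda b: len(self.valid[b]))  # fewest valid pages
        for lba in list(self.valid[victim]):
            self._place(lba)                         # relocation: an extra NAND write
        self.valid[victim].clear()
        self.fill[victim] = 0                        # erase the victim
        self.free.append(victim)

    def write(self, lba):
        while len(self.free) < self.reserve:
            self._collect()                          # foreground GC
        old = self.mapping.get(lba)
        if old is not None:
            self.valid[old].discard(lba)             # old copy becomes invalid
        self._place(lba)
        self.host_writes += 1

    @property
    def waf(self):
        return self.nand_writes / self.host_writes


def run(op, pattern, passes=4, seed=1):
    sim = GreedyGcSim(op=op)
    rng = random.Random(seed)
    n = sim.logical_pages
    for lba in range(n):                             # fill the drive once
        sim.write(lba)
    sim.host_writes = sim.nand_writes = 0            # measure steady state only
    for i in range(passes * n):
        sim.write(i % n if pattern == "sequential" else rng.randrange(n))
    return sim.waf


for op in (0.07, 0.28):
    print(f"OP {op:.0%}: sequential WAF {run(op, 'sequential'):.2f}, "
          f"random WAF {run(op, 'random'):.2f}")
```

In this model sequential overwrites leave whole blocks invalid, so GC erases them without copying anything and WAF stays at 1. Random overwrites force relocations, and the cost drops sharply when the spare fraction is raised from 7% to 28%. The absolute numbers depend on the toy geometry and should not be read as predictions for a real drive.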

The Queue Depth Multiplier

One of the biggest performance tricks in SSDs is parallelism. Modern drives contain many NAND dies (8 to 64 depending on capacity), connected by multiple channels to the controller. Each channel can operate independently: a 4-channel drive can read from four different dies simultaneously. A 16-channel enterprise drive can have 16 reads or writes in flight at once.

This is why queue depth matters so much on SSDs. A single synchronous read at a time leaves most of the drive sitting idle; the controller can only keep one channel busy. A workload with queue depth 32 keeps all channels active and pipelines commands across dies, dramatically increasing throughput. The difference between QD=1 and QD=32 on a modern NVMe drive can be a 10x speedup for random reads.
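The effect of queue depth follows from Little's law: throughput equals commands in flight divided by the time each one takes. The sketch below applies that with a cap on useful concurrency. The 80 microsecond read latency and the 16 parallel units are assumptions for illustration, and the model ignores controller and bus limits.

```python
def random_read_iops(queue_depth: int, latency_s: float, parallel_units: int) -> float:
    """Little's law: throughput = commands in flight / time per command.

    Commands beyond the number of independent units just wait in line,
    so useful concurrency is capped at parallel_units.
    """
    in_flight = min(queue_depth, parallel_units)
    return in_flight / latency_s


LATENCY = 80e-6      # assumption: 80 microseconds per 4 KiB NAND read
UNITS = 16           # assumption: 16 dies or planes that can work independently
for qd in (1, 4, 32):
    print(f"QD={qd:2d}: {random_read_iops(qd, LATENCY, UNITS):,.0f} IOPS")
```

With these assumed figures the gain from QD=1 to QD=32 is 16x, limited by the number of parallel units rather than by the queue itself.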

NVMe makes this parallelism easy to expose. The NVMe command queue architecture supports up to 65,535 I/O queues, each with up to 65,535 outstanding commands. A multi-threaded application can fan out work to keep dozens of commands in flight simultaneously. The old SATA AHCI interface had a single 32-entry queue (Native Command Queuing), which was perfectly adequate for mechanical disks but hopelessly bottlenecked for SSDs. Moving from SATA SSDs to NVMe SSDs gave real-world performance jumps of 3 to 5x on workloads that could exploit the deeper queue.

The catch is that queue depth only helps if the application can supply it. A single thread doing synchronous reads one at a time will see SATA-level performance even on the fastest NVMe drive. Libraries like libaio, io_uring (the modern favourite on Linux), and Windows IOCP exist specifically to let applications pump commands into the queue without blocking.

The SLC Cache Trick

Consumer TLC and QLC drives use a trick to make their advertised peak speeds look much better than sustained performance: they cache writes in a region of the NAND operating in SLC mode.

Here is how it works. Any NAND cell can be programmed as SLC if you only use two of its possible states instead of eight. An SLC-mode cell is faster to program, faster to read, and much more reliable. The drive reserves some fraction of its flash (often 10 to 50 GB on a 1 TB consumer drive) to operate in SLC mode. When the host writes data, the controller directs it first to the SLC cache. Later, during idle time, the controller folds the SLC-cached data down into the slower TLC/QLC storage, freeing up the SLC cache for new writes.

As long as your workload stays small enough to fit in the SLC cache, write speeds look spectacular. The first few GB of a large file copy land in SLC and fly. Eventually the cache fills up. From that point, writes go directly to TLC/QLC, and speeds collapse to the native speed of the slower flash. The classic benchmark pattern is a fast-then-cliff write curve: 3 GB/s for the first 20 seconds, then sustained 500 MB/s until the transfer is done.

This is why consumer SSDs advertise "up to 7,000 MB/s" but you see 800 MB/s when copying a 200 GB file. The headline number is the SLC cache speed; the real number is the native TLC/QLC speed once the cache is saturated.
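The fast-then-cliff pattern is easy to put numbers on. The sketch below uses the 3 GB/s and 500 MB/s figures from the benchmark pattern described earlier; the 60 GB cache size is an assumption (3 GB/s for 20 seconds), and the model ignores any folding the drive does while the copy is still running.

```python
def copy_time_s(size_gb: float, cache_gb: float,
                cache_speed: float, native_speed: float) -> float:
    """Seconds to write size_gb when the first cache_gb go at cache speed.

    Speeds are in GB/s.
    """
    fast = min(size_gb, cache_gb)
    slow = size_gb - fast
    return fast / cache_speed + slow / native_speed


SIZE, CACHE = 200.0, 60.0          # GB; the 60 GB cache is an assumption
total = copy_time_s(SIZE, CACHE, cache_speed=3.0, native_speed=0.5)
print(f"{total:.0f} s total, {SIZE / total * 1000:.0f} MB/s average")
```

For a transfer several times larger than the cache, the average speed is dominated by the native flash speed, not the headline number.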

SLC caches are dynamic on most drives. When the drive has lots of free space, the cache can be large (tens of gigabytes). When the drive fills up, the cache shrinks to leave room for user data. A full drive may have only a few hundred megabytes of SLC cache, which is why an SSD that is 95% full feels noticeably slower than one that is 50% full.

TRIM and the Alignment Problem

TRIM is how the filesystem tells the SSD that a range of LBAs is no longer in use. Without TRIM, the drive has no way to know which of its logical blocks still matter to the OS. The data of every deleted file stays "valid" in the FTL's eyes until the filesystem happens to overwrite those LBAs, so GC keeps copying it around, wasting spare space and increasing WAF.

The ATA TRIM command (and its NVMe equivalent, Dataset Management with the Deallocate attribute) takes a list of LBA ranges and marks them as deallocated. The FTL immediately invalidates the corresponding mapping entries. Any physical pages that were mapped by those LBAs become garbage, available for GC. This dramatically reduces WAF under heavy workloads.

Modern filesystems issue TRIM in one of two ways. Continuous TRIM (discard mount option on Linux) issues a TRIM every time a file is deleted. Batch TRIM (fstrim) walks the filesystem periodically and TRIMs all currently-unused blocks in one big operation. Continuous is simpler but adds latency to every delete. Batch amortises the cost but leaves stale data visible to the FTL between runs. Many Linux distros default to a weekly fstrim.timer.

Alignment matters because NAND is organised in pages, and a 4 KiB random write that crosses a 16 KiB page boundary causes the FTL to perform two page writes instead of one. The first partition on modern GPT disks starts at LBA 2048 (1 MiB offset), which is aligned to any reasonable page size. Filesystems with 4 KiB blocks then map neatly onto the drive's internal layout. If you partition a drive by hand and pick an odd starting LBA, you can introduce misalignment that silently hurts every write. The fix is to always use parted or gdisk with their default alignment; they get it right for you.
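The cost of a misaligned partition can be counted directly. The sketch below assumes 512-byte sectors and 16 KiB NAND pages, and compares the 1 MiB default with a start at sector 63, used here as a hypothetical hand-picked value.

```python
def pages_touched(offset_bytes: int, length_bytes: int,
                  page_bytes: int = 16 * 1024) -> int:
    """Number of NAND pages a write of length_bytes at offset_bytes overlaps."""
    first = offset_bytes // page_bytes
    last = (offset_bytes + length_bytes - 1) // page_bytes
    return last - first + 1


SECTOR = 512
aligned_start = 2048 * SECTOR        # the 1 MiB default from the text
odd_start = 63 * SECTOR              # hypothetical hand-picked start
for start in (aligned_start, odd_start):
    # Count 4 KiB filesystem blocks, out of the first 1024, that straddle two pages.
    split = sum(pages_touched(start + i * 4096, 4096) > 1 for i in range(1024))
    print(f"partition at byte {start}: {split} of 1024 blocks span two pages")
```

With the aligned start no 4 KiB block ever crosses a page boundary. With the odd start, every fourth block does, and each of those writes costs two page programs instead of one.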

Deliberate Overprovisioning on Linux

You can create extra overprovisioning on Linux by partitioning less than the full capacity. For an NVMe drive, the workflow looks like this:

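# Check current size
$ sudo nvme id-ctrl /dev/nvme0 | grep tnvmcap
tnvmcap   : 1024209543168   # 1 TB

# Create a partition covering roughly 80% of capacity
# (mklabel wipes the existing partition table)
$ sudo parted /dev/nvme0n1 --script mklabel gpt
$ sudo parted /dev/nvme0n1 --script mkpart primary 1MiB 800GB

# TRIM the new partition, then the unpartitioned tail.
# Both commands discard data; run them before creating a filesystem.
# Offset and length are in bytes: 800 GB + 224 GB stays within tnvmcap.
$ sudo blkdiscard -f /dev/nvme0n1p1
$ sudo blkdiscard -o 800000000000 -l 224000000000 /dev/nvme0n1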

The blkdiscard calls explicitly TRIM the unused region, which is critical. Without TRIM, the drive does not know those LBAs are unused and cannot treat the space as spare. With TRIM, those LBAs are marked invalid in the FTL and the physical pages they used to occupy become available for GC as additional spare capacity.

NVMe also supports namespace resizing on drives that expose the feature. nvme create-ns and nvme delete-ns let you carve up a physical drive into multiple logical drives, and you can leave some capacity unallocated to any namespace, giving the FTL extra headroom to work with. This is more common on enterprise drives than consumer ones, but some prosumer NVMe drives support it.

Read Disturb and Data Retention

NAND flash has two subtler failure modes that show up in long-lived drives.

Read disturb. Every time you read a NAND page, the voltages applied to the surrounding cells slightly shift their stored charges. Repeated reads of the same page gradually disturb the nearby pages until their bit errors become uncorrectable. Modern drives monitor read counts per block and rewrite a block when its read count exceeds a threshold, a process called read-disturb management. This background activity is invisible to the host but shows up in the drive's internal WAF statistics.

Retention drift. Data stored on a flash cell is held by electrons on the floating gate. Over time, thermal noise causes some of these electrons to leak out, shifting the cell's threshold voltage toward the erased state. At room temperature, a freshly programmed SSD holds data for years. At high temperatures, the leakage accelerates exponentially. The JEDEC specification for enterprise SSD retention is 3 months at 40°C when the drive is at end-of-life. Consumer SSDs are rated for 1 year at 30°C.

This is why storing an SSD in a hot car for a summer in Athens is not a great backup strategy. The drive does not fail catastrophically, but you may start finding bit errors on reads after a few months unpowered. A drive that is powered on corrects this automatically: the firmware notices cells that are drifting, reads them while the ECC can still decode them, and rewrites them to fresh locations. A drive left unpowered cannot do this and simply accumulates errors.

ECC: The Quiet Hero

Every SSD stores error correction parity alongside every NAND page. Without ECC, modern flash would be unusable: raw bit error rates on TLC at end-of-life are around 1e-3 to 1e-2, meaning somewhere between one bit in a thousand and one bit in a hundred is wrong. After ECC, the uncorrectable bit error rate drops to 1e-15 or better, which matches enterprise-grade reliability.

The ECC codes used on modern SSDs are LDPC (Low Density Parity Check) codes, the same family used in 5G wireless and modern disk drives. LDPC can correct hundreds of bit errors per 4 KiB page at the cost of a few hundred bytes of parity per page. The decoder is iterative: it makes a first pass through the error-correction equations, updates its belief about each bit, and repeats until the result is consistent or a maximum iteration count is reached.
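To see why the decoder needs to handle hundreds of errors, multiply the raw bit error rate by the number of bits in a 4 KiB codeword. The function name is illustrative, and the calculation gives only the mean; real error counts vary from page to page.

```python
def expected_raw_errors(page_bytes: int, raw_ber: float) -> float:
    """Mean number of flipped bits in one codeword before ECC."""
    return page_bytes * 8 * raw_ber


for ber in (1e-3, 1e-2):
    print(f"RBER {ber:g}: about {expected_raw_errors(4096, ber):.0f} "
          f"bad bits per 4 KiB")
```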

As cells wear out, the raw error rate climbs, and the decoder has to do more iterations to converge. You can see this as increasing read latency on old drives: the flash is still technically readable, but the ECC is spending more time correcting errors. When the raw error rate exceeds what LDPC can handle, the block is marked bad and retired. Drives keep a pool of spare blocks for exactly this purpose. The drive reports its remaining spare count through SMART attributes, which is how you know when a drive is near the end of its life.

You can see the current state with smartctl:

$ sudo smartctl -a /dev/nvme0n1
...
Percentage Used:                    12%
Data Units Read:                    2,304,192 [1.17 TB]
Data Units Written:                 3,897,344 [1.99 TB]
Host Read Commands:                 18,234,991
Host Write Commands:                24,101,822
Controller Busy Time:               482
Power Cycles:                       214
Power On Hours:                     4,872
Unsafe Shutdowns:                   18
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
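The bracketed terabyte figures in this output come from the "Data Units" counters, which NVMe reports in thousands of 512-byte blocks. A short sketch of the conversion (the function name is made up for the example):

```python
DATA_UNIT_BYTES = 512 * 1000   # NVMe counts data units in thousands of 512-byte blocks


def data_units_to_tb(units: int) -> float:
    """Convert an NVMe 'Data Units' counter to decimal terabytes."""
    return units * DATA_UNIT_BYTES / 10 ** 12


print(f"read:    {data_units_to_tb(2_304_192):.3f} TB")
print(f"written: {data_units_to_tb(3_897_344):.3f} TB")
```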
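
"Percentage Used" is the drive's estimate of how much of its rated endurance has been consumed, based on NAND wear rather than host writes alone. 12% after only 2 TB of host writes is more than a rating of several hundred terabytes written would predict for a 1 TB drive, which points to heavy write amplification, so the trend is worth watching. A drive approaching 100% is running out of wear budget.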

Temperature and Throttling

SSD controllers are fast, and fast chips get hot. A typical consumer NVMe drive at sustained write throughput draws 5 to 8 watts, which is a lot of heat for a small M.2 module. If the controller temperature rises above about 70°C, the drive starts thermal throttling: it deliberately slows down writes to cool off. Above 80°C, it may throttle heavily. Above 90°C, it may refuse writes entirely to protect itself.

You can watch this with nvme smart-log during a heavy write. The temperature climbs for the first few seconds, throttling kicks in, and write throughput drops to perhaps half of the unthrottled rate. This is one reason NVMe drives benefit from heatsinks: keeping the controller below the thermal throttling threshold can yield 2x sustained write performance on drives that otherwise spend most of their time throttled.

Laptop NVMe drives are particularly prone to thermal throttling because there is nowhere for the heat to go. A laptop in Madrid running a big file copy on a summer afternoon may hit the throttle threshold within seconds. Server NVMe drives in a well-cooled chassis rarely throttle at all. The difference shows up plainly in sustained workload benchmarks.

Temperature also affects retention. A cell programmed at 70°C and read at 25°C is slightly more reliable than one programmed and read at 25°C, because the higher temperature at programming time helps electrons distribute more evenly on the floating gate. But high operating temperatures accelerate retention loss for already-programmed cells. The practical sweet spot is a drive that runs warm but never hot: 40 to 60°C is ideal.

The Secure Erase Mystery

When you want to wipe an SSD, the right command is nvme format --ses=2 (cryptographic erase) or nvme format --ses=1 (user data erase). On drives that implement it, the cryptographic erase completes in under a second. How can it possibly wipe hundreds of gigabytes that fast?

It does not erase the data. It throws away the key that can read it.

Most modern SSDs encrypt all data with an internal key before writing it to NAND. The encryption is transparent: the drive generates a Media Encryption Key at first power-on, stores it in protected firmware storage, and uses it to encrypt every write and decrypt every read. The host does not see this and does not need to know about it. The purpose is not to protect against host-level attacks (the key is readable by any firmware-level attacker) but to enable instant secure erase.

When you issue a cryptographic secure erase, the drive generates a new Media Encryption Key and discards the old one. The ciphertext on the NAND is now undecryptable. The operation takes milliseconds. Subsequent reads return meaningless data (because the new key decrypts the old ciphertext incorrectly). Eventually garbage collection overwrites the old ciphertext with new data encrypted with the new key, at which point even a perfect attacker with access to the drive's internals cannot recover anything.

This is why the fast "secure erase" on an NVMe drive is genuinely secure. It is also why ATA Secure Erase on modern SATA SSDs is equally fast and equally trustworthy. The old pattern of "overwrite the drive three times with zeros" is not only unnecessary, it actively increases wear without increasing security.

Power Loss Protection

Data in flight during a power loss is at risk. Any write buffered in DRAM that has not yet been flushed to NAND is gone. Any NAND programming operation that was in progress when power dropped may have produced a corrupted page. Any metadata update that was mid-flight can leave the FTL's mapping table in an inconsistent state.

Enterprise SSDs solve this with Power Loss Protection (PLP): a bank of capacitors on the drive's PCB that holds enough energy to flush the DRAM cache and complete any in-progress writes when external power fails. The drive's firmware detects the power loss (via an on-board voltage monitor), stops accepting new commands, and uses the capacitor energy to finish its outstanding work. When power returns, the drive comes back up in a clean state.

Consumer SSDs usually do not have PLP. In their case, the drive relies on journaling inside the FTL: before committing a new mapping entry, the drive writes a log entry to NAND. On power loss, the log is replayed on next power-up, rebuilding consistent FTL state. Data in DRAM that had not made it to NAND is lost. A database that did an fsync right before the crash might find the last transaction missing, even though the OS saw a success return from the fsync.
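The journal replay described above can be sketched in a few lines. Real FTL firmware is proprietary, so this is a simplified model under stated assumptions: the checkpoint layout, the (sequence, lba, physical_page) entry format, and the name replay_journal are all hypothetical.

```python
def replay_journal(checkpoint, journal):
    """Rebuild the logical-to-physical map after an unclean shutdown.

    checkpoint: {"seq": last sequence number covered, "mapping": {lba: page}}
    journal:    iterable of (seq, lba, physical_page) entries found in NAND
    Entries already covered by the checkpoint are skipped.
    """
    mapping = dict(checkpoint["mapping"])
    last_seq = checkpoint["seq"]
    for seq, lba, page in sorted(journal):
        if seq <= last_seq:
            continue  # already reflected in the checkpoint
        mapping[lba] = page
        last_seq = seq
    return mapping, last_seq


checkpoint = {"seq": 2, "mapping": {0: 10, 1: 11}}
journal = [(3, 0, 20), (1, 1, 5), (4, 2, 21)]
mapping, last_seq = replay_journal(checkpoint, journal)
print(mapping, last_seq)
```

Note what the replay cannot do: it only restores mappings for pages that actually reached NAND. A write that was still sitting in DRAM has no journal entry and simply never happened as far as the recovered drive is concerned.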

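This is why consumer SSDs have a reputation for lying about fsync. An fsync makes the kernel send the drive a flush command, and a well-behaved drive completes that command only once its DRAM cache has reached NAND. A drive that acknowledges the flush as soon as the data is in DRAM returns early: if power is lost between the DRAM write and the NAND flush, the database thinks the data is safe but it is not. Enterprise SSDs with PLP can acknowledge from DRAM and still genuinely persist the data, because the capacitors guarantee the flush completes. This single difference is a large part of why enterprise SSDs cost several times as much as consumer ones for databases that care about durability.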
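For reference, this is roughly what "doing an fsync" looks like from the application side. It is a minimal sketch of the common write-then-rename pattern on a POSIX system; durable_write is a hypothetical helper name, and the directory fsync will not work on Windows.

```python
import os
import tempfile


def durable_write(path, data):
    """Replace the file at path with data, flushing as far as the OS can promise."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # Python buffer -> kernel page cache
            os.fsync(tmp.fileno())  # kernel page cache -> drive (flush command)
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Everything in this function is a request to the layers below. If the drive acknowledges the flush before the data is in NAND, the code above returns successfully and the data can still be lost on power failure.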

PLP Capacitors and Enterprise Reliability

The difference between a consumer SSD and an enterprise SSD is often visible on the PCB. Enterprise SSDs have a row of small yellow or black capacitors near the controller: tantalum polymer capacitors sized to store enough energy to complete in-flight writes on power loss. Each one is rated at roughly 100 to 400 microfarads and 6.3 to 10 volts. A drive uses 4 to 20 of them depending on DRAM cache size and target hold-up time.

When power fails, the drive's supervisor IC detects the drop on the main supply, isolates the drive from the failing host rail, and switches it over to capacitor power. A typical enterprise PLP implementation holds up the drive for 10 to 40 milliseconds, which is enough time to flush the DRAM cache, complete any NAND programs that were in progress, and write a checkpoint so the FTL can restart cleanly.
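The hold-up time follows from the energy a capacitor bank releases as it discharges from its charged voltage down to the lowest voltage the regulators can still use: E = ½ · C · (V_start² − V_min²). The sketch below works that out. The 9 V charge voltage, 3 V cut-off, 5 W load, and 85% converter efficiency are illustrative assumptions, not figures from any datasheet, and holdup_time_ms is a hypothetical helper.

```python
def holdup_time_ms(cap_count, cap_farads, v_start, v_min, load_watts, efficiency=0.85):
    """Estimate how long a capacitor bank can power the drive, in milliseconds.

    Usable energy is the difference between the energy stored at v_start and
    the energy left at v_min, scaled by the regulator efficiency.
    """
    total_capacitance = cap_count * cap_farads
    usable_joules = 0.5 * total_capacitance * (v_start**2 - v_min**2) * efficiency
    return usable_joules / load_watts * 1000.0


# Ten 330 microfarad capacitors, discharged from 9 V to 3 V into a 5 W load.
t = holdup_time_ms(cap_count=10, cap_farads=330e-6, v_start=9.0, v_min=3.0, load_watts=5.0)
print(f"{t:.1f} ms")  # about 20 ms under these assumptions
```

Because the energy scales with the square of the voltage, doubling the charge voltage buys far more hold-up time than doubling the capacitor count.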

Consumer drives simply do not have this hardware. When power drops, the drive's DRAM loses its contents, any ongoing NAND program operations may leave half-written pages, and the FTL must rely on its journal (a log stored in NAND) to recover consistency on next boot. The journal protects the FTL itself, but it does not protect data that was in the DRAM cache and had not yet been committed to NAND.

Some prosumer drives split the difference with small capacitor banks that can hold up a minimal amount of in-flight metadata but not full data buffers. These are sometimes marketed as "power loss protection for FTL state", which is accurate but often misread as "power loss protection for user data". The distinction matters for any workload that cannot tolerate losing the last few seconds of writes.

Sustained Performance and the Steady State

Benchmarking SSDs is trickier than benchmarking HDDs because of the state-dependent behaviour. A fresh SSD (one that has just been securely erased) has the entire NAND as free space, a huge SLC cache, and no GC pressure. A half-full SSD that has been written and deleted many times is in "steady state" with constant GC activity, a small SLC cache, and much lower sustained write performance.

The right way to benchmark an SSD is to fill it, overwrite it randomly until GC is running continuously, and then measure performance over a long enough window (several minutes) to capture the full range of GC behaviour. The fill-and-overwrite step is called preconditioning, and fio can run it directly.

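# WARNING: both runs write to the raw device and destroy all data on it.
# Precondition: one job, two loops, so the device is written end to end twice.
$ fio --name=precondition --filename=/dev/nvme0n1 \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=1 --size=100% --loops=2 --direct=1 --group_reporting

# Measure: ten minutes of 4k random writes against the preconditioned drive.
$ fio --name=sustained --filename=/dev/nvme0n1 \
    --rw=randwrite --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --size=64g --runtime=600 --time_based \
    --direct=1 --group_reporting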

The preconditioning step writes the whole drive twice, ensuring the SLC cache is exhausted and GC is active. The real benchmark then measures steady state. Numbers from this kind of benchmark are much lower than vendor spec sheets but much closer to what you will see in production.

Server workload sizing charts published by SSD vendors typically include steady-state random IOPS alongside peak IOPS for this reason. The peak is for marketing. The steady state is for planning.

Zoned Namespaces

A recent development in the NVMe world is Zoned Namespaces (ZNS). A ZNS drive exposes its capacity as a set of zones, each of which can only be written sequentially. You can write to zone 0 only by appending to it. Random writes within a zone are not allowed. To reclaim a zone, you have to reset it as a whole, erasing all its data.

This sounds restrictive, but it matches the underlying flash behaviour exactly. NAND wants sequential writes to whole blocks. Traditional FTLs lie to the host by accepting random writes and rewriting them sequentially behind the scenes, at the cost of GC, WAF, and spare area. ZNS moves that responsibility up to the host: the filesystem or application is in charge of sequential writes, and the drive can be much simpler, cheaper, and more efficient.
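The zone rules are simple enough to model directly. The class below is a toy illustration of the append-only contract, not the NVMe ZNS command set; Zone, append, write_at, and reset are hypothetical names chosen for the example.

```python
class Zone:
    """Toy model of a zone: writes land only at the write pointer."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.write_pointer = 0
        self.blocks = []

    def append(self, data):
        """Write at the current write pointer and return the offset used."""
        if self.write_pointer >= self.capacity:
            raise IOError("zone is full; reset it before writing again")
        self.blocks.append(data)
        offset = self.write_pointer
        self.write_pointer += 1
        return offset

    def write_at(self, offset, data):
        """Reject any write that is not exactly at the write pointer."""
        if offset != self.write_pointer:
            raise ValueError(
                f"random write at {offset} rejected; write pointer is {self.write_pointer}"
            )
        return self.append(data)

    def reset(self):
        """Reclaim the whole zone at once, discarding all of its data."""
        self.blocks.clear()
        self.write_pointer = 0


zone = Zone(capacity_blocks=4)
zone.append(b"a")
zone.write_at(1, b"b")   # allowed: offset equals the write pointer
try:
    zone.write_at(0, b"x")
except ValueError as err:
    print(err)
```

There is no overwrite path at all, so there is never a stale page inside a live zone for the drive to garbage-collect. Deciding when a zone's contents are dead and can be reset becomes the host's job.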

Linux has a zoned block device layer (/sys/class/block/.../queue/zoned) and filesystems like f2fs and btrfs have ZNS support. The usage is still niche, concentrated in hyperscaler storage, but the write amplification numbers are dramatic: a well-tuned ZNS setup can hit WAF close to 1.0 on workloads that would produce WAF of 3 or 4 on a conventional SSD. For very large deployments in European data centres, that difference translates into meaningful endurance savings and lower total cost per TB written.
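To see why the WAF difference matters for endurance, it helps to run the arithmetic. The sketch assumes the usual definition of WAF as NAND bytes written divided by host bytes written; the 3,000 TB NAND budget is a made-up figure and both function names are hypothetical.

```python
def write_amplification(nand_bytes_written, host_bytes_written):
    """WAF: how many bytes the NAND absorbs per byte the host writes."""
    return nand_bytes_written / host_bytes_written


def host_writes_supported_tb(nand_endurance_tb, waf):
    """Host-visible writes a drive can take before its NAND budget is spent."""
    return nand_endurance_tb / waf


nand_budget_tb = 3000  # illustrative total the flash can absorb
for waf in (3.0, 1.1):
    supported = host_writes_supported_tb(nand_budget_tb, waf)
    print(f"WAF {waf}: about {supported:.0f} TB of host writes")
```

Under these assumptions the same flash absorbs roughly 2.7 times as much host data at a WAF of 1.1 as it does at a WAF of 3.0.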

Why SSDs Are Not Just Fast HDDs

Understanding the above changes how you think about storage in a few specific ways.

First, bandwidth is cheaper than latency on an SSD. A single 1 MiB sequential read is often faster per byte than thirty-two 32 KiB random reads, because the random reads make 32 round-trips through the command queue while the sequential read makes one. For HDDs this was even more true, but the ratio on SSDs is still large. Filesystems and databases that batch operations see big wins.

Second, random writes are much more expensive than random reads. A random read is one mapping lookup plus one NAND page read. A random write is one mapping update plus one page write, plus eventual GC overhead. Once write amplification is counted, random writes can cost 5 to 10 times more per byte than reads.

Third, overprovisioning is the cheapest way to buy performance. A consumer drive used at 90% fill with a demanding workload can be noticeably faster if you leave 20% of its space unallocated and TRIM the unused region. You are trading capacity for headroom that reduces WAF and increases SLC cache size.

Fourth, fsync is a sharp tool. On a consumer SSD that acknowledges flushes from its DRAM cache, fsync can return before data is in NAND. On enterprise SSDs with PLP, fsync is honest. The consequences for database durability are real, and the price difference reflects it. Never trust a consumer drive for a workload that needs strict durability under power loss.

Fifth, endurance is a budget, not a cliff. A 600 TBW consumer drive does not die at 601 TB. It just starts accumulating errors that the ECC struggles to correct, spare blocks run low, and eventually the drive transitions to read-only mode. You have plenty of warning if you monitor SMART. You have very little warning if you do not.
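Treating endurance as a budget means you can estimate how long it lasts. A back-of-the-envelope sketch, assuming the TBW rating counts host writes in decimal terabytes and that the daily write volume stays constant (years_until_tbw is a hypothetical helper):

```python
def years_until_tbw(tbw_tb, host_gb_per_day):
    """Years of service before host writes reach the rated TBW figure."""
    days = tbw_tb * 1000 / host_gb_per_day  # 1 TB = 1000 GB
    return days / 365


# A 600 TBW drive absorbing 100 GB of host writes per day.
print(f"{years_until_tbw(600, 100):.1f} years")  # about 16.4 years
```

The same drive under 2 TB of writes per day reaches its rating in under a year, which is why the number is worth checking against the real workload rather than against the warranty period.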

NVMe over Fabrics

The NVMe command set is not tied to PCIe. It can run over any transport that supports reliable message delivery: Fibre Channel (NVMe-FC), RDMA over Ethernet or InfiniBand (NVMe-RDMA), and plain TCP (NVMe-TCP). This is called NVMe over Fabrics (NVMe-oF), and it is how modern data centres share SSD capacity across many servers.

A server in Stockholm with NVMe-TCP enabled can export its local SSD as a target. Clients across the network connect to the target and see it as a local block device. End-to-end latency is roughly 50 to 200 microseconds depending on the transport. That is higher than a directly attached PCIe NVMe drive, where the protocol overhead is under 10 microseconds on top of the flash read itself, but dramatically lower than iSCSI over similar hardware. The result is a cluster of machines that can share a pool of SSDs as if each one had them locally attached.

NVMe-oF has been adopted heavily in Kubernetes storage systems like Longhorn, Rook, and Lightbits, as well as commercial storage arrays from Pure, NetApp, and Dell. It is the protocol layer that makes modern software-defined storage for European cloud providers (Hetzner, Scaleway, OVH) possible at SSD speeds.

What Your Drive Tells You

SMART attributes on an NVMe drive give you a surprising amount of insight once you know what to look for. A few worth watching:

  • percentage_used: the drive's own estimate of consumed endurance.
  • data_units_written: total host writes, counted in units of 1,000 512-byte blocks. Multiply by 512,000 to get bytes, then divide by 10^12 to get TB written.
  • media_errors: uncorrectable read errors. Non-zero is a problem.
  • available_spare: remaining spare capacity as a percentage of the starting value.
  • temperature: the current drive temperature. High temperatures accelerate retention drift and hurt write performance.
  • unsafe_shutdowns: number of times the drive lost power while in use. High counts on a consumer drive without PLP are a flag that data integrity may have been compromised.
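The data_units_written conversion is easy to get wrong by a factor of a thousand, so here it is spelled out. NVMe reports this counter in units of 1,000 512-byte blocks; the helper name is hypothetical.

```python
BYTES_PER_DATA_UNIT = 1000 * 512  # one NVMe data unit = 512,000 bytes


def data_units_to_tb(data_units):
    """Convert the NVMe data_units_written counter to decimal terabytes."""
    return data_units * BYTES_PER_DATA_UNIT / 1e12


print(data_units_to_tb(10_000_000))  # 5.12 TB written
```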

Setting a simple alerting rule that fires when percentage_used > 80% or available_spare < 10% catches most wear-related failures before they turn into data loss. smartctl can be scripted into Prometheus exporters, systemd timers, Monit rules, or your normal monitoring stack.
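A rule like this might look as follows once the attributes have been parsed into a dictionary. This is a sketch: it assumes either threshold on its own should raise an alert, it adds the non-zero media_errors check from the list above, and the dictionary layout and the name wear_alerts are hypothetical rather than the output format of any particular tool.

```python
def wear_alerts(smart, used_limit=80, spare_limit=10):
    """Return a list of human-readable alerts for one drive's SMART values."""
    alerts = []
    if smart["percentage_used"] > used_limit:
        alerts.append(f"percentage_used is {smart['percentage_used']}%")
    if smart["available_spare"] < spare_limit:
        alerts.append(f"available_spare is {smart['available_spare']}%")
    if smart["media_errors"] > 0:
        alerts.append(f"{smart['media_errors']} media errors reported")
    return alerts


healthy = {"percentage_used": 12, "available_spare": 100, "media_errors": 0}
worn = {"percentage_used": 91, "available_spare": 8, "media_errors": 3}
print(wear_alerts(healthy))  # []
print(wear_alerts(worn))
```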

A Note on Brand-Specific Behaviour

Different vendors have noticeably different FTL philosophies. Samsung's consumer drives tend to be aggressive with SLC caching, producing excellent peak numbers and good sustained performance on workloads that fit the cache. Crucial and Kingston lean toward stable sustained performance with smaller SLC caches. Enterprise product lines such as Micron's MAX series and Intel's Optane line (before it was discontinued) ran with large static overprovisioning and predictable latency across the working range.

Firmware updates can change these characteristics mid-life. It is not unusual for a drive to receive a firmware update that reduces peak cache size but improves sustained performance, because the vendor learned something about how the hardware behaves in the field. Tracking vendor update notes for drives running in your infrastructure is worth the effort on any serious deployment.

There is also a category of drives that will not identify their underlying flash vendor or generation at all. Budget drives assembled by smaller brands often source whatever flash is cheapest at any given moment, meaning two drives bought a month apart with the same model number can have completely different flash internals and completely different behaviour. If you need predictable performance or endurance, sticking to a tier-one vendor's named models is the only safe play.

The Short Version

An SSD is not a silicon hard drive. It is a small computer with a translation layer that hides an underlying medium very different from what the host expects. Every write is remapped to a fresh page. Every deletion is hidden behind TRIM. Every overwrite causes invisible GC work later. The drive's performance is state-dependent in ways HDDs never were, with SLC caches, dynamic wear levelling, and write amplification all interacting. Its endurance is a budget that drains with every host write and with every GC-induced internal write.

Understanding the FTL, pages vs blocks, GC, SLC caching, and PLP is what lets you predict how a specific drive will behave under a specific workload. The lab for this article gives you a visual NAND grid to experiment with: watch writes land in fresh pages, invalid pages accumulate, GC kick in, and WAF climb as overprovisioning shrinks. It is the fastest way to internalise how different the inside of a modern SSD looks from the interface the kernel sees.