How Filesystems Survive A Power Cut

Yank the power cord out of a laptop during a file save in a Frankfurt office and reboot it. Usually nothing bad happens. The laptop boots, the desktop appears, the document you were saving is either the old version or the new one, the filesystem mounts cleanly, and you move on with your day. This is such a routine non-event that it is easy to forget it used to be the other way around. A power cut in 1998 on a FAT32 or ext2 partition meant a long fsck and often a cross-linked file, a lost cluster, or a completely unreadable directory. Modern filesystems mostly do not have those failure modes any more, because they were built on top of decades of research into how to write data durably while a computer is allowed to die at any moment.

The transition from "unsafe" to "safe" is not a single idea. It is half a dozen closely related mechanisms stacked on top of each other: journaling, log-structured layouts, copy-on-write, write barriers, the fsync syscall and its interaction with the block layer, the semantics of durability promises from the storage device itself, and the application-level dance of rename-over-write that databases and text editors have used for decades. Each one solves a specific failure mode. Together they add up to the invariant that, after a crash, the filesystem and any application that followed the rules will be in a consistent state, and no promised write will be lost.

This article walks through every piece of that story. The goal is to build a concrete mental model of what "durable" actually means on Linux filesystems, why a naïve application can lose data even on a perfectly well-behaved ext4 system, and why the 2018 PostgreSQL fsync scandal was genuinely important instead of just academic nitpicking. By the end it should be clear why power-cut safety is a whole-stack property, and why breaking any layer in the stack leaves you with unsafe storage no matter how good the others are.

What Can Go Wrong When The Power Drops

Before looking at the solutions, it is worth being precise about what can actually go wrong. Unplugging a computer mid-write is not a single failure mode. It is a combination of several independent ones, and a filesystem has to handle all of them to be safe.

The most basic failure is a torn write: the storage device is in the middle of writing a single sector and only part of it lands on the platter (or the NAND page) before the power drops. A 4 KiB sector might end up with 2 KiB of the new data and 2 KiB of the old, and the sector is now internally inconsistent. Most modern drives avoid this at the sector level by using the drive's own write cache and small reservoir of charge (or, on SSDs, a super-capacitor and a fast-dump protocol), so that any write the drive has started is either completed from the reservoir or never visible at all. But that guarantee only applies to single sectors. A write that spans multiple sectors can be torn between sectors even on a perfectly honest drive.

The next failure is metadata inconsistency. A filesystem operation is almost never a single write. Creating a file means writing the inode (which describes the file), updating the directory block (which names it), updating the inode bitmap (to mark the inode in use), updating the block bitmap (if data blocks were allocated), and updating the superblock (for space accounting). If the power drops after some but not all of those writes have committed, the filesystem is in an inconsistent state: an inode that exists but is not referenced from any directory (or, worse, a directory that references an inode that has not been written yet).

The third is data-metadata mismatch. Even if the metadata is consistent after recovery, the data blocks pointed to by a file's inode may contain stale contents if they were allocated from the free pool but not yet overwritten with the new data at the time of the crash. Some applications (text editors saving a config file) can cope with the old contents, but some (databases writing transaction logs) cannot, and the mismatch becomes silent corruption.

And then there is the reordering problem. The storage device, its own write cache, the SCSI/AHCI/NVMe layer, the block layer inside the kernel, and the filesystem itself can all reorder writes in the pursuit of performance. Write A might be issued before write B by the application, but the device might commit B first, because B went to a block that was already pending in the cache. If the power drops between the two commits, the on-disk state contains B but not A, which is an ordering nobody anywhere in the stack intended. Ordering has to be explicitly enforced, or the recovery code has to tolerate any arbitrary subset of in-flight writes having landed.
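To get a feel for how many ways a single operation can be left half-done, here is a minimal sketch that enumerates every subset of the five file-creation updates that might be on disk after a crash. The step names and the "all or nothing" consistency rule are simplifying assumptions made for illustration, not a model of any real filesystem's invariants.

```python
from itertools import combinations

# Hypothetical labels for the five on-disk updates that creating a file needs.
CREATE_STEPS = ("inode", "dirent", "inode_bitmap", "block_bitmap", "superblock")

def is_consistent(landed):
    """Simplified rule: the create is consistent only if none or all updates landed."""
    return len(landed) in (0, len(CREATE_STEPS))

def crash_states():
    """Yield every subset of updates that could be on disk after a crash,
    assuming the stack may reorder the writes freely."""
    for k in range(len(CREATE_STEPS) + 1):
        for subset in combinations(CREATE_STEPS, k):
            yield frozenset(subset)

states = list(crash_states())
bad = [s for s in states if not is_consistent(s)]
print(len(states), "possible states,", len(bad), "inconsistent")
# prints: 32 possible states, 30 inconsistent
```

With unconstrained reordering, all but two of the reachable states are broken, which is why the mechanisms in the following sections all work by shrinking the set of states a crash can leave behind.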

A filesystem that wants to survive a power cut has to solve every one of these, not just some of them. And it has to do it in a way that is efficient enough that users are willing to pay the cost.

Journaling: The ext3/ext4 Answer

The dominant technique on Linux is journaling, and ext3 was the filesystem that made it mainstream. The core idea is simple: before updating any on-disk structure in place, first write a description of the change to a dedicated log area, then apply the change to its real location, then mark the log entry as done. After a crash, the recovery code walks the log, finds any entries that were written but not yet applied, and replays them. Any entry that was only partially written to the log itself is discarded, because it was never marked complete.

The ext4 journal is a circular buffer of fixed size, typically 128 MiB, embedded in the filesystem or in an external device. It is written as a sequence of transactions, each of which groups together all the metadata blocks touched by a single filesystem operation (or a batch of operations that happened to be in progress at the same time). A transaction consists of a descriptor block listing the blocks that will follow, the metadata blocks themselves, and a commit block with a CRC over the entire transaction. Only after the commit block is safely on disk is the transaction considered committed. The recovery code only replays transactions whose commit block is present and whose CRC checks out.
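The commit rule can be illustrated with a toy journal. This is a sketch under loud assumptions: the record layout, the function names, and the use of JSON are invented for the example, and the CRC here covers only the payload, whereas the real on-disk format is binary and more involved.

```python
import json
import zlib

def make_transaction(txid, blocks):
    """Build the journal records for one transaction: descriptor, payload, commit.
    `blocks` maps a block number to its new contents (bytes)."""
    payload = json.dumps({str(n): data.hex() for n, data in blocks.items()}).encode()
    return [
        ("descriptor", txid, sorted(blocks)),
        ("payload", txid, payload),
        ("commit", txid, zlib.crc32(payload)),
    ]

def replay(journal, disk):
    """Apply every transaction whose commit record exists and whose CRC matches.
    Transactions without a valid commit record are silently ignored."""
    payloads, applied = {}, []
    for kind, txid, body in journal:
        if kind == "payload":
            payloads[txid] = body
        elif kind == "commit" and txid in payloads:
            if zlib.crc32(payloads[txid]) == body:
                for n, hexdata in json.loads(payloads[txid]).items():
                    disk[int(n)] = bytes.fromhex(hexdata)
                applied.append(txid)
    return applied

journal = make_transaction(1, {7: b"inode v2", 9: b"dirent v2"})
journal += make_transaction(2, {7: b"inode v3"})[:2]  # crash before tx 2's commit record
disk = {7: b"inode v1", 9: b"dirent v1"}
print(replay(journal, disk), disk[7])
# prints: [1] b'inode v2'
```

Transaction 2 never reached its commit record, so recovery leaves block 7 at the contents written by transaction 1. Replaying the same journal a second time produces the same disk state, which is what makes recovery safe to interrupt and restart.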

There is a subtlety about what "on disk" means. The filesystem submits the descriptor, data, and commit blocks to the block layer, but the block layer and the drive might still be holding them in volatile caches. To enforce the rule "commit block must hit the platter after the data blocks", ext4 issues a write barrier before the commit block. A barrier tells the block layer "flush everything you currently have in flight before continuing, and do not reorder across this point". On modern kernels, barriers are implemented as REQ_PREFLUSH (flush the drive's write cache before this request) and REQ_FUA (Force Unit Access, meaning this specific write must be on stable storage before the drive acknowledges it). The combination gives the filesystem a strict ordering point that survives the various layers of reordering below it.

Ext4 supports three journaling modes, each a different trade-off between safety and performance:

  • data=journal: both data and metadata go through the journal. Every single write is written twice: once to the journal, once to its real location. This is the safest mode and the slowest, because it doubles the write load.
  • data=ordered (the default): only metadata goes through the journal, but the filesystem guarantees that data blocks for a transaction are committed to their real location before the corresponding metadata commit is written to the journal. This means that after recovery, metadata always points at the correct data (or at old allocated blocks that will be freed on recovery), never at junk.
  • data=writeback: only metadata goes through the journal, and there is no ordering constraint on data. After a crash, metadata is consistent but may point to data blocks that contain stale or garbage content, because the data writes were not yet flushed.

Most users run data=ordered because it catches the common failure modes without the double-write cost of data=journal. The main exception is database servers, which often disable journaling altogether on the filesystem level because they implement their own durability scheme and do not want the filesystem's extra writes.

The journal covers metadata integrity beautifully. After any crash, ext4 mounts in a few hundred milliseconds, replays any pending transactions, and presents a consistent view of the filesystem. The days of ten-minute fsck runs after a power cut are over on journaling filesystems, as long as the journal itself is intact, which is almost always the case.

Copy-On-Write: The btrfs And ZFS Approach

Journaling solves the problem by writing things twice. Copy-on-write filesystems take a different approach: never overwrite anything, instead always write new data to free space and atomically update a single pointer at the end. The classic CoW filesystems on Linux and BSD are btrfs and ZFS, both of which descend from an older lineage that includes Network Appliance's WAFL.

A CoW filesystem is built around a tree. The root of the tree points at the current version of the filesystem. Every file, directory, extent map, and free space map is reachable from that root. To modify any block in the tree, the filesystem writes the new version to a previously-free location, then writes any parent blocks (the directory, the inode list, all the way up to the root) to new free locations as well, each one pointing at the new version of its child. Finally, the root is updated to point at the new top-level block, which implicitly makes all the new versions visible at once.

The key property is that the update of the root is a single atomic write. Everything below it is already on stable storage; the root either still points at the old tree, or it points at the new tree, and there is no intermediate state where half the filesystem has been updated. A crash during the intermediate writes (of child and parent blocks) leaves a partially-written new tree floating in the free space, but because the root is unchanged, the filesystem still sees the old, fully-consistent tree. On next mount, the orphan blocks are discovered and freed.
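A two-level toy version of this makes the atomic root swap concrete. Everything here is illustrative: the CowStore class and its method names are hypothetical, the "tree" is a single directory node pointing at leaves, and real filesystems add checksums, free-space tracking, and deeper trees.

```python
class CowStore:
    """Blocks are never modified after being written; only `root` is overwritten."""

    def __init__(self):
        self.blocks = {}                   # address -> contents
        self.next_addr = 0
        self.root = self.write_block({})   # start with an empty directory node

    def write_block(self, contents):
        """Write contents to a fresh address and return that address."""
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = contents
        return addr

    def prepare_update(self, name, data):
        """Write the new leaf and a new copy of the directory node to free space.
        Returns the new root address without making it visible."""
        leaf = self.write_block(data)
        new_dir = dict(self.blocks[self.root])
        new_dir[name] = leaf
        return self.write_block(new_dir)

    def publish(self, new_root):
        """The single pointer update that makes the new tree visible."""
        self.root = new_root

    def read(self, name, root=None):
        """Read a file through the current root, or through an older root."""
        directory = self.blocks[self.root if root is None else root]
        return self.blocks[directory[name]]

store = CowStore()
store.publish(store.prepare_update("notes.txt", b"v1"))
snapshot = store.root                                 # keeping an old root is a snapshot
pending = store.prepare_update("notes.txt", b"v2")    # new blocks exist, root untouched
print(store.read("notes.txt"))                        # a crash here still sees b'v1'
store.publish(pending)
print(store.read("notes.txt"), store.read("notes.txt", root=snapshot))
```

The blocks written by prepare_update are invisible until publish runs, so an interruption between the two leaves only unreferenced blocks behind. The snapshot variable shows why snapshots cost nothing extra in this design: the old root still reaches the old leaf.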

The CoW approach has beautiful properties. Snapshots are free, because you just keep an old root pointer around and never free the blocks it references. Checksumming every block is cheap, because you are writing new blocks anyway and can compute the checksum at write time. Rollback is trivial. And there is no "journal" as such: the filesystem is the log.

But there are costs. Write amplification is inherently high, because every update to a deep leaf propagates all the way up to the root. Every 4 KiB write might cause multiple metadata blocks to be rewritten. Free space fragments over time, because writes are scattered across the disk and the free pool becomes a patchwork. Sequential read performance on old data degrades as the logical layout drifts from the physical one. And the overhead of the tree metadata itself is significant; btrfs metadata can consume several percent of the disk on a heavily-used filesystem.

On modern flash, the write amplification is less of a concern (the SSD's own FTL is already doing something similar under the hood), but on spinning disks the seek costs can be brutal, which is why CoW filesystems have been slower to displace journaling ones on traditional storage.

ZFS adds another layer with the ZFS Intent Log (ZIL). Because CoW alone does not guarantee that every individual write() call is durable (the new tree only becomes visible when the next root update happens, which can be seconds later), ZFS maintains a separate log for writes that have been fsync'd but not yet folded into a new root. After a crash, the ZIL is replayed into the old tree before the filesystem becomes writable. This gives ZFS the "instant durability" that journaling filesystems provide without giving up the CoW structure for the bulk of data.

Log-Structured Filesystems: The Extreme Case

A log-structured filesystem (LFS) takes the "never overwrite" idea to its logical conclusion. The entire filesystem is a single circular log of segments. Every write, regardless of whether it is a new file, an overwrite of an existing file, a metadata update, or a tree rebalance, is appended to the head of the log. Old blocks become invalid as the log moves past them, and a background cleaner reclaims segments whose live content has dropped below a threshold by copying the remaining live blocks to the head and marking the source segment free.

The original LFS was proposed by Mendel Rosenblum and John Ousterhout at Berkeley in 1992, partly as a response to the observation that disk reads were increasingly being absorbed by caches (so disks were mostly doing writes) and partly as a way to exploit sequential write bandwidth, which was growing faster than seek times. The Sprite LFS implementation showed that a log-structured filesystem could outperform FFS on write-heavy workloads by a factor of two or three, at the cost of slower reads and higher cleaner overhead.

On Linux, the dominant log-structured filesystem is f2fs, designed by Samsung specifically for NAND flash. On phones and embedded devices, f2fs is often the default. We covered its design in detail in a separate article on Android flash storage, but the power-cut story is worth revisiting here. f2fs maintains a checkpoint structure at the head of the log, and every few seconds (or when explicitly requested by fsync) it writes a new checkpoint that commits the current state. After a crash, the recovery code finds the most recent valid checkpoint and rolls forward any writes that were made after it. Because writes are strictly append-only, roll-forward is a simple operation: replay any log entries with valid checksums that appear after the checkpoint.
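The checkpoint-plus-roll-forward idea can be sketched generically. This is not f2fs's actual on-disk format: the key=value entry encoding, the function names, and the choice to stop at the first entry with a bad checksum are all assumptions made for the example.

```python
import zlib

def entry(key, value):
    """Encode one log entry as (body, checksum)."""
    body = f"{key}={value}".encode()
    return body, zlib.crc32(body)

def recover(checkpoint_state, log_tail):
    """Start from the last valid checkpoint and roll forward through the log.
    Assumption: the first entry that fails its checksum marks the torn end of
    the log, so recovery stops there."""
    state = dict(checkpoint_state)
    for body, crc in log_tail:
        if zlib.crc32(body) != crc:
            break
        key, value = body.decode().split("=", 1)
        state[key] = value
    return state

checkpoint = {"a": "1"}
torn = (b"c=9", zlib.crc32(b"c=9") ^ 1)   # simulated partially written entry
tail = [entry("a", "2"), entry("b", "1"), torn]
print(recover(checkpoint, tail))
# prints: {'a': '2', 'b': '1'}
```

Because the log is append-only, recovery never has to reason about which in-place update won; it only has to decide where the valid prefix of the log ends.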

The main disadvantage of pure log-structured layouts is the cleaner. Old segments have to be reclaimed, and that means reading the live blocks out and writing them back to the head, which competes for bandwidth with the live workload. On a nearly-full filesystem the cleaner can become the dominant cost, and the original Sprite LFS papers spend a lot of time on cleaner policies. On flash, the underlying FTL is already doing its own garbage collection, which doubles up on the work: the filesystem cleaner moves blocks, and then the FTL sees those moves as new writes and runs its own cleaner on them. f2fs has been optimised to minimise this double-up by aligning its segment boundaries with the FTL's erase blocks and by using TRIM aggressively so the FTL knows which blocks are free.

fsync: The Contract Between Application And Kernel

All of the above is filesystem machinery. Applications interact with the filesystem through a much simpler interface: write(), fsync(), fdatasync(), rename(). Understanding what each of these actually promises is where most durability bugs live.

write() returns as soon as the data has been copied from the application's buffer into the kernel's page cache. At that point the data exists only in RAM. Crash now and it is gone. The kernel will eventually flush the dirty pages to disk (by default the flusher threads wake every 5 seconds, per the dirty_writeback_centisecs tunable, and write out pages that have been dirty for longer than dirty_expire_centisecs, which defaults to 30 seconds; writeback also starts when the fraction of dirty pages exceeds a threshold), but until then there is no durability whatsoever. write() is a fast, cheap operation that makes no promises about power-cut safety.
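The writeback settings on a given machine can be inspected directly. The sketch below reads a few of the standard vm sysctls from /proc/sys/vm; the function name and the `base` parameter (there so the function can be pointed at another directory) are choices made for this example, and files that do not exist are simply skipped.

```python
from pathlib import Path

def writeback_tunables(base="/proc/sys/vm"):
    """Return the kernel's dirty-page writeback settings found under `base`."""
    names = (
        "dirty_writeback_centisecs",   # how often the flusher threads wake up
        "dirty_expire_centisecs",      # how old a dirty page must be to be written
        "dirty_background_ratio",      # % of memory dirty before background writeback
        "dirty_ratio",                 # % of memory dirty before writers are throttled
    )
    result = {}
    for name in names:
        path = Path(base) / name
        if path.exists():
            result[name] = int(path.read_text().strip())
    return result

if __name__ == "__main__":
    for name, value in writeback_tunables().items():
        print(f"{name} = {value}")
```

Values ending in _centisecs are in hundredths of a second, so divide by 100 to get seconds.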

fsync(fd) is the operation that promises durability. It tells the kernel: "make sure every dirty page associated with this file is on stable storage before you return, and make sure the file's metadata (size, modification time, block allocation) is also on stable storage". The kernel flushes the dirty pages through the block layer, issues any necessary write barriers, and only returns when the underlying storage has acknowledged that everything is stable. At the point fsync returns, the file really is durable in the sense that a crash immediately after would leave the file exactly as it was at the moment of the call.

fdatasync(fd) is the same as fsync except that it does not necessarily flush metadata that is not needed to find the file's data. If the file's size is unchanged and only the data has changed, fdatasync can skip the metadata write, which is faster. Most databases use fdatasync rather than fsync for this reason, and they allocate file space in large chunks up front to avoid size changes during normal operation.
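The preallocate-then-fdatasync pattern looks roughly like this in Python. The helper names are hypothetical, and the fallbacks (ftruncate when posix_fallocate is unavailable or fails, fsync when fdatasync is missing) are portability assumptions rather than anything a particular database does.

```python
import os

def open_preallocated(path, size):
    """Create a file of `size` bytes up front so later writes do not change its size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)    # reserve real blocks where supported
    except (AttributeError, OSError):
        os.ftruncate(fd, size)             # fallback: sets the size but may leave holes
    os.fsync(fd)                           # make the size and allocation durable once
    return fd

def write_record(fd, offset, record):
    """Write `record` at `offset` and force the data to stable storage."""
    os.pwrite(fd, record, offset)
    # fdatasync may skip metadata that is not needed to read the data back.
    sync = getattr(os, "fdatasync", os.fsync)
    sync(fd)
    return offset + len(record)
```

A caller would open the file once with open_preallocated(path, size) and then call write_record for each record, carrying the returned offset forward. Note that the ftruncate fallback produces a sparse file, in which case the first write to each region still allocates blocks and fdatasync has metadata to flush.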

The subtlety is what exactly "on stable storage" means. At the filesystem level, ext4's fsync walks the dirty inode list, triggers a journal commit for any pending transactions involving this file, and waits for the commit to flush through the block layer with the appropriate barriers. This is the correct behaviour, and it matches what applications expect. But there is a complication two layers down: what happens at the block layer when the write is actually in flight.

The block layer maintains its own request queue per device and submits requests to the drive. The drive has its own write cache, which is typically DRAM backed up by either a supercapacitor (on enterprise drives) or nothing (on consumer drives). When the block layer sends a write, the drive can choose to acknowledge it immediately from its DRAM cache, without actually committing to the media yet. If the power drops at that moment, the cached write is lost, even though the application and the kernel both thought it was committed.

This is what the REQ_PREFLUSH and REQ_FUA bits are for. REQ_PREFLUSH says "before processing this request, flush all previously-acknowledged writes from your cache to the media, and only start this request after the flush is complete". REQ_FUA says "for this specific request, do not acknowledge until it is actually on the media, bypassing the cache acknowledgement". Together, they give the filesystem a way to insist on real durability rather than cached-durability, at the cost of a significant performance hit (a cache flush is expensive on any SSD and brutally so on a spinning disk).

A correctly behaving fsync on ext4 issues these barriers. At the end of the call, the data is genuinely on the media, and a power cut cannot undo it. The price is latency: a single fsync on a consumer NVMe drive typically takes 1 to 10 milliseconds, and on a spinning disk it can take tens of milliseconds or more.
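The latency is easy to measure on your own hardware. This is a rough sketch, not a benchmark: the function name, the 4 KiB write size, and the sample count are arbitrary choices, and the numbers depend heavily on the filesystem and device behind the chosen directory (a tmpfs, for example, will report near-zero times because nothing reaches a disk).

```python
import os
import tempfile
import time

def fsync_latencies(directory, count=20, size=4096):
    """Append `size` bytes and fsync, `count` times; return each fsync's duration in seconds."""
    samples = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            os.write(fd, b"x" * size)
            start = time.perf_counter()
            os.fsync(fd)
            samples.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return samples

if __name__ == "__main__":
    samples = sorted(fsync_latencies("."))
    print(f"median fsync: {samples[len(samples) // 2] * 1000:.2f} ms")
```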

The PostgreSQL fsync Scandal

Which brings us to the part of the story that shook the database world in 2018. Until then, almost every database was built on the assumption that if fsync() returns successfully, the data is durable. And if fsync() returns an error, the database can retry it, or at worst crash and recover from its write-ahead log.

The scandal was that this assumption was wrong on Linux.

The actual Linux semantics, which had been accidentally documented in the fsync man page but not widely understood, were these: if a dirty page in the kernel's page cache fails to write back to disk (because the disk returned an I/O error, for example), the kernel marks the page as clean (because it tried and failed, and does not want to try again indefinitely), records the error as a flag on the file's in-memory state, and then reports the error exactly once, to the next fsync call on that file. After that report, the error is cleared. Any subsequent fsync on the same file, including one issued by a different process through its own file descriptor, will succeed, even though the data is permanently lost and there is nothing in the kernel to retry it.

The consequence is that a process that did not observe the error (because it did not have that file descriptor open, or because it was a different process entirely, or because the error was cleared by someone else before this process got to it) has no way of knowing that a previous write failed. Its subsequent fsync returns success, the process concludes that the data is safe, and the data is in fact gone.

PostgreSQL's architecture made it particularly vulnerable. It uses multiple backend processes, and each backend can open a file and call fsync on it independently. If a backend issued a write, exited, and a different backend later called fsync on the same file, the error from the first backend's write could have been cleared before the second backend's fsync ran, and the second backend would see success. Meanwhile the write-ahead log would have been committed under the assumption that the data was durable, and a subsequent crash would recover to an inconsistent state.

The bug was discovered by Craig Ringer and Thomas Munro in early 2018, and the write-up (titled something like "PostgreSQL's fsync() surprise") traced exactly how the Linux kernel's error reporting model diverged from what PostgreSQL had assumed. Fixing it was hard. The Linux kernel could not simply change its behaviour (because other software depended on the existing semantics), and PostgreSQL could not simply retry failed writes (because the kernel had already thrown away the pages). The eventual fix was a combination of things. PostgreSQL was modified to PANIC on any fsync error, treating it as a fatal crash that requires recovery from the WAL. On the kernel side, the errseq_t mechanism introduced in Linux 4.13 had already made error reporting more robust, so that every file descriptor open at the time of a writeback error sees that error on its next fsync, and later kernels extended this so that a descriptor opened after an error that nobody has yet observed still sees it. And database operators were educated that fsync-returns-success is a weaker guarantee than they had thought.
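The "never retry a failed fsync" policy can be sketched in a few lines. This is an illustration of the idea, not PostgreSQL's code: fsync_or_die is a hypothetical helper, and its `fsync` and `die` parameters exist only so the behaviour can be exercised without a failing disk.

```python
import os
import sys

def fsync_or_die(fd, fsync=os.fsync, die=os.abort):
    """Flush `fd`; if the flush fails, stop the process instead of retrying.

    After a failed fsync the kernel may already have dropped the dirty pages,
    so a later fsync that succeeds proves nothing about the lost data. The
    only safe reaction is to crash and recover from a log that is known good.
    """
    try:
        fsync(fd)
    except OSError as exc:
        sys.stderr.write(f"fsync failed ({exc}); aborting instead of retrying\n")
        die()

if __name__ == "__main__":
    fd = os.open("example.dat", os.O_WRONLY | os.O_CREAT, 0o600)
    os.write(fd, b"committed record\n")
    fsync_or_die(fd)
    os.close(fd)
```

The important design choice is what the helper does not do: it has no retry loop, because a retry would report success for data that may no longer be in the page cache.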

The broader lesson was that durability is a whole-system property. A filesystem that does everything right can still be defeated by a kernel bug, a drive that lies about cache flushes, a bug in the SCSI driver, or a subtle semantic mismatch between the application's expectations and the POSIX syscall interface. The only robust approach is to understand every layer of the stack and to test end-to-end with real power cuts, not just simulate them.

Write Barriers, FUA, And The Drive Cache Problem

The problem of drives lying about fsync deserves its own section because it is the layer where most real-world data loss in history has actually happened.

Until the late 2000s, consumer SATA drives frequently shipped with write caching enabled by default and no honest way for the operating system to flush the cache. The ATA command set had a FLUSH CACHE command, but some drives silently ignored it, some drives implemented it but also cached the flush itself (so a crash between the flush and the power-off would still lose the cached writes), and some drives returned success from FLUSH CACHE before actually flushing anything. The Linux kernel had to assume that FLUSH CACHE worked correctly, and most of the time it did, but the exceptions were common enough to cause real data loss in real deployments.

The situation got better when the kernel adopted the REQ_PREFLUSH and REQ_FUA bits as part of the block layer's request model and started using them aggressively for filesystem journal commits. Drives that lied about flushes were tested and blacklisted. Enterprise drives shipped with supercapacitors or battery-backed RAM so that even if the drive acknowledged a write from its cache, it could still commit the cache to the media during a power loss. Consumer drives without that protection were increasingly marketed as "desktop" drives with the caveat that they might lose data on power cuts.

SSDs introduced a new complication. Modern SSDs have large DRAM caches for the FTL's metadata and for write buffering, and the same lying-about-flushes problem could happen at that level. Enterprise SSDs include supercapacitors that can power the entire drive long enough to flush the DRAM cache back to NAND on a power loss (this is called "power loss protection" or PLP in the datasheets). Consumer SSDs often do not, and they can lose data on a power cut even if the kernel does everything right, because the data the kernel last flushed was sitting in the drive's DRAM and never made it to the NAND.

The practical result is that if you care about durability on a laptop with a consumer SSD, you cannot fully trust the drive. A well-designed filesystem like ext4 minimises the damage by making sure every commit is accompanied by a real FLUSH, but if the drive's FLUSH is a lie, the filesystem has no way to know. This is why enterprise and financial databases almost universally run on hardware with PLP and why Google's early data centre papers make such a big deal of testing drive behaviour under power loss.

The current state on Linux is that fsync does honestly pass through barriers to the drive, the drive mostly honours them, the filesystem mostly does the right thing at the journal level, and applications that call fsync correctly mostly get the durability they expect. The "mostly" is where the interesting bugs live.

Rename-Over-Write: The Application Pattern

There is a whole class of durability bugs that have nothing to do with the filesystem's internal guarantees and everything to do with how applications structure their writes. The classic failure pattern looks like this:

import json
# Anti-pattern: opening with "w" truncates config.json before anything new is written.
with open("config.json", "w") as f:
    f.write(json.dumps(new_config))

The programmer believes this atomically updates config.json to the new contents. In fact it does something much weaker. First, the open call in write mode truncates the existing file to zero length. Now the file is empty on disk (assuming the metadata has been committed). Then the write call copies the new contents into the kernel's page cache. Then the close call returns, without flushing anything to disk. The kernel will eventually flush, but there is a window (typically up to 30 seconds) during which the file on disk is either empty, or contains a partial new version, or contains the full new version, depending on exactly when the flush happens.

If the power drops during that window, the user can end up with an empty file where their carefully-crafted configuration used to be. This is not a filesystem bug. It is an application bug: the application asked for the file to be truncated and then refilled, and the power-cut happened between the truncate and the refill.

The correct pattern is rename-over-write:

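import json
import os

with open("config.json.tmp", "w") as f:
    f.write(json.dumps(new_config))
    f.flush()                # push Python's userspace buffer into the kernel
    os.fsync(f.fileno())     # force the temporary file's data to stable storage
os.rename("config.json.tmp", "config.json")  # atomically swap the name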

This writes the new contents to a temporary file, forces them to disk with fsync, and then atomically renames the temporary file over the target. On POSIX filesystems, rename is guaranteed to be atomic: at any moment, the name config.json either refers to the old file or the new file, never to nothing or to something in between. If a power cut happens during the write to the temporary file, the old file is untouched. If it happens after the rename, the new file is in place. There is no window during which the user's configuration can be lost.

Even this is not quite enough on all filesystems. The atomicity of rename is a promise about the name, not about the durability of the new file. If the rename's metadata commit reaches the disk before the new file's data does, a crash can leave the directory entry for config.json pointing at the new inode while that inode's data blocks are not fully on disk yet. In ordered mode, ext4 writes a file's allocated data blocks before committing the metadata that points at them, but with delayed allocation the new file's blocks may not even have been allocated when the rename commits, so the ordering is only certain when the new file has been fsync'd before the rename. If the application skipped the fsync, the rename might commit before the data, and a crash leaves the directory pointing at an inode with empty or incomplete contents.

The canonical fix, applied around 2009, was the ext4 auto_da_alloc mount option (now the default), which forces the new file's delayed-allocation blocks to be allocated and written out when it is renamed over an existing file, so that the data is flushed before the rename transaction commits. This was added after a wave of complaints from desktop users whose application configuration files were turning up empty after crashes, and it means that modern ext4 protects against the most common application-level mistake even when the application itself did not do the right thing.
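Putting the pieces together, a reusable helper might look like the sketch below. The function name atomic_write and the ".tmp" suffix are choices made for this example. It adds one step not shown earlier: an fsync on the containing directory, which on Linux is the usual way to make the rename itself durable rather than merely atomic.

```python
import os

def atomic_write(path, data):
    """Replace `path` with `data` so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"           # must live on the same filesystem as `path`
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()                      # userspace buffer -> kernel page cache
        os.fsync(f.fileno())           # page cache -> stable storage
    os.replace(tmp_path, path)         # atomic rename over the target
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)               # persist the directory entry change
    finally:
        os.close(dir_fd)

if __name__ == "__main__":
    atomic_write("config.json", b'{"theme": "dark"}')
```

A fixed temporary name is fine for a single writer; concurrent writers would need unique temporary names (for example from tempfile.mkstemp in the same directory) to avoid overwriting each other's half-written files.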

The fsync Storm And How Databases Handle It

Databases that care about durability (which is all of them, for transactional workloads) have to call fsync after every committed transaction. On a fast SSD that is a few milliseconds per transaction. On a network-attached storage device it can be tens of milliseconds. Serialising every commit through a sequential fsync limits transaction throughput to a few hundred per second on consumer hardware, which is often too slow.

The solution is group commit. Instead of calling fsync after each transaction independently, the database batches committed transactions that are ready to flush and issues a single fsync that commits them all at once. The first transaction in a batch pays the full fsync latency; subsequent transactions that arrive during that fsync wait and are committed for free when the fsync returns. This turns a sequence of small fsyncs into one larger fsync and dramatically improves throughput on workloads with many concurrent writers.
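A minimal group-commit sketch using threads is shown below. It is an illustration of the batching idea only: the GroupCommitLog class is hypothetical, the `flush` callable stands in for "write the batch to the log file and fsync it", and a failed flush is treated as permanently fatal for the log, in line with the lesson of the previous section.

```python
import threading

class GroupCommitLog:
    def __init__(self, flush):
        self._flush = flush        # callable taking a list of records (write + fsync)
        self._cond = threading.Condition()
        self._pending = []         # records not yet handed to a flush
        self._next_seq = 0         # sequence number for the next record
        self._durable_seq = 0      # every record with seq < this is durable
        self._flushing = False
        self._failed = False
        self.flush_calls = 0

    def commit(self, record):
        """Block until `record` is durable. One caller at a time acts as leader
        and flushes everything queued so far; the others wait for its result."""
        with self._cond:
            seq = self._next_seq
            self._next_seq += 1
            self._pending.append(record)
            while self._durable_seq <= seq:
                if self._failed:
                    raise RuntimeError("log flush failed; record is not durable")
                if self._flushing:
                    self._cond.wait()          # another thread's flush may cover us
                    continue
                batch, self._pending = self._pending, []
                upto = self._next_seq          # this flush covers all seq < upto
                self._flushing = True
                self._cond.release()           # do the slow I/O without the lock
                ok = False
                try:
                    self._flush(batch)
                    ok = True
                finally:
                    self._cond.acquire()
                    self._flushing = False
                    if ok:
                        self._durable_seq = upto
                        self.flush_calls += 1
                    else:
                        self._failed = True
                    self._cond.notify_all()
```

With a real log file, `flush` could be something like a function that calls os.write(fd, b"".join(batch)) followed by os.fsync(fd). Records that arrive while a flush is in progress queue up and are all covered by the next flush, so under load the number of fsync calls grows much more slowly than the number of commits.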

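PostgreSQL gets group commit from its WAL (write-ahead log) flush path: a committing backend flushes the WAL up to its own commit record, and that single flush also covers any records other backends wrote before it, so backends that arrive during the flush often find their commit already durable. The commit_delay and commit_siblings settings can widen the batch, and a separate walwriter process flushes WAL in the background on behalf of asynchronous commits. MySQL's InnoDB does something similar with the innodb_flush_log_at_trx_commit tunable, which controls whether each commit forces an immediate flush or allows grouping. SQLite has a PRAGMA synchronous = NORMAL mode that skips some fsyncs to improve throughput at the cost of potentially losing recently-committed transactions on a crash.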

The trade-off is real. Full fsync-per-commit gives the strongest guarantee (every successfully committed transaction survives any crash) but the lowest throughput. Group commit gives essentially the same guarantee with much higher throughput, but the code is harder to get right because the database has to carefully track which transactions have been included in which fsync batch. Skipping fsyncs entirely gives the highest throughput but can lose the last few seconds of committed transactions on a crash, which is often unacceptable for financial workloads but fine for things like web session stores.

Checksums: Catching What The Hardware Misses

All of the above assumes that when the filesystem reads a block from disk, it gets back the same bytes it wrote. This is not always true. Bit rot happens on spinning disks. Cosmic rays flip bits in RAM. Cable electrical noise corrupts SATA transfers. Firmware bugs write to the wrong sector. A filesystem that trusts the hardware unconditionally can be led to corrupt user data by any of these failures, silently, with no indication that anything has gone wrong.

Modern filesystems increasingly use checksums to catch these errors. ZFS checksums every block (data and metadata) with a hash that is stored in the parent block, so reading any block implicitly verifies that it matches what was written. A mismatch triggers either a read from a redundant copy (if one exists, as in a ZFS mirror or RAID-Z) or a hard I/O error. Btrfs does the same. f2fs checksums metadata blocks and, optionally, data blocks. Ext4 added metadata checksums in 2012 (feature flag metadata_csum) and they are now enabled by default on new filesystems.

The checksums do not prevent corruption, but they turn silent corruption into detectable corruption. That is a huge improvement: a database that hits a checksum error can fail the read and report the error, rather than returning wrong data to the application. Combined with ECC memory, proper fsync, and regular backups, filesystem checksums give an end-to-end integrity guarantee that was impossible in the FAT32 era.
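The "checksum lives in the parent pointer" idea fits in a few lines. This is a sketch of the principle only: the dictionary standing in for the disk, the function names, and the choice of SHA-256 are assumptions made for the example, not the hash or layout any of these filesystems actually uses.

```python
import hashlib

class ChecksumError(IOError):
    """Raised when a block's contents do not match the checksum in its pointer."""

def write_block(store, addr, data):
    """Store `data` and return a pointer that carries the block's checksum."""
    store[addr] = data
    return addr, hashlib.sha256(data).digest()

def read_block(store, pointer):
    """Fetch a block through its pointer, verifying it on the way."""
    addr, expected = pointer
    data = store[addr]
    if hashlib.sha256(data).digest() != expected:
        raise ChecksumError(f"block {addr} failed verification")
    return data

disk = {}
ptr = write_block(disk, 3, b"payload")
print(read_block(disk, ptr))          # prints: b'payload'
disk[3] = b"paylaod"                  # simulate silent corruption on the media
try:
    read_block(disk, ptr)
except ChecksumError as exc:
    print("detected:", exc)
```

Keeping the checksum in the parent rather than next to the data means a write that lands in the wrong place, or never lands at all, is also caught, because the stale or misplaced block will not match what its parent expects.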

XFS And NTFS: Two More Journal Shapes

Ext4 is not the only journaling filesystem in serious use. XFS, originally developed by SGI in the early 1990s and still the default on Red Hat Enterprise Linux, takes a different approach to the same problem. XFS keeps an intent log rather than a physical block log: instead of writing the full new version of every modified metadata block into the journal, it writes a small structured record describing the change ("allocate this inode", "insert this directory entry", "extend this file"). Recovery replays the intents against the on-disk state, which is compact enough that even heavy metadata churn fits in a relatively small log.

XFS was designed around the assumption of very large filesystems with many parallel writers, and its journal is sharded across multiple in-memory log buffers so that many CPUs can push intents simultaneously without contending for a lock. On a 32-core machine pushing small file creates, XFS can sustain several hundred thousand operations per second, and its journal remains a stream of compact intents that flushes to disk in a single sequential write per commit. The cost is that the recovery logic is more complex than ext4's: the recovery code has to understand every intent type and know how to reapply it, and bugs in intent handling can produce subtle post-recovery inconsistencies. XFS has had a few such bugs over the years, most memorably the 2020 inode cache flush bug that caused rare metadata loss on crashes during heavy delete workloads.

NTFS, the Windows filesystem, is also a journaling filesystem, but its journal (the $LogFile) is more like XFS's intent log than ext4's physical block log. NTFS journals not the modified blocks but the logical operations that changed them, as "undo/redo" records. Recovery replays the redo side of any committed transaction and undoes any in-progress transaction that did not commit. NTFS also maintains an entirely separate log, the USN change journal, which records changes to files and directories on the volume and is consumed by search indexers, backup software, and file synchronisation tools; it is a change feed for applications, not part of crash recovery. The durability properties are broadly similar to ext4 in ordered mode: metadata is consistent after recovery, data blocks belonging to committed transactions are intact, and uncommitted writes are lost.
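The redo/undo recovery pass can be illustrated with a toy model. This is a sketch of the general technique, not NTFS's on-disk record format: the log is assumed to be a list of ("update", tx, key, old, new) and ("commit", tx) tuples, the "disk" is a dict, and uncommitted transactions are assumed never to have overwritten a key that a later committed transaction also changed.

```python
def recover(state, log):
    """Bring `state` to a consistent point using a logical undo/redo log.

    Redo pass: reapply every update that belongs to a committed transaction,
    in log order. Undo pass: walk the log backwards and restore the old value
    for every update whose transaction never committed.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _tx, key, _old, new = rec
            state[key] = new

    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _tx, key, old, _new = rec
            if old is None:
                state.pop(key, None)  # the key did not exist before
            else:
                state[key] = old
    return state
```

Both passes are idempotent in this model, so running recovery twice (for example after a second crash during recovery) yields the same state.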

One interesting detail about NTFS is how it handles the volume boot record and the system files ($MFT, $Bitmap, $Secure and so on). These files are stored as normal NTFS files internally, with their own journal entries and their own protection, but they have to exist before the volume can be mounted. NTFS solves the bootstrap with a small on-disk metadata region that points at the $MFT, plus redundant copies of the boot record for recovery. The trick is that everything above the boot record is journaled normally, so a crash during even the most deeply internal NTFS operation can be recovered by replaying the log.

SQLite: Rollback Journal Versus WAL Mode

Applications that embed their own transactional store (rather than delegating to a full database server) face all of the same durability problems at a much smaller scale. SQLite is the archetype: a single-file database library that powers phone apps, browser history, desktop config stores, and uncountable other things. Because SQLite has to be correct on whatever filesystem happens to be underneath it, its durability story is a good illustration of what an application can do on top of a generic filesystem.

SQLite has two main journaling strategies. In the older rollback journal mode, every transaction starts by copying the existing contents of the modified pages into a separate journal file, fsyncing the journal, then overwriting the database file in place, and finally deleting the journal. On a crash, if the journal file exists, SQLite knows a transaction was in progress; it replays the old pages from the journal to undo the partial changes, and the database is returned to its pre-transaction state. The sequence of fsyncs is the critical part: SQLite issues a fsync after writing the journal, another after writing the new pages to the database, and a third after truncating or deleting the journal to finalise the transaction. Each fsync enforces an ordering that recovery relies on.
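The ordering can be made concrete with a toy version of the protocol. This is a simplified sketch, not SQLite's real journal format (which has a header, per-page checksums, and more): it assumes fixed 4096-byte pages, that only existing pages are modified, and a journal named by appending "-journal" to the database path. The function names commit_pages and recover are hypothetical.

```python
import os

PAGE = 4096          # assumed page size
HEADER = 8           # each journal record: 8-byte page number + old page


def _fsync_dir(path):
    # Persist the directory entry (creation or deletion of the journal).
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)


def commit_pages(db_path, new_pages):
    """Overwrite pages in place, protected by a rollback journal."""
    journal = db_path + "-journal"
    db = os.open(db_path, os.O_RDWR)
    try:
        # 1. Save the old contents and make the journal durable first.
        jfd = os.open(journal, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            for page_no in sorted(new_pages):
                old = os.pread(db, PAGE, page_no * PAGE)
                os.write(jfd, page_no.to_bytes(HEADER, "big") + old)
            os.fsync(jfd)
        finally:
            os.close(jfd)
        _fsync_dir(journal)
        # 2. Only now is it safe to overwrite the database in place.
        for page_no, data in new_pages.items():
            os.pwrite(db, data, page_no * PAGE)
        os.fsync(db)
        # 3. Deleting the journal is the commit point.
        os.unlink(journal)
        _fsync_dir(journal)
    finally:
        os.close(db)


def recover(db_path):
    """If a journal survived the crash, copy the old pages back."""
    journal = db_path + "-journal"
    if not os.path.exists(journal):
        return False
    with open(journal, "rb") as jf:
        data = jf.read()
    record = HEADER + PAGE
    db = os.open(db_path, os.O_RDWR)
    try:
        for off in range(0, len(data) - record + 1, record):
            page_no = int.from_bytes(data[off:off + HEADER], "big")
            os.pwrite(db, data[off + HEADER:off + record], page_no * PAGE)
        os.fsync(db)
    finally:
        os.close(db)
    os.unlink(journal)
    _fsync_dir(journal)
    return True
```

If the first fsync were skipped, a crash could leave a half-overwritten database next to an incomplete journal, and recovery would have nothing trustworthy to roll back to.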

The rollback journal works, but it is slow. Every transaction pays for a full fsync on the journal, a write-then-fsync on the database, and a final fsync on the journal deletion. Small frequent transactions, like the ones a web browser does to update its history, become dominated by fsync latency.

SQLite WAL mode, added in version 3.7.0 in 2010, flips the architecture. Instead of writing new pages to the database and keeping a rollback journal to undo them on crash, SQLite appends new pages to a separate WAL file and leaves the database file untouched. Readers walk the WAL to find the latest version of each page; writers append new pages and fsync the WAL. Checkpoints periodically fold the WAL back into the database file. Because the WAL is append-only and the database file is only modified during checkpoints, the hot path is a single sequential write plus fsync, and throughput improves by an order of magnitude for small transactions. WAL mode also allows concurrent readers during a writer, which rollback mode does not.
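Switching an existing database to WAL mode is a one-line pragma. The sketch below uses Python's standard sqlite3 module; the helper names are hypothetical. One assumption worth stating: the "fsync the WAL on every commit" behaviour described above corresponds to synchronous=FULL. With synchronous=NORMAL in WAL mode, SQLite syncs the WAL less often, so a power cut can roll back the most recent commits without corrupting the database.

```python
import sqlite3


def open_wal_database(path):
    """Open `path`, switch it to WAL mode, and return (connection, mode)."""
    conn = sqlite3.connect(path)
    # The pragma reports the mode actually in effect; it stays on the old
    # mode if WAL cannot be enabled (for example on an in-memory database).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Sync the WAL at every commit rather than only around checkpoints.
    conn.execute("PRAGMA synchronous=FULL")
    return conn, mode


def checkpoint(conn):
    """Fold the WAL back into the main database file and truncate it."""
    busy, _wal_frames, _moved = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    return busy == 0
```

While the connection is open, the WAL lives next to the database as a file with the suffix -wal; checking the returned mode rather than assuming "wal" is the cheap way to notice that the switch did not take.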

Both modes are power-cut safe, assuming the underlying fsync is honest. SQLite has historically been paranoid about fsync semantics and was one of the loudest voices in the 2018 fsync discussion that followed the PostgreSQL scandal. Its documentation still lists specific filesystems and mount options that have known issues, and it goes to considerable lengths to detect half-written WAL frames via checksums embedded in each frame.

O_DIRECT, io_uring, And The New Kernel Interfaces

So far the discussion has assumed standard buffered I/O through the page cache. But there are other ways to talk to storage, and they have different durability semantics that are worth knowing about.

O_DIRECT is an open flag that bypasses the page cache. Writes go directly from the application's buffer to the block layer, skipping the intermediate copy into kernel RAM. The semantics are different: an O_DIRECT write is still not durable when write() returns (the block layer still has the request queued), but the data is not in the page cache, so there is no concept of "dirty pages to be flushed later". Applications that use O_DIRECT typically also issue their own fsync or use the O_SYNC flag, which forces every write to block until it is on stable storage.
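A hedged sketch of an O_DIRECT write from Python follows. Two things it assumes that the text above does not spell out: O_DIRECT generally requires the buffer address, file offset, and length to be aligned to the device's block size (4096 bytes is assumed here), and some filesystems reject the flag with EINVAL, in which case the sketch falls back to ordinary buffered I/O. The function name write_direct is hypothetical.

```python
import errno
import mmap
import os

ALIGN = 4096  # assumed block size; real code should query the device


def _write_and_sync(path, flags, buf):
    fd = os.open(path, flags, 0o644)
    try:
        if os.write(fd, buf) != len(buf):
            raise OSError(errno.EIO, "short write")
        # Still required: O_DIRECT skips the page cache, not the drive cache.
        os.fsync(fd)
    finally:
        os.close(fd)


def write_direct(path, payload: bytes) -> bool:
    """Write `payload` zero-padded to a multiple of ALIGN.

    Returns True if O_DIRECT was used, False if the sketch had to fall back
    to buffered I/O.
    """
    size = max(ALIGN, -(-len(payload) // ALIGN) * ALIGN)
    buf = mmap.mmap(-1, size)  # anonymous mappings are page-aligned
    try:
        buf.write(payload)     # the remainder of the mapping stays zeroed
        base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
        direct = getattr(os, "O_DIRECT", 0)  # absent on some platforms
        if direct:
            try:
                _write_and_sync(path, base | direct, buf)
                return True
            except OSError as exc:
                if exc.errno != errno.EINVAL:
                    raise
        _write_and_sync(path, base, buf)
        return False
    finally:
        buf.close()
```

The padding means the file ends up a multiple of ALIGN bytes long; a real application would track the logical length separately.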

Databases that manage their own buffer pools (InnoDB and Oracle, for example) often use O_DIRECT because they do not want the kernel double-caching their data in the page cache. PostgreSQL is the notable exception: it has historically relied on buffered I/O and the kernel page cache for its data files. Databases that do go direct handle the caching, prefetching, and write-back themselves, and they want the kernel to get out of the way. The trade-off is that they lose the kernel's optimisations around read-ahead, write coalescing, and memory pressure handling, so they have to reimplement those themselves.

io_uring, introduced in Linux 5.1 in 2019, is a newer asynchronous I/O interface that lets applications submit I/O requests through a shared ring buffer rather than through individual syscalls. It is significantly faster than the older interfaces (POSIX aio_read/aio_write and Linux-native io_submit), because many requests can be submitted and reaped with a single syscall, or with none at all when submission-queue polling is enabled, and it supports almost every I/O operation in the kernel including fsync, fdatasync, and sync_file_range. The durability semantics are the same as for synchronous syscalls: an fsync submitted through io_uring has exactly the same guarantees as an fsync called directly. The difference is that the application can batch many requests (including dependent ones) into a single submission ring round-trip, which is a big win for high-throughput database workloads.

sync_file_range deserves a brief mention because it is dangerous. It is a Linux-specific syscall that forces a range of a file's dirty pages to be written to disk, but by default it does not issue any write barriers or cache flushes at the block layer. The data is handed to the drive, but the drive is free to keep it in its own cache. sync_file_range is faster than fsync because it skips the cache flush, but it provides no durability guarantee against a power cut: on a crash, the data might be lost even though sync_file_range returned success. The only legitimate use is to initiate writeback early so that a later fsync on the same file has less work to do. Using it as a drop-in replacement for fsync is a classic durability bug that is still being found in production codebases.
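The one legitimate pattern, early writeback followed by a real fsync, might look like the sketch below. Python has no wrapper for sync_file_range, so this reaches the libc function through ctypes; that binding, the flag value 2 for SYNC_FILE_RANGE_WRITE (taken from the Linux headers), and the helper names are assumptions of this example. The hint is treated as best-effort: if the call is unavailable or fails, nothing is lost, because the fsync at the end is the only durability point.

```python
import ctypes
import os

SYNC_FILE_RANGE_WRITE = 2  # start writeback of the range, do not wait

_libc = ctypes.CDLL(None, use_errno=True)
_sync_file_range = getattr(_libc, "sync_file_range", None)  # Linux only
if _sync_file_range is not None:
    _sync_file_range.argtypes = [
        ctypes.c_int, ctypes.c_int64, ctypes.c_int64, ctypes.c_uint,
    ]
    _sync_file_range.restype = ctypes.c_int


def hint_writeback(fd, offset, nbytes) -> bool:
    """Ask the kernel to start writing a dirty range. Never a durability point."""
    if _sync_file_range is None:
        return False
    return _sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) == 0


def append_durably(fd, chunks) -> None:
    """Append chunks, nudging writeback as we go, then fsync once at the end."""
    offset = os.lseek(fd, 0, os.SEEK_END)
    for chunk in chunks:
        written = os.write(fd, chunk)
        hint_writeback(fd, offset, written)  # optional optimisation only
        offset += written
    os.fsync(fd)  # the only call here that promises durability
```

Deleting the final fsync from append_durably would reproduce exactly the bug described above: every call would still return success, and a power cut could still lose the data.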

Testing For Real: Power-Cut Regression Is Hard

Because durability failures only manifest during a power cut, testing for them is genuinely difficult. You cannot write a unit test that reproduces the exact timing of a crash between two I/O operations, because the timing depends on hardware details you do not control. But there are a few approaches that do catch real bugs.

The classic tool for drive-level honesty is diskchecker.pl, a small Perl script written by Brad Fitzpatrick in 2005 that runs two processes: one on the device under test that writes a sequence of numbered blocks and calls fsync after each, and one on a separate machine that records which writes the first process reported as committed. At an arbitrary moment the operator physically yanks the power on the device under test. After the reboot, the two processes reconcile: every write that was reported as committed before the power cut must still be readable on the device. If any are missing, the drive is lying about fsync. Running diskchecker.pl against a fresh consumer SSD is a sobering experience; it used to catch lying drives routinely in the 2000s and still catches them occasionally in the 2020s.
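The core of that test is small enough to sketch. The code below is not diskchecker.pl's actual protocol; it is a simplified illustration in which the record format (an 8-byte sequence number) and the ack callback, which stands in for the network message to the second machine, are both assumptions.

```python
import os
import struct

RECORD = struct.Struct(">Q")  # one 8-byte big-endian sequence number


def write_and_ack(path, count, ack):
    """Append numbered records, reporting each one only after fsync returns."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for seq in range(count):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)   # the drive claims the record is on stable storage
            ack(seq)       # only now tell the observer it was committed
    finally:
        os.close(fd)


def missing_after_crash(path, acked):
    """Return acknowledged sequence numbers that are absent after reboot."""
    with open(path, "rb") as f:
        data = f.read()
    usable = len(data) - len(data) % RECORD.size  # ignore a torn final record
    found = {
        RECORD.unpack_from(data, off)[0]
        for off in range(0, usable, RECORD.size)
    }
    return sorted(set(acked) - found)
```

A non-empty result from missing_after_crash means some layer acknowledged an fsync for data that did not survive, which is the definition of a lying drive (or a lying layer above it).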

A more controlled approach uses fault injection in a virtualised environment. QEMU's blkdebug block driver lets you inject synthetic I/O errors mid-test (the -drive werror= and rerror= options then control whether those errors are reported to the guest, ignored, or pause the VM), and filesystem test suites like xfstests include a category of crash-consistency tests that run a workload, simulate a crash at various points, remount the filesystem, and verify that the result is one of the allowed post-crash states. xfstests generic/455 and its neighbours test ext4, XFS, and btrfs against a systematic crash schedule.

For applications, the dm-log-writes target in Linux's device-mapper layer records every write issued to a block device, and replay-log replays the log up to a specific point and then presents the resulting disk state for verification. This lets you test "if the crash happened right after this particular fsync, does my application recover correctly?" without actually crashing anything. Several database projects, including PostgreSQL and RocksDB, use dm-log-writes in their CI to catch durability regressions before they ship.

None of these tools is perfect, and the hardest-to-reproduce bugs are often the ones where the drive's firmware has a subtle cache-flush bug under a specific workload that only appears at scale. Google's 2019 paper "Characterizing, Exposing, and Understanding SSD Failures" documented a string of real-world firmware bugs they had found in production, including one where a particular enterprise SSD would lose recently-committed data if you issued too many TRIM commands in quick succession before the flush. Finding that required running workloads across thousands of drives for months; no unit test would have caught it.

What It Means In Practice

Pulling all of this together: when you unplug a modern Linux laptop with ext4, f2fs, or btrfs, here is what actually happens.

  1. The drive sees the power loss. If it has power-loss protection, it finishes writing its cache to the media. If it does not, anything in its cache is lost.
  2. The kernel's dirty pages that had not yet been flushed are lost with the kernel.
  3. On reboot, the filesystem mounts and runs recovery.
  4. For ext4, recovery walks the journal and replays every committed transaction whose metadata changes had not yet been written back to their final locations in the main filesystem. Metadata is now consistent. Data blocks that were owned by files in flushed transactions are intact. Data blocks for in-progress writes that were not fsync'd are possibly garbage, but the filesystem treats them as free space.
  5. For btrfs or ZFS, recovery walks the tree from the last good root. Any writes that had started but not reached a tree-level commit are invisible, because the filesystem atomically falls back to the last committed tree. The exception is fsync'd writes, which both filesystems record in a small separate log (the btrfs log tree, the ZFS intent log) and replay at mount.
  6. For f2fs, recovery finds the last valid checkpoint and rolls forward the fsync'd writes that were logged after it.
  7. Applications that called fsync on their writes get those writes back. Applications that did not call fsync may lose any of their recent writes.
  8. Applications that used rename-over-write to update files get the old version or the new version, never an empty or half-written file.
  9. The user logs in, the desktop appears, the file they were editing is either the last-saved version or a slightly older autosave, and life goes on.
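The rename-over-write pattern from step 8 can be sketched as follows. This is a minimal illustration rather than a hardened library routine: the helper name atomic_write is hypothetical, the temporary file is created by mkstemp with owner-only permissions (a real implementation would usually copy the original file's mode), and the final directory fsync is what makes the rename itself durable.

```python
import os
import tempfile


def atomic_write(path, data: bytes) -> None:
    """Replace `path` so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the rename is not atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # new contents durable before the rename
        os.replace(tmp, path)      # atomic switch from old to new
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Persist the directory entry so the rename survives a power cut.
    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Skipping the first fsync is the classic way to end up with an empty file after a crash: the rename can reach the disk before the data it points at.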

The reason this is so boring is that every layer does its job. The drive does not lie. The block layer passes barriers through. The filesystem honours fsync. The application uses the correct patterns. The kernel's recovery code knows how to pick up the pieces. When any one of those fails, the boring non-event becomes a corruption incident, and the debugging is brutal because the failure can be in any layer.

The history of filesystem design is largely a story of moving responsibility for durability from the application to the kernel, from the kernel to the filesystem, from the filesystem to the block layer, and from the block layer to the drive, while making each layer's promises more precise and more testable. We are not finished; new hardware keeps introducing new durability surprises, and new workloads keep finding new edge cases. But for the common case of "my laptop lost power while something was being saved", the guarantees are now strong enough that unplugging the laptop is not a data-loss event, as long as the app saving the file did the minimal right thing. That was not always true, and it took an enormous amount of work to get here.