How Write-Ahead Logging Actually Works
Try the interactive lab for this articleTake the quiz (6 questions · ~5 min)Databases make a promise that sounds simple and is anything but: once a transaction commits, it stays committed. You can pull the power cord, you can panic the kernel, you can trip the breaker in the server rack in Brussels in the middle of a write storm, and when the machine comes back up, every committed transaction will be exactly as it was, and every uncommitted transaction will be rolled back as if it had never existed. This property is the "D" in ACID, durability, and it is the reason databases are trusted with the kinds of data you cannot afford to lose.
Delivering on that promise is harder than it sounds. A database is juggling hundreds of pages of cached data in memory, sometimes written back to disk in batches for performance, sometimes left hot in the buffer pool for minutes or hours. If the machine crashes while some of those dirty pages are halfway to disk, you can end up with a torn write, or a mix of old and new pages that contradict each other, or worse. And even if every page made it to disk correctly, you still need to know, on restart, exactly which transactions had committed and which had not.
The universal answer to this, used by nearly every real database in existence, is a write-ahead log (WAL). The rule is brutally simple: before you modify any data page on disk, write a record of the change to a log, and make sure the log is physically on stable storage first. If the machine crashes later, recovery reads the log, replays the committed changes, and rolls back the incomplete ones. The data pages themselves can then be treated as a lazy, eventually-consistent cache of the truth, because the truth lives in the log.
This article explains how write-ahead logging actually works in the databases you probably use: PostgreSQL, InnoDB (the main MySQL and MariaDB engine), SQLite, and with a shorter detour, the architecture in RocksDB, LMDB, and other modern storage engines. We will cover the log record format, the commit protocol, checkpointing, the subtleties of what "on stable storage" really means, and the ways the protocol has been adapted for cloud and replication. By the end you should know why fsync matters, what pg_wal actually contains, and what happens inside those ten milliseconds between COMMIT and the acknowledgement you get back.
The Fundamental Problem
Imagine a database with no log at all. The database has a buffer pool in memory holding copies of pages from disk. A transaction comes in and updates a row. The page in the buffer pool is modified. Periodically, the database flushes dirty pages back to disk.
Now consider the failure modes. If the machine crashes after the page has been written back, the change is durable. If it crashes before the page has been written back, the change is lost, even though the transaction may have already told the client it committed. That alone is a violation of durability. Worse, crashes in the middle of writing a page can produce "torn writes", where part of the old page and part of the new page end up on disk together, a state that the database's data structures may not even be able to parse.
You can try to fix this by making every write synchronous: every update flushes its page to disk immediately, before acknowledging the transaction. This makes commits durable, but it is incredibly slow. Every update touches at least one page, and pages are typically 8 or 16 KiB, so even a small update triggers a 16 KiB disk write. Updates scattered across many pages become scattered disk writes. On a spinning disk, this is catastrophic: 100 random writes per second against a database that wants to do thousands of transactions per second. Even on fast NVMe, the write amplification is enormous.
The write-ahead log resolves this elegantly. Instead of flushing the modified data pages on every commit, flush a small log record describing the change. The log is written sequentially, which is fast on any storage medium: spinning disks hit their peak throughput on sequential writes, SSDs love sequential because it produces large, aligned writes that align with their internal garbage collection. A single disk flush of a few hundred bytes can durably commit a transaction, even if the actual data pages are still dirty in memory and will be written back lazily, minutes later.
This yields two rules that every WAL-based system follows, sometimes called the ARIES rules after the foundational paper by Mohan, Haderle, Lindsay, Pirahesh, and Schwarz at IBM in 1992.
Rule 1: Write-ahead logging. Before a data page is written to disk, any log records describing modifications to that page must already be on stable storage. The implication is that the log is always at least as recent as the data. If a page makes it to disk, its history is recoverable from the log.
Rule 2: Commit before the data. Before a transaction can be reported to the client as committed, its commit record must be on stable storage. The data pages the transaction modified need not be flushed yet; they will be flushed later by the background writer or by recovery.
These two rules together make the commit fast (one log flush per commit, not many data page flushes) and make recovery possible (the log contains everything needed to reconstruct committed state). Every major relational database is some variation on this theme.
What A Log Record Actually Looks Like
A WAL is a sequence of records appended to a log file. Each record describes one change. The exact format varies between databases, but they all contain roughly the same fields.
In PostgreSQL, a WAL record looks like this in the source code (src/include/access/xlogrecord.h):
typedef struct XLogRecord {
uint32 xl_tot_len; /* total length of record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, then data. */
} XLogRecord;The record starts with its total length, a transaction ID, a pointer to the previous record (so recovery can walk backwards), and a resource manager ID. Resource managers (RmgrId) are Postgres's way of dispatching record handling to the right subsystem: there is an Rmgr for heap updates, one for btree index changes, one for transaction commit/abort, one for GIN/GiST/SP-GiST indexes, one for 2PC prepared transactions, one for standby hot replay, and a few more. Each Rmgr knows how to interpret its own record bodies.
The record body contains block-level headers (which pages this record affects) and either the full new contents of the affected page (for full-page writes, used after checkpoints) or just the delta (for incremental updates). The last field is a CRC32C checksum over the record, used on replay to detect partial writes.
InnoDB's WAL records are called redo log records (mtr_buf_t in the source) and follow a similar structure. Each mini-transaction (MTR) accumulates a set of redo records in memory while executing, and on commit the whole set is appended to the log buffer. Types of records include MLOG_1BYTE for byte-level updates, MLOG_WRITE_STRING for string writes, MLOG_REC_INSERT for row inserts, MLOG_COMP_REC_INSERT_8027 for compressed formats, and dozens of others.
SQLite's journaling mode has two variants. The original one is the rollback journal, which is not quite a WAL but a pre-image log: before modifying a page, copy the original contents to the journal, then modify the page. On crash, recovery copies the journal pages back to undo. The modern WAL mode, added in SQLite 3.7.0 in 2010, is a proper WAL: new page contents are appended to a separate WAL file, and reads merge WAL contents with the base file. SQLite's WAL is simpler than Postgres's because SQLite is a single-writer database, so there is no need for transaction IDs or complex concurrency tracking in the log.
The Log Sequence Number
Every log record has a position in the log, called a Log Sequence Number (LSN) in Postgres and InnoDB, or a frame number in SQLite WAL mode. The LSN is typically implemented as a byte offset from the start of the log, which means LSNs are monotonically increasing and can be compared with ordinary integer arithmetic. In Postgres, an LSN is printed as 0/1A34B5C8: two 32-bit halves separated by a slash, representing the high and low 32 bits of a 64-bit offset.
LSNs are the universal timestamp of the WAL world. Every data page header contains the LSN of the most recent log record that modified it (pd_lsn in Postgres). Every buffer pool entry tracks its dirty-LSN, the earliest LSN of any change not yet flushed. Replication positions are expressed as LSNs. Recovery starts from an LSN (the start of the last checkpoint) and replays forward until either the log ends or a designated recovery target LSN.
The crucial invariant is: a data page must never be written to disk with an LSN higher than the log has been flushed to. If the page's LSN is 1000 but the log on disk only goes up to LSN 900, and the machine crashes, we have a data page on disk that references a change not recorded in the log, and we cannot correctly recover it. To prevent this, the buffer pool manager compares every dirty page's LSN against the on-disk log flush position before writing it out, and blocks the write until the log has caught up. This is the most literal interpretation of "write-ahead": the log always leads the data.
fsync: The Most Important Syscall In Databases
The log is only durable if it is actually on stable storage, not just sitting in the kernel page cache waiting to be written back. The tool for making this happen is fsync(fd), the syscall that asks the kernel to push all dirty pages for a file to the underlying device and wait for the device to acknowledge completion.
fsync has a complicated history. For decades, nearly every database assumed that fsync behaved exactly as advertised: when it returned, the data was on disk. In 2018, a paper called "Protecting Files and Filesystems from fsync()" plus some extensive testing by Postgres developers revealed that Linux's fsync had edge cases where it could return success even after a write-back error. If a dirty page failed to flush (say, a transient storage error), the kernel cleared the dirty bit anyway, and the next fsync on the same file would return success. Data could be lost without any error being visible to the database. This episode, known as "fsyncgate", led to a flurry of changes in both the kernel and in Postgres and other databases, including the PANIC on fsync error approach that Postgres now takes: if fsync ever returns an error, the database crashes immediately rather than trusting the kernel state.
Even when fsync works correctly, there is another layer of lies beneath it. Many SSDs and hard drives have internal write caches that improve performance by deferring actual physical writes. A naive fsync that reaches the drive but returns before the drive has actually written to NAND is not a durable fsync. The fix is the FUA flag (Force Unit Access) or a cache-flush command that the kernel sends after the write. Linux's block layer handles this when the filesystem and drive both support the right commands, but budget SSDs sometimes lie about completion even here, which is why "enterprise-grade SSDs with power-loss protection" are a real category of drive and a real premium paid by database operators.
The upshot is that a correct WAL implementation's commit path looks like this:
- Buffer the log records in memory (the "log buffer").
- When a transaction commits, copy the buffer's contents to the WAL file using
writeorpwrite. - Call
fsync(orfdatasync, which is slightly faster) on the WAL file. - Once fsync returns, tell the client the transaction is committed.
Step 3 is the bottleneck. On a datacentre-grade NVMe with power-loss protection (a Samsung PM1735 in an Amsterdam colo, say), an fsync can take 30 to 100 microseconds. On a consumer NVMe with the cache flushed every time, it can take 1 to 5 milliseconds. On a spinning disk, 5 to 20 milliseconds. Multiply that by your transaction rate to get your maximum single-thread throughput, and you can see why database performance is often explicitly bottlenecked on fsync rate.
Group Commit: Amortising The fsync
One fsync per commit is crushing under any real workload. If you have 1000 concurrent transactions trying to commit and each one does its own fsync, you might as well give up. The fix is group commit: batch multiple commits into a single fsync.
The mechanism is straightforward. When a transaction reaches the commit point, it adds its commit record to the log buffer and then waits on a condition variable. A dedicated log writer thread periodically (or when the buffer fills) takes the accumulated buffer, issues one write and one fsync, and then signals all waiting transactions that their records are durable. Every transaction in the group pays approximately the same latency as if it were the only one committing, but the fsync cost is amortised over however many transactions were waiting.
Postgres implements this with the commit_delay setting (microseconds to wait before flushing, to gather more commits into one batch) and commit_siblings (minimum number of active transactions before waiting). InnoDB does the same with innodb_flush_log_at_trx_commit (where 1 is "fsync on every commit", 2 is "write on every commit but fsync once per second", 0 is "write and fsync once per second", all of which trade durability for throughput). MySQL's binary log adds a second fsync for the binlog, and the whole "two-fsync commit" has been the subject of extensive optimisation in MariaDB and MySQL over the last decade.
Under sustained load, a well-tuned Postgres instance on fast NVMe can sustain tens of thousands of committed transactions per second on a single core, with most of the work being "wait for the log writer to finish the next batch". On extreme hardware (NVMe over RDMA, Optane persistent memory, or a battery-backed NVDIMM), you can push the fsync latency into the single-digit microsecond range and hit hundreds of thousands of commits per second per shard.
Checkpoints: Bounding Recovery Time
The log grows forever, which sounds expensive and frightening. In practice, it does not, because of checkpoints. A checkpoint is a synchronisation point that marks a place in the log where all prior changes have been flushed to the data files. After a checkpoint completes, the log records before that point are no longer needed for recovery: any crash from now on can start recovery at the checkpoint LSN and replay forward.
Postgres's checkpoint process looks like this:
- Freeze the current log position as the checkpoint start LSN.
- Walk the buffer pool, writing every dirty page whose LSN is below the start LSN to the data files. This is done gradually over
checkpoint_timeoutto avoid an IO spike. - Once all such pages are flushed, fsync the data files to ensure the writes are durable.
- Write a
CHECKPOINTrecord to the log containing the start LSN. - Update
pg_control(the small metadata file at the root of the data directory) to record the new checkpoint LSN. - Recycle or remove WAL segments whose records are entirely below the checkpoint start LSN.
The old WAL segments can now be deleted (or archived if continuous archiving is enabled). A Postgres instance can run for weeks without its WAL directory growing, as long as checkpoints are completing on schedule and archive_command is keeping up.
The cost of a checkpoint is the IO burst of flushing dirty pages. On a busy database, the checkpoint can become a pain point: the system must write out all accumulated dirty pages, which causes a sudden spike in write IO that contends with the regular workload. Postgres has a checkpoint_completion_target parameter that spreads checkpoint IO over most of the interval between checkpoints, so that the burst becomes a gentle continuous hum instead of a cliff. The default is 0.9, meaning the checkpointer tries to finish its work by 90% of the way into the next interval.
Checkpoints are also an opportunity to insert full-page writes into the log. After a checkpoint, the first time a given data page is modified, Postgres writes the entire page contents to the WAL instead of just the delta. This protects against torn writes during recovery: even if the crash left a half-written page on disk, the full-page image in the WAL can reconstruct it. Between checkpoints, subsequent changes to that page can be deltas again, because the known-good full page from the start of the interval is available. Full-page writes are a significant portion of WAL volume on update-heavy workloads, and disabling them (full_page_writes = off) is the single biggest performance lever in Postgres if you trust your storage to be torn-write-free, which very few storage systems really are.
Recovery: Reading The Log Backwards And Forwards
When a crashed Postgres instance starts up, it goes through recovery before accepting any connections. The process is deliberate and mostly follows the ARIES recovery algorithm.
Analysis pass. Read the log from the last checkpoint forward, scanning for transaction records. Build an in-memory list of transactions that were in-progress at the crash, and the LSN of the earliest dirty page referenced by the log (the "redo LSN"). This identifies which transactions need to be rolled forward and which need to be rolled back.
Redo pass. Start from the redo LSN and replay every log record forward. For each record, check whether the affected page is already at or past this LSN (in which case the change is already reflected in the on-disk version) or behind (in which case apply the change). At the end of the redo pass, every data page reflects all the committed-or-not changes in the log up to the end of the log.
Undo pass. Walk the in-progress transaction list in reverse and roll back every change made by those transactions. In ARIES this is done by reading each transaction's log records in reverse (using the xl_prev pointers) and writing compensation log records (CLRs) that undo the effects. In Postgres, there is no explicit undo pass for MVCC reasons: uncommitted row versions are simply not visible to readers because their xmin is an aborted transaction, and they get cleaned up later by VACUUM.
SQLite's recovery is much simpler because it is single-writer. On open, SQLite checks whether the WAL file exists and is non-empty. If so, it replays the WAL frames into the main database file (a "checkpoint" in SQLite terminology, which is a slightly different use of the word from Postgres's) and then truncates the WAL. If the process crashed mid-checkpoint, the next open re-runs the checkpoint from the current WAL state.
InnoDB's recovery is closer to ARIES. It reads the last checkpoint LSN from the system tablespace header, walks forward through the redo log applying each record, and then reads the undo tablespace to roll back any transactions that were in-progress at the crash. The time to recover is bounded by innodb_log_file_size times some function of how dirty the buffer pool was at crash time; operators tune the log file size to trade longer crash recovery against better steady-state throughput.
The recovery step is the most security-sensitive and test-intensive part of a database's code. Every edge case matters: partial log records at the end of the log (which are treated as corruption boundaries and ignored), CRC failures (which are treated as torn tail bytes and also ignored), missing log segments (which are an error because the log is non-contiguous), all need to be handled correctly. A single wrong decision in recovery can lead to silent data loss. This is why databases test recovery exhaustively with fault-injection frameworks like Jepsen and internal chaos engineering.
Replication: The WAL Travels Over The Network
Once you have a WAL, you have almost all the machinery needed for physical replication. The WAL is a complete, byte-exact description of every change in the database. If you ship it to another machine and replay it there, you get a byte-exact copy of the database. That is precisely how Postgres streaming replication, InnoDB's "replica" mode, and MySQL's semi-synchronous replication all work.
In Postgres, the primary has a background process called the walsender that reads the WAL as it is produced and streams it to a walreceiver on each standby. The standby writes the received records to its own WAL, calls fsync, and replays them against its own data files. Replication can be asynchronous (the primary does not wait for the standby before acknowledging commits) or synchronous (the primary waits for at least one standby to persist the commit record before acknowledging).
Synchronous replication turns the durability story from "on stable storage on one machine" into "on stable storage on at least two machines". A primary in a Frankfurt data centre can replicate to a standby in Stockholm, and a commit acknowledged by the primary is guaranteed to survive the loss of either site. The cost is latency: every commit waits for a round trip to the standby, which adds milliseconds for anything beyond the same rack.
Postgres also supports logical replication, where the WAL is decoded into a stream of high-level change events (inserts, updates, deletes) that can be applied to a different schema, a different version, or a different database engine entirely. Logical decoding reads the same WAL but runs output plugins that translate raw record bodies into logical events. This powers change-data-capture systems, zero-downtime upgrades, and the many Kafka connectors that bridge Postgres and downstream pipelines.
WAL-Only Storage Engines: RocksDB And LSM Trees
RocksDB is not a relational database but it uses WAL principles aggressively. A RocksDB instance is a log-structured merge-tree: all writes go first to a memtable (an in-memory sorted structure) and to a WAL on disk. When the memtable reaches a size threshold, it is flushed as an immutable SSTable file, and the WAL for that memtable is discarded. The SSTable files accumulate and are periodically compacted into larger, higher-level files by background threads.
The WAL in RocksDB is the same idea as in Postgres: each write is logged sequentially before the memtable is updated, so that on crash the memtable can be rebuilt by replaying the WAL. RocksDB's commit options are similar too: sync writes fsync every batch, async does not. RocksDB also supports two-phase commit across multiple column families for transactional consistency.
LMDB, the Lightning Memory-Mapped Database, is a different beast: it is a read-optimised B-tree database with no WAL at all. Instead, LMDB uses a copy-on-write technique where every write creates a new version of the affected pages, and a single-page header update at the root swaps the root to the new version. The root update is atomic because a single 4 KiB page write is atomic on modern storage. On crash, LMDB simply opens the old root and the database is consistent. LMDB pays for this simplicity with fragmentation (old pages are never overwritten until they are reclaimed) and with a single-writer model (readers are wait-free and lockless, but only one writer can be active at a time).
Both designs have their place. WAL-based systems give you high write throughput with complex concurrency. COW systems give you beautifully simple recovery at the cost of space amplification. The market has settled on WAL for OLTP (Postgres, InnoDB, Oracle, SQL Server), LSM + WAL for ingest-heavy stores (RocksDB, Cassandra, ScyllaDB), and COW for embedded and read-dominated workloads (LMDB, BoltDB, Firefox's Places database).
Things That Can Still Go Wrong
Write-ahead logging is a very good defence, but it does not protect against every failure. It is worth listing the classes of bug that defeat WAL, because the real failure modes of real databases are concentrated in this list.
Bugs in the commit protocol. If a database declares a transaction committed before the log record is actually fsynced, a crash between the ack and the fsync loses the commit. This has happened in real databases, usually during optimisation rewrites that got the ordering wrong.
fsync returning success when it should not. The fsyncgate discovery showed that Linux and other kernels had subtle cases where fsync lied about durability. The aftermath produced a body of work on "what should fsync actually mean" and tighter error-handling contracts.
Storage that ignores cache flushes. Some consumer SSDs and even some enterprise drives have been found to ignore cache-flush commands under certain conditions. Databases running on such hardware can lose committed transactions even when the database and the kernel both did their jobs correctly. Power-loss-protected drives (enterprise NVMe with supercaps) are the proper defence.
Byte-level corruption from cosmic rays and cables. Rare but real: a bit flip somewhere in memory, in the bus, or in the drive can produce a log record that passes CRC and parses but contains wrong data. ECC memory, ECC-aware drives, and per-record checksums all narrow the window.
Incomplete replication acknowledgement. A replica that claims it has persisted a commit but has not yet fsynced can lose data if it crashes after the primary moves on. This is a known class of bug in several replication implementations and is addressed by ensuring replicas fsync before acknowledging.
Human error. Accidentally running rm -rf pg_wal is, empirically, the most common cause of data loss in Postgres installations. Backups, WAL archiving, and mandatory code review of shell scripts are the defences. No amount of WAL design protects against an operator who deletes the log.
Point-In-Time Recovery
Continuous archiving of the WAL unlocks one of Postgres's most underappreciated features: point-in-time recovery (PITR). If you have a base backup (from pg_basebackup or pgBackRest or Barman) plus every WAL segment produced since that backup, you can restore the database to any instant between the backup and the present.
The mechanism is simple in outline, subtle in practice. Restore the base backup into an empty data directory. Create a recovery.signal file (or set recovery_target_time = '2026-04-06 14:35:00 CEST' in postgresql.conf in modern versions). Start Postgres. Recovery replays the WAL from the base backup's starting LSN forward. When it reaches the target time, it pauses recovery, promotes the instance to read-write, and you have a fully consistent database as it existed at 14:35.
This turns the WAL from a recovery tool into a time machine. "We accidentally ran a DELETE without a WHERE clause at 14:42" becomes a recoverable situation: restore to 14:41, dump the affected table, restore the production database to current, reinsert the dumped rows. Many operations teams in European banks have saved themselves this exact way. The archive command (usually rsync or an S3-compatible upload) must have kept up with WAL production, which requires monitoring, but the operational payoff is enormous.
Logical backup tools like pg_dump do not give you this. A pg_dump taken at 02:00 can only restore to 02:00. PITR is the only mechanism that lets you pick an arbitrary moment, and it is entirely built on the idea that the WAL is a complete history of the database that can be replayed against any consistent snapshot.
Two-Phase Commit And Distributed WAL
Single-node durability is hard. Distributed durability across multiple nodes is hard in different ways, and it leans even more heavily on the WAL.
The classical algorithm is two-phase commit (2PC). A coordinator asks every participant "can you commit?" Each participant prepares the transaction, writes a PREPARE record to its WAL, fsyncs it, and replies "yes". When all yes replies are in, the coordinator writes a commit decision to its own log, fsyncs it, and tells everyone to commit. Each participant writes a COMMIT record, fsyncs it, and applies the change visibly.
The critical property is that the PREPARE record is durable. If the participant crashes between the prepare and the commit decision, recovery will find the PREPARE record and leave the transaction in the "prepared, waiting for coordinator" state, asking on startup for the decision. The coordinator's log is the source of truth on whether to commit or abort. As long as the coordinator survives, every participant can eventually learn the outcome. If the coordinator dies mid-decision, the protocol blocks until it recovers. This is the famous "blocking" property of 2PC and why it is used cautiously: a lost coordinator can leave prepared transactions holding locks forever.
Postgres supports 2PC via the PREPARE TRANSACTION statement, which writes a PREPARE record to the WAL and leaves the transaction in a prepared state visible in pg_prepared_xacts. XA transaction managers like Atomikos, Narayana (Red Hat), and the JTA implementations in Java application servers use this to coordinate transactions across multiple databases.
Modern distributed databases have largely moved past 2PC for the common case, using consensus protocols like Raft and Paxos to replicate the WAL directly. In CockroachDB, every range has a Raft group, and every write to the range goes through Raft: the leader proposes a log entry, followers durably append it to their own logs, and once a quorum has persisted the entry, it is "committed" and can be applied. The WAL and the Raft log are effectively the same thing. YugabyteDB and TiDB work similarly. Spanner uses Paxos for the same role.
The conceptual step from single-node WAL to Raft log is surprisingly small: instead of fsync making a record durable on one machine, quorum write makes it durable across machines. The rest of the recovery story (replay on restart, checkpoints, snapshot installation for slow followers) all carries over. A Raft group with fast disks and a fast network behaves much like a very durable single-node database.
Hot Standby And Read Scaling
A WAL replica does not have to be passive. Postgres's hot standby feature lets replicas serve read-only queries while simultaneously applying the WAL stream from the primary. This turns the replica from a pure backup into a read scaling tool: route reads to the standby, writes to the primary, and let the WAL keep them in sync.
The implementation has a tension. The primary is constantly producing changes that the replica must apply. Long-running read queries on the replica want the data pages to stay still. If the WAL apply tries to drop a snapshot row that an open transaction on the replica is still reading, you get a conflict. Postgres handles this by letting the replica either delay WAL apply (max_standby_streaming_delay) or cancel the reading query (ERROR: canceling statement due to conflict with recovery). The trade-off is either slower replica lag or occasional query cancellations.
For analytics workloads in a Berlin fintech that runs its reporting on a standby, both sides of this trade-off matter. Long reports want to run to completion, which argues for larger delays. Low replica lag is important for features that read the standby right after writing to the primary, which argues for smaller delays. Operators tune per-workload.
The WAL is also what makes cascading replication work: a standby can itself stream WAL to a downstream standby, forming a tree rather than a star. This reduces the network load on the primary when many standbys are needed (for example, a read-heavy application with a dozen region-local replicas). The WAL in this case flows through intermediate hops, each of which is persisting and re-streaming it.
Backup Consistency And The Concept Of A Frozen Snapshot
A backup of a database taken by copying its data files while the database is running is almost certainly inconsistent. The file you copy first is from time T, the file you copy last is from time T+30 seconds, and the two together do not represent any single moment. If you restore them and start the database, recovery will be confused.
The WAL is what makes online backup consistent. The standard pattern, called base backup, is:
- Call
pg_backup_start('my-backup'). This writes a special checkpoint record and returns the LSN of the backup's starting point. - Copy the data files in any order, at any speed. They will be inconsistent with each other, but that is fine.
- Call
pg_backup_stop(). This returns the LSN of the backup's ending point. - Archive the WAL segments from the start LSN through the end LSN alongside the data files.
On restore, Postgres recognises a base backup from the metadata, starts recovery at the start LSN, replays WAL forward through the end LSN, and at that point the restored data files are consistent with the end-of-backup state. Recovery can then continue further if additional WAL is available, which is how PITR stacks on top of base backup.
The beautiful thing about this is that step 2 is allowed to be arbitrary. You can use filesystem snapshots, rsync, tar, ZFS send/receive, or even a Btrfs snapshot. As long as the files copied are consistent with themselves (not partial reads mid-write, which filesystem snapshots guarantee) and the WAL archive is complete, recovery will straighten everything out. This is how pgBackRest, Barman, and WAL-G all work under the hood, plus a lot of smart parallelism and compression.
Practical Knobs Worth Knowing
If you operate a Postgres database, these are the WAL-related settings most likely to affect your life.
wal_level: minimal, replica, or logical. Controls how much is logged.replicais the default and is needed for streaming replication.logicaladds extra info for logical decoding.synchronous_commit: on, off, local, or remote_apply. Tradeoffs on whether commits wait for local fsync, remote fsync, or not at all.wal_buffers: size of the in-memory log buffer. Default is 1/32 of shared_buffers up to 16 MiB, which is usually fine.max_wal_sizeandmin_wal_size: target and minimum total size of the WAL directory. Checkpoints are triggered when the log grows pastmax_wal_size.checkpoint_timeout: maximum time between checkpoints. Default 5 minutes.full_page_writes: almost certainly leave this on. Off is for storage that is absolutely guaranteed torn-write-free.archive_modeandarchive_command: continuous archiving of WAL segments to another location, typically for point-in-time recovery.wal_compression: compresses full-page images in the WAL. Usually a win on CPU-rich machines with bandwidth-constrained storage.
For InnoDB the analogous settings are innodb_log_file_size, innodb_flush_log_at_trx_commit, innodb_flush_method, and sync_binlog. For SQLite, it is PRAGMA journal_mode = WAL plus PRAGMA synchronous = NORMAL (the reasonable default for most applications) or FULL (for maximum safety).
Watching A Real WAL Record
Nothing makes this concrete like looking at real bytes. Postgres ships a tool called pg_waldump that decodes WAL segments into human-readable form. On a running instance:
# Find the current WAL position
psql -c "SELECT pg_current_wal_lsn();"
# pg_current_wal_lsn
# 0/1A34B5C8
# Find the file that LSN lives in
psql -c "SELECT pg_walfile_name('0/1A34B5C8');"
# 000000010000000000000001
# Dump it
pg_waldump $PGDATA/pg_wal/000000010000000000000001 | headThe output looks like this:
rmgr: Heap len (rec/tot): 54/ 160, tx: 523, lsn: 0/01A34000,
prev 0/019FFFF8, desc: INSERT off 23 flags 0x00, blkref #0: rel 1663/5/24576 blk 4 FPW
rmgr: Btree len (rec/tot): 72/ 168, tx: 523, lsn: 0/01A340A0,
prev 0/01A34000, desc: INSERT_LEAF off 9, blkref #0: rel 1663/5/24577 blk 2
rmgr: Transaction len (rec/tot): 34/ 34, tx: 523, lsn: 0/01A34148,
prev 0/01A340A0, desc: COMMIT 2026-04-06 14:35:02.123456 CESTEvery line is one log record. The first one is a heap insert (inserting a row) with a full-page image (FPW) for block 4 of relation 24576 because this was the first touch of that page after a checkpoint. The second is a btree insert recording the corresponding index entry at leaf block 2 of relation 24577. The third is the transaction commit record that makes the insert durable and visible. Three records, one transaction, one fsync.
If you imagine hundreds of thousands of records per second flowing through a WAL on a busy database, each one a tiny structured packet describing exactly what changed, each one tagged with an LSN and a transaction ID, you have the right picture of what a database is doing under the hood. The data files are almost a side effect of the log, not the other way around.
How Much WAL Is "A Lot"
A rule of thumb for typical Postgres: WAL volume is roughly 2x the size of the logical changes made, after full-page writes, with spikes right after each checkpoint (when the first touch of each page produces a full-page image). A database processing 100 MiB of inserts and updates per hour typically produces 200-400 MiB of WAL per hour.
On a production system in Helsinki or Rotterdam running modest OLTP, WAL rates of 10-50 GiB per day are normal. On heavy update workloads, 100 GiB per day is not unusual. On a high-volume event store built on Postgres (there are more of these than people expect), 500 GiB per day is reachable. Every gigabyte of WAL needs to be streamed to replicas, archived to backup storage, replayed on standbys, and eventually recycled. Operations teams watch WAL rate alongside CPU and IO because it is a leading indicator of write load.
The archive step in particular can become a bottleneck. If your archive_command uploads WAL to S3, and your upload takes longer than you produce new WAL, your primary's pg_wal directory starts filling up. If it fills completely, the primary stops accepting writes. A well-tuned system either uses parallel archival (multiple concurrent uploads), a dedicated fast archive tier (NVMe staging with background sync to S3), or a smaller retention window at the primary and reliance on the archive for any WAL more than an hour old.
Why WAL Survives Where Other Ideas Did Not
WAL is old, and databases have tried other durability strategies over the years. Shadow paging, which keeps an old tree root around so any write creates a new page and atomic root swaps commit all changes, lost popularity because it fragmented storage and made writes expensive. Pure physical logging of every byte change scaled poorly as record sizes grew. Logical-only logging (just record the SQL, no physical details) required deterministic replay which was hard to guarantee across versions. Various "log-structured" full-database designs put everything in one giant log but ran into read amplification.
The WAL approach in the form popularised by ARIES split the difference in exactly the right place. The log is physical enough to handle crash recovery without needing the application layer involved, but small enough that writes are cheap. The data files are separate, so reads go directly to them without scanning the log. Checkpoints bound the recovery cost. Full-page writes handle torn pages. Compensation records (or MVCC alternatives) handle rollback. Every piece has a reason.
This is why, thirty years after ARIES, the basic architecture is unchanged in all major OLTP databases. The details differ: Postgres uses heap tables with separate indexes, InnoDB uses clustered index tables, SQLite is single-writer, Oracle uses undo tablespaces, SQL Server uses a log service separate from the storage engine. But the commit protocol, the log format, the recovery algorithm, and the checkpoint logic all descend from the same intellectual tradition, and if you understand how Postgres's WAL works you can usually decode the other databases' logs after about twenty minutes of reading their documentation.
The Mental Model Worth Keeping
A database's WAL is a single append-only narrative of the database's history. It is the source of truth. The data files are a materialised snapshot of some point along that narrative, always lagging the log, eventually catching up. On commit, the log is flushed; that is what makes the transaction durable. On crash, the log is replayed; that is what makes recovery work. Between commits, the log is streamed to replicas; that is what makes high availability possible. Between checkpoints, portions of the log are recycled; that is what keeps the log bounded.
Almost every question you can ask about database durability has an answer in terms of the WAL. "What happens if the machine crashes between the commit ack and the fsync?" The log was already flushed before the ack, so the transaction survives. "What happens if a data page write is torn?" Full-page writes after the most recent checkpoint reconstruct it from the log. "What happens if a replica lags?" The primary keeps more of the log around until the replica catches up. "What happens if the filesystem fails an fsync?" The database should PANIC rather than trust the kernel state. "What happens if the log fills up?" A background checkpointer is supposed to keep pace; if it cannot, the database stops accepting writes until space frees up.
The log is also the mental tool for debugging. Every major database ships tools to dump and inspect WAL records: pg_waldump for Postgres, mysqlbinlog for MySQL's binlog, innochecksum and custom tools for the InnoDB redo log, sqlite3_analyzer for SQLite's WAL file. When something mysterious happens, reading the log is often the quickest way to understand what actually changed and in what order.
Write-ahead logging is the least glamorous and most important idea in database systems. It is what turns a collection of pages on a disk into a durable transactional store. It is why you can kick the cable out of a server in a Warsaw data centre and come back to an intact database. Every time you type COMMIT, the sequence of events underneath it is the same: log record written, log buffer flushed, fsync called, kernel sends the write to the drive, drive acknowledges, fsync returns, commit acknowledged to the client. Each link in that chain is a small, testable, understandable piece of engineering. Together they deliver the promise that the "D" in ACID was asking for.
That promise is worth understanding in detail, because when you do, the rest of database internals starts to fall into place. Transactions, isolation levels, replication, point-in-time recovery, hot standbys, streaming changes: they are all built on top of the log. Learn the log and you have learnt a large fraction of what makes a database a database.
There is one last thing worth saying. The WAL is the place where a database's promises intersect with the messy reality of hardware. Every other piece of a database (query optimiser, planner, indexes, parser) lives in a clean world of algorithms and data structures. The WAL lives in the dirty world of spinning rust, NAND flash, kernel schedulers, PCIe queues, CRC mismatches, and the actual physical speed of light. Every line of code in a WAL implementation carries the weight of that reality, and every bug fixed in that code is someone having learned, usually the hard way, that some assumption about hardware was wrong.
That is why the WAL is treated with such care by the people who maintain major databases. It is the one place where a small mistake can silently destroy user data. It is also the one place where getting it right buys you a remarkable property: a system that survives almost any physical calamity you can throw at it. Hold that thought next time you see a commit succeed in under a millisecond. Behind that acknowledgement is thirty years of engineering, a careful sequence of syscalls, a drive's firmware fulfilling its side of the contract, and the deep conviction that if the lights go out right now, everything will still be there tomorrow.
When you build something on top of a database, you inherit that conviction without ever thinking about it. That is the whole point of the abstraction. But the abstraction is held up by thousands of engineers who do think about it, constantly, and who test it against Jepsen reports and power-loss rigs and random kernel panics and flaky SSDs, and who care very much that the next log record written is the next log record recovered. That caretaker culture is why databases are, on the whole, astonishingly reliable for how complicated they are. Every time a transaction commits and the client carries on with its day, a piece of that culture has done its job. And it started, more or less, with the decision to write the log before the data.