How Copy-on-Write Actually Works

Copy-on-write is one of those phrases that appears everywhere and means just enough different things to confuse even experienced engineers. In a Linux process it means two page tables temporarily point at the same physical page until one side writes. In Docker it usually means a lower immutable layer is shared until a container needs a private upper-layer copy. In Btrfs and ZFS it means old blocks stay where they are while new writes allocate fresh blocks and metadata points to the new version. In a database discussion it often gets mixed together with MVCC, where several transaction versions of a row can coexist even though the storage engine may not literally duplicate entire pages on first write.

The family resemblance is real. In every case the system avoids making an expensive copy before it is sure the copy is needed. Shared state remains shared while reads dominate. The first write is the expensive moment. That economy is powerful because many workloads contain far more reads than writes, and many objects that might have been copied speculatively never end up diverging at all.

The phrase hides a lot of machinery, though. Linux page-table copy-on-write lives in the MMU, the page-fault handler, and the kernel's physical-page reference counts. Overlay filesystems use directory trees, whiteouts, and copy-up rules. Snapshotting filesystems use transaction groups or trees of block pointers. Database concurrency control usually lives in tuple headers, undo records, version chains, and the write-ahead log. These systems rhyme, but they do not work the same way.

This article starts with the canonical case, fork() on Linux, because it is the clearest place to see true memory copy-on-write. From there we can widen the view to container layers, snapshotting filesystems, and databases. The goal is not only to define the term, but to show exactly where the write path changes, which metadata structures carry the burden, and why this lazy duplication trick shows up in so many performance-critical systems.

The Core Idea Is Deferred Duplication

Suppose a process has 2 GiB of anonymous memory. It calls fork(). The child begins life as a near-exact copy of the parent. If the kernel literally copied all 2 GiB at the instant of fork(), process creation would be painfully expensive. It would burn memory bandwidth, increase latency, and often copy pages that the child will throw away moments later with execve().

Copy-on-write changes the contract. Instead of copying data pages immediately, the kernel duplicates the page tables and leaves the physical pages shared. Both parent and child point at the same frames. The shared mappings are marked read-only. If either process later writes to one of those pages, the CPU raises a protection fault. The kernel handles the fault by allocating a fresh physical page, copying the old contents into it, updating the faulting process's page table entry to point at the new page with write permission, and resuming execution. Only the page that was actually written gets copied.

That gives the system three important wins.

First, fork() becomes much cheaper. Duplicating page tables is not free, but it is dramatically cheaper than copying every mapped page. On a modern x86_64 Linux system, copying the page tables and bumping page reference counts is usually fast enough that fork()+exec() remains a practical primitive even in large processes.

Second, execve() becomes economical. Many child processes call execve() immediately after fork(). The child replaces its whole address space with a new program image, so copying the old address space would have been wasted effort. Deferred duplication avoids that waste almost entirely.

Third, memory footprint stays compact while parent and child mostly read. If the child is doing setup work, changing file descriptors, and then launching a new image, only a handful of bookkeeping pages may need private copies. The rest remain shared until the old mappings disappear.

The same logic appears outside process memory. Snapshotting a filesystem at noon does not copy 10 TiB of blocks. It creates new metadata roots pointing at the same blocks and waits. A write at 12:03 allocates a new block only for the part that changed. Container image layers share lower content across many containers and only copy a file into the writable layer when modification actually happens. Deferred duplication is the whole point.

What fork() Really Copies

To understand Linux copy-on-write properly, it helps to be explicit about what is and is not duplicated during fork().

The kernel duplicates the process descriptor (task_struct), scheduling state, file descriptor table references, credentials, signal-handling structures, and the memory descriptor (mm_struct). It also duplicates the hierarchy of virtual memory areas or VMAs, which describe ranges such as the executable text segment, the heap, the stack, shared libraries, and anonymous mmap() regions.

The kernel does not immediately duplicate the anonymous data pages those VMAs describe. Instead, fork() creates a second set of page tables that map the same physical frames. Both sides see the same bytes. Physical page reference counts are incremented so the kernel knows the frames are shared.

On Linux the fork() path eventually reaches copy_mm() and then dup_mmap(), which replicates the VMA metadata. The page-table duplication path walks the parent's tables and installs corresponding entries in the child. For writable private mappings, Linux converts the entries in both parent and child to read-only and marks them as copy-on-write candidates. The "writable in the abstract, read-only until fault" distinction matters. The VMA still allows writing, but the page-table entries are deliberately downgraded to force a fault on the first write.

This is why tools such as /proc/<pid>/smaps show memory categories like Shared_Clean, Shared_Dirty, Private_Clean, and Private_Dirty. Right after fork(), many anonymous pages that were previously private become effectively shared until one side diverges.

The kernel also has to deal with page-table pages themselves. On x86_64 with four-level paging, a large process may have many megabytes of PTE pages. These structures are duplicated during fork(), because parent and child need independent page tables from the start. That duplication is a real cost. It is the reason fork() is not free even when data pages are shared, and it is one reason huge address spaces with sparse mappings can still make process creation slower than people expect.

The Hardware Side, Page Tables, Permissions, And The Fault

Copy-on-write only works because the CPU's memory-management hardware cooperates. The kernel needs a way to notice the first write without instrumenting every store instruction. The page table gives it that hook.

Take a normal anonymous page that belonged only to the parent. Before fork(), the leaf page-table entry might be:

virtual page 0x7f3c45a72000 -> physical frame 0x18a42
flags: present, user, writable, accessed, dirty

After fork(), Linux changes both parent and child entries:

parent virtual page 0x7f3c45a72000 -> physical frame 0x18a42
child  virtual page 0x7f3c45a72000 -> physical frame 0x18a42
flags on both: present, user, read-only, accessed
page refcount on frame 0x18a42: 2
mapping semantics: writable private VMA, copy-on-write on fault

Now suppose the child executes:

buf[0] = 'X';

The store hits a present mapping, but the page-table entry does not permit writing. The MMU raises a page-fault exception, with an error code that says this was a protection fault on a write access from user mode. The CPU records the faulting virtual address in a register such as CR2 on x86_64 and traps into the kernel.

Linux's page-fault handler examines the fault:

  1. Was the address inside a valid VMA?
  2. Was the VMA logically writable?
  3. Is this a private mapping that should trigger copy-on-write rather than a genuine protection error?
  4. Is the underlying page shared, pinned, swapped out, huge, or otherwise special?

If the answer is "this is the first write to a CoW page", the kernel allocates a fresh physical page, copies 4 KiB of data from the old frame, updates the child's page-table entry to point at the new frame with write permission, decrements the old page's refcount, and resumes the instruction. The write then retries and succeeds.

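The parent still points at the original frame, and its entry is still read-only. From this point onward the pages have diverged. The parent's next write to that page also faults, but the kernel can see that the frame is no longer shared, so it restores write permission instead of copying.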

That single fault path is the essence of memory copy-on-write.

A Concrete fork() Example On Linux

It is easier to see this with a tiny program:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
 
int main(void) {
    size_t len = 4096 * 4;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;
 
    memset(buf, 'A', len);
    pid_t pid = fork();
    if (pid < 0) return 2;
 
    if (pid == 0) {
        buf[0] = 'X';
        buf[4096] = 'Y';
        printf("child:  %c %c\n", buf[0], buf[4096]);
        _exit(0);
    }
 
    waitpid(pid, NULL, 0);
    printf("parent: %c %c\n", buf[0], buf[4096]);
    return 0;
}

The parent initialises four pages to 'A'. The child writes into two of them. The output will be:

child:  X Y
parent: A A

From user space it looks like the child received a separate copy at fork() time. Under the hood only the two pages the child wrote had to be duplicated.

If you want to watch this effect on Linux, the easiest observable signals are:

./cow-demo &
grep -nE "Private_Dirty|Shared_Dirty|Rss|Pss" /proc/$PID/smaps
perf stat -e minor-faults,major-faults ./cow-demo

The child writes trigger minor faults, not major faults, because the source pages are already in RAM. No disk IO is needed. The kernel only needs to allocate and copy private pages.

You can also inspect aggregate counters:

vmstat 1
grep -nE "pgfault|pgmajfault|cow_ksm" /proc/vmstat

pgfault rises rapidly on active systems for all kinds of reasons, not just CoW. The point is that copy-on-write is visible as page-fault activity because the kernel uses the fault path to lazily materialise private copies.

Refcounts, Reverse Mapping, And Why The Kernel Knows It Can Share

The kernel cannot safely share physical pages unless it can track how many mappings refer to them and who those mappings belong to. Linux uses several structures for this.

Each physical frame has a struct page or folio metadata object with a reference count. If two processes share the frame after fork(), the refcount goes up. When one mapping goes away or gets replaced by a private copy, the refcount drops. Once the last reference disappears, the page can be reclaimed.

For anonymous memory Linux also uses reverse mapping, often called rmap. Instead of only knowing "this virtual address points to this physical page", the kernel also needs a way to find the VMAs and PTEs that point back to a page. This matters for reclaim, migration, NUMA balancing, KSM deduplication, and copy-on-write corner cases. Rmap lets the kernel locate all mappings that refer to a given page when it needs to adjust permissions, migrate the page, or break sharing.

This metadata is what distinguishes a carefully engineered CoW system from a hand-wavy explanation. Shared pages are not magic. They are ordinary physical pages with extra bookkeeping:

  • a higher refcount
  • page-table entries in more than one address space
  • reverse mappings to find those PTEs later
  • permission bits set to trigger a write fault rather than silent mutation

Without those pieces the kernel could not safely defer duplication.

CoW And execve(), The Reason fork()+exec() Is Still Practical

The canonical justification for memory copy-on-write is fork()+execve(). A shell, web server, or supervisor creates a child with fork(), the child changes a few file descriptors or environment variables, and then calls execve() to replace its address space with a different program.

In that sequence the child often writes almost nothing to the inherited anonymous memory. Most of the parent's heap, stack pages, and mapped library pages are never privately copied. They remain shared for a brief interval and then disappear when execve() installs the new image. If Linux had to duplicate the full parent address space eagerly on every fork(), process launch would be much more expensive, especially for large processes such as language runtimes, browsers, and JVM-based tools.

This matters in ordinary operational work. A large Python service in Frankfurt that uses subprocess.Popen() to launch helper processes benefits from CoW even if nobody thinks about it. The child gets a new process image cheaply because the cost of the parent's large heap is mostly deferred and often never paid.

It also explains why writing to the heap between fork() and execve() is risky in multi-threaded programs. Only the thread that called fork() survives in the child, while library locks and heap state may still reflect the pre-fork multi-threaded world, so POSIX restricts the child to async-signal-safe functions until it calls execve(). This is not a copy-on-write issue by itself. Still, the habit that keeps fork()+exec() cheap is the same one that keeps it safe: do almost nothing in the child before execve().

File-Backed Mappings, Shared Libraries, And Where CoW Stops Being Anonymous

Not all Linux CoW pages are anonymous heap or stack pages. File-backed mappings can also use copy-on-write semantics. If a process maps a file with MAP_PRIVATE, the file's contents are initially shared as read-only cached pages. Reads come from the page cache. If the process writes to the mapping, the kernel allocates a private anonymous page for the modified data and the file itself remains unchanged.

This is the memory-mapped analogue of "private but lazily copied". The original file-backed page is shared. The first write breaks away into a private page visible only to the writing process.

Shared libraries rely on related sharing logic. The executable text of libc.so, OpenSSL, or a browser engine is mapped read-only and executable from the same underlying file into many processes. Those text pages are naturally shareable because nobody should be writing to them. Relocations and mutable data sections complicate the story, though. Historically the dynamic linker had to apply relocations that dirtied certain pages in each process, turning them private. Techniques such as RELRO and better relocation models reduce some of that cost, but the general point remains: parts of a mapped file can remain widely shared, while writable private parts diverge through CoW.

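This is one reason "RSS" often exaggerates real physical memory pressure while "PSS" is more honest. RSS counts every resident page a process maps, shared or not. PSS, the proportional set size, divides each shared page by the number of processes mapping it. If a shared library contributes 20 MiB of resident mappings to 100 processes, each process reports that region in RSS, but the machine does not need 2 GiB of extra RAM for it. The actual physical pages are shared, and each process's PSS share of them is about 0.2 MiB.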

Transparent Huge Pages Make CoW More Expensive Per Fault

Ordinary Linux pages are 4 KiB. Transparent Huge Pages, or THP, promote some anonymous regions to 2 MiB pages to reduce TLB pressure and page-table overhead. This is excellent for sequential scans, large heaps, and some analytics workloads. It changes the economics of copy-on-write, though.

If a parent and child share a 2 MiB huge page after fork(), the first write by one side can no longer be resolved with a cheap 4 KiB copy unless the kernel splits the huge page first. Linux often has to split the THP into ordinary 4 KiB pages before handling the write fault or otherwise duplicate a much larger region than the user actually touched.

That means copy-on-write faults on THP-backed memory can be:

  • slower
  • more bursty
  • more likely to fragment memory
  • more visible in latency-sensitive programs

This matters for pre-forked servers and large runtimes. A parent process that builds a large in-memory cache, then forks workers, may appear to share memory beautifully until background activity dirties enough pages to force CoW splits and copies. Memory usage then rises faster than expected.

You can see THP state on Linux with:

grep -nE "AnonHugePages|THPeligible" /proc/$PID/smaps
cat /sys/kernel/mm/transparent_hugepage/enabled

For some workloads THP is a clear win. For others, especially those that rely on sharing pages across fork() boundaries, it can increase CoW cost enough that operators disable it or use madvise() selectively.

KSM, Deduplication, And Copy-on-Write In Reverse

Linux also has a feature called Kernel Samepage Merging, or KSM. It scans anonymous pages in regions that have opted in, typically through madvise(MADV_MERGEABLE), looking for identical contents across processes. When it finds duplicates, it merges them into one shared read-only page and marks the mappings copy-on-write. The next write by any process breaks the sharing.

This is almost the mirror image of fork(). Instead of starting shared and diverging later, processes start separate and KSM discovers they can be shared after the fact.

KSM is useful in environments such as virtualisation, where many guests may hold identical zero-filled or common library pages. It trades CPU time for lower memory usage. The mechanism is still copy-on-write, though:

  • find identical pages
  • keep one physical copy
  • remap all participants to it read-only
  • break away on first write

The pattern matters because it shows CoW is not tied to process creation. It is a general strategy for safe sharing plus deferred divergence.

When Linux Copy-on-Write Hurts, Fork Storms, Dirty Heaps, And Pinned Pages

Copy-on-write is not automatically good. It is a powerful optimisation when the system expects many pages to remain shared. It becomes expensive when reality diverges.

One common pathology is a dirty post-fork heap. A large process forks workers and then both parent and children continue mutating many pages in what used to be the shared heap. The result is a storm of CoW faults and a large burst of memory copying. Databases, language runtimes, and prefork web servers have all run into this in different forms.

Another issue is pinned pages. Some kernel subsystems, such as RDMA and direct I/O, pin user pages through interfaces like get_user_pages() so they cannot be migrated or reclaimed easily. Copy-on-write around pinned pages becomes more complicated because the kernel must preserve guarantees to the pin holder while still giving the writer a coherent private copy.

A third issue is plain page-table duplication overhead. Even if most data pages stay shared, a process with an enormous sparse address space still has to duplicate the page tables on fork(). Programs with tens of thousands of VMAs or vast arenas can make fork() slow enough that libraries start preferring posix_spawn(), which glibc implements on Linux with a vfork-style clone() so the child briefly borrows the parent's address space instead of duplicating its page tables.

There is no contradiction here. Copy-on-write is an optimisation for specific patterns, not a universal free lunch.

Docker And Overlay Filesystems Use Related Economics, But The Mechanism Is Different

People often say "Docker uses copy-on-write" as if that were the same thing as Linux fork(). The economics are similar, but the machinery is different.

A container image is usually built from a stack of read-only layers. A running container adds a writable upper layer on top. With OverlayFS, the kernel presents a merged view:

  • lowerdir: image layers, read-only
  • upperdir: container's private writable changes
  • workdir: scratch space for the overlay implementation
  • merged: what the process inside the container sees

If a process inside the container reads /usr/bin/python, the file may come directly from a lower image layer shared by many containers. No duplication is needed. If the container writes a file that already exists in a lower layer, OverlayFS performs a copy-up. It copies the file or relevant metadata from the lower read-only layer into the upper writable layer and applies the modification there. The lower layer remains unchanged.

This is "copy on first write" in the storage namespace sense. It is not driven by the MMU, PTE write-protect bits, or a user-mode page fault. It is driven by filesystem operations and overlay metadata.

You can see the structure directly:

mount | grep overlay
docker inspect <container-id> | grep -nE "UpperDir|LowerDir|MergedDir"

An illustrative OverlayFS mount looks like:

overlay on /var/lib/docker/overlay2/.../merged type overlay \
  (rw,lowerdir=/var/lib/docker/overlay2/l/...,
   upperdir=/var/lib/docker/overlay2/.../diff,
   workdir=/var/lib/docker/overlay2/.../work)

The important distinction is that container-layer CoW happens at the file or directory-entry level. Linux process-memory CoW happens at the page level. Both defer copying until mutation, but they live in different subsystems and trigger at different moments.

Container Copy-Up Has Performance Costs Of Its Own

The copy-up path in OverlayFS is not free. Modifying a file that exists in a lower layer may require:

  • creating parent directories in the upper layer
  • copying file metadata
  • copying the whole file's contents into the upper layer
  • creating whiteouts if deletions must hide lower entries

If the lower-layer file is large, the first write can be expensive. This is why container best practices often try to keep mutable paths, caches, and application data in dedicated writable volumes rather than repeatedly rewriting files that arrived in image layers.

For example, if a Java application image bakes in a 300 MiB unpacked directory and then modifies one file in place on first boot, the overlay may need to copy up much more than the operator expected. The "write penalty on first mutation" here is structurally similar to memory CoW. The granularity is just much larger.

This is also one reason image-layer design matters. Layering immutable binaries separately from mutable runtime state keeps copy-up costs predictable.

Btrfs And ZFS Snapshots Use Block-Pointer Copy-on-Write

Snapshotting filesystems such as Btrfs and ZFS use copy-on-write in a much more literal storage-engine sense. Existing blocks are never overwritten in place by ordinary updates. Instead, new data is written to new blocks, and metadata is updated to point at the new locations. A snapshot is simply an older root in that tree of block pointers, still referring to the old blocks.

Imagine a simple file whose contents live in block 700 and whose inode lives in metadata block 120. At 09:00 the filesystem creates a snapshot. The live filesystem root and the snapshot root both point, through metadata, at inode block 120 and data block 700.

At 09:05 the live filesystem overwrites part of the file. With CoW semantics the filesystem does not overwrite block 700. It allocates a new data block, perhaps 931, writes the new version there, allocates updated metadata blocks that point to 931, and then commits a new root for the live filesystem. The snapshot root still points at the old metadata and old data block 700. The two views diverge cleanly.

The old block stays valid for as long as any snapshot still references it.

That gives snapshots their efficiency. Creating a snapshot does not mean copying every block in the dataset. It means creating another consistent metadata root that initially shares all existing block references. Later writes pay the cost by allocating new blocks.

The parallel to fork() is strong:

  • snapshot root and live root share old blocks
  • first write allocates a new private block
  • old references remain valid for readers of the old view

The implementation details are storage-centric rather than MMU-centric, but the high-level economy is unmistakable.

Why Snapshotting Filesystems Need More Than Data-Block CoW

A storage CoW design lives or dies by metadata correctness. If a filesystem only wrote new data blocks but updated metadata in place, a crash could still leave the structure inconsistent. Btrfs and ZFS therefore propagate copy-on-write up the tree.

For example, changing one file block may require:

  1. write a new data block
  2. write a new leaf metadata block referencing that data block
  3. write a new internal metadata block referencing the new leaf
  4. write a new root pointer or transaction-group commit record

This is one reason snapshotting filesystems often pair naturally with checksums and transactional group commits. The model is not "overwrite block 700 carefully". The model is "build a new valid tree and then make that tree the current root". Older trees remain readable until nothing references them.

That is also why random write amplification can be higher on CoW filesystems, especially for databases that like in-place page updates. Changing a small part of a file may cascade into extra metadata writes and fragmented layout. Operators who put high-write OLTP systems on Btrfs or ZFS usually pay close attention to record sizes, snapshots, compression, and database-specific tuning because the CoW semantics are valuable but not free.

Snapshot Send, Replication, And Why CoW Helps Backups

Snapshotting filesystems gain more than cheap local snapshots. Because the old and new trees share most blocks, the system can often compute an efficient incremental stream between snapshots. ZFS send/receive and Btrfs send exploit exactly this. The filesystem already knows which blocks and metadata changed between roots, so it can serialise a compact delta.

This again resembles process-memory CoW in spirit. Two versions share most state. The system tracks divergence through later writes. That makes delta transmission possible because the shared base is explicit.

For operational work this is a big deal. A machine in Vienna can create frequent snapshots, replicate only changed blocks to another host, and keep several historical versions without copying entire volumes each time. The lazy duplication economy turns into a backup and recovery advantage.

Databases, MVCC, And Why People Mention CoW Even When The Engine Is Different

Database MVCC, multi-version concurrency control, often gets described as copy-on-write because old and new versions of data can coexist. The analogy is useful, but it can mislead if taken too literally.

In PostgreSQL, an update does not overwrite a tuple in place. It creates a new tuple version and marks the old one obsolete for future transactions once visibility rules allow it. Readers with older snapshots can still see the old version. Newer readers see the new tuple. This feels CoW-like because the system keeps both versions alive until they are no longer needed.

In InnoDB, the clustered index page may be updated in place, but undo records preserve enough old information that transactions can reconstruct earlier versions according to their read view. That is MVCC too, but the physical storage path is not simply "allocate a fresh page on first write".

SQLite in WAL mode again uses a different pattern. The base database file remains untouched while new page versions are appended to the WAL. Readers keep using the old consistent view until checkpoints merge newer pages back. That also resembles CoW economics because the old state remains readable while the new state is written elsewhere.

The shared idea is version preservation under concurrent reads and writes. The physical mechanisms vary:

  • PostgreSQL: new tuple version plus visibility metadata
  • InnoDB: current page plus undo log for old versions
  • SQLite WAL: old main-file page plus newer WAL frame
  • LMDB: whole B+tree pages copied and new roots committed, much closer to storage-level CoW

So yes, databases are part of the copy-on-write family in the broad engineering sense, but each engine chooses a different granularity and different place to store the "old reality" versus the "new reality".

WAL And CoW Solve Different Problems, But They Often Meet

Write-ahead logging and copy-on-write get mentioned together because both help with safe updates, yet they solve different problems.

WAL is primarily about crash recovery. Before a database changes persistent pages, it writes enough log information that recovery can replay or undo the update after a crash.

Copy-on-write is primarily about version preservation and deferred duplication. Rather than overwriting the only copy of something, the system keeps the old version reachable and writes a new version elsewhere or on demand.

Some systems combine both. ZFS uses transaction groups and checksummed CoW trees rather than a classic database-style WAL for user data, though it still has a separate intent log for specific synchronous semantics. PostgreSQL uses WAL but not a pure CoW page architecture. LMDB is much closer to textbook storage CoW, where updates build a new B+tree path and commit by publishing a new root page. RocksDB uses an LSM design with memtables and immutable SSTables, again a different family.

The practical takeaway is this: when someone says "it is copy-on-write, like a database", ask which layer and which granularity. Are we talking about tuple versions, undo chains, appended pages, snapshot roots, or actual page-table CoW? The phrase alone is not enough.

SQLite WAL Mode Is A Useful Bridge Between Filesystems And Databases

SQLite's WAL mode is one of the cleanest bridges between database recovery and CoW-style versioning. The main database file stays in a stable old state. Writers append new page images to the WAL file. Readers keep using the database file plus an appropriate visibility point in the WAL. A checkpoint later copies newer page images back into the main file.

That arrangement means:

  • readers do not block writers the way they would under a rollback journal
  • the old database image remains readable during the write burst
  • the first durable write lands in the append-only WAL

It is not identical to filesystem block CoW, but it captures some of the same operational benefits. Old state remains available. New state is written elsewhere first. Consolidation happens later.

This is one reason SQLite WAL mode is easy to explain to people who already understand snapshotting filesystems. The base file is the old tree, the WAL contains newer page versions, and the checkpoint is the moment the system folds those versions back into the canonical file.

LMDB Is One Of The Purest Database Copy-on-Write Designs

If you want a database that is structurally closest to filesystem-style CoW, LMDB is a good example. It stores data in a memory-mapped B+tree. A write transaction does not mutate the old tree in place. It copies the path of pages that change, updates pointers upward, and commits by publishing a new root. Readers keep using the old root until they finish.

That has several consequences:

  • readers are cheap
  • a consistent snapshot is implicit in the root they started from
  • no traditional WAL is needed for crash recovery, unlike PostgreSQL, because the previous root stays valid until the new one is published
  • free-space reuse and long-lived readers require care because old pages remain pinned until nobody references them

LMDB shows the most literal database version of copy-on-write: preserve the old tree, build a new tree path, flip the root atomically.

It is a useful mental counterpoint to PostgreSQL and InnoDB because it makes clear that "MVCC database" and "copy-on-write database" are related categories, not synonyms.

The First Write Is The Tax, Granularity Decides How Painful It Feels

Across all these systems, the first write after sharing is the expensive moment. The size of that tax depends on granularity.

In Linux process memory, the tax is usually one 4 KiB page copy, unless huge pages complicate things.

In OverlayFS, the tax may be copying an entire file into the upper layer.

In ZFS or Btrfs, the tax may be a new data block plus several metadata blocks up the tree.

In PostgreSQL MVCC, the tax is a new tuple version plus WAL and later vacuum cleanup.

In LMDB, the tax is copying the modified B+tree path.

This is the unifying operational question. Not "is it copy-on-write?" but "what is the unit of duplication when the first write happens?" That unit determines latency spikes, write amplification, fragmentation risk, and memory or storage growth.
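A back-of-envelope calculation makes the point concrete. The figures below are illustrative assumptions, not measurements: a 4 KiB page, a hypothetical 200 MiB file in a lower layer, and a hypothetical four-level tree of 4 KiB pages.

```python
KIB = 1024
MIB = 1024 * KIB


def first_write_amplification(logical_bytes, unit_bytes, units_copied=1):
    """Bytes physically duplicated per logical byte changed on the first write."""
    return (unit_bytes * units_copied) / logical_bytes


# Cost of changing a single byte under each (assumed) unit of duplication.
one_byte = {
    "process page": first_write_amplification(1, 4 * KIB),
    "whole-file copy-up": first_write_amplification(1, 200 * MIB),
    "tree path, depth 4": first_write_amplification(1, 4 * KIB, units_copied=4),
}
```

The same one-byte change costs thousands of bytes in one design and hundreds of millions in another, which is why the unit of duplication matters more than the label.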

Debugging And Observability, How To Tell When CoW Is Actually Happening

For Linux process memory, the most useful tools are:

cat /proc/$PID/smaps
perf stat -e minor-faults,major-faults ./workload
grep -nE "AnonHugePages|Private_Dirty|Shared_Dirty" /proc/$PID/smaps

smaps shows whether pages are shared or private. perf stat reveals fault pressure. minor-faults rising during post-fork writes usually means the workload is paying CoW costs.
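Reading smaps by eye gets tedious for processes with hundreds of mappings, so it can help to total the shared and private counters. The helper below is a small sketch; sum_smaps is a hypothetical name, and it assumes the usual "Field:   N kB" line layout of /proc/PID/smaps.

```python
import re
from collections import Counter

# Matches lines such as "Private_Dirty:        16 kB".
_FIELD = re.compile(r"^(\w+):\s+(\d+) kB$")

DEFAULT_FIELDS = ("Shared_Clean", "Shared_Dirty", "Private_Clean", "Private_Dirty")


def sum_smaps(text, fields=DEFAULT_FIELDS):
    """Total the selected per-mapping counters (in kB) across all mappings."""
    totals = Counter()
    for line in text.splitlines():
        match = _FIELD.match(line.strip())
        if match and match.group(1) in fields:
            totals[match.group(1)] += int(match.group(2))
    return dict(totals)
```

On Linux, calling sum_smaps(open("/proc/self/smaps").read()) before and after a burst of post-fork writes should show Private_Dirty growing as shared pages are replaced by private copies.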

For container layers:

docker inspect <container-id>
du -sh /var/lib/docker/overlay2/*/diff
mount | grep overlay

Growth in the upper diff directory indicates copy-up activity.

For Btrfs and ZFS:

btrfs subvolume list /
btrfs filesystem df /
zfs list -t snapshot
zfs get written pool/dataset@snap1

Snapshot counts and "written since snapshot" statistics tell you how much divergence has accumulated.

For PostgreSQL and SQLite:

psql -c "SELECT * FROM pg_stat_bgwriter;"
psql -c "SELECT relname, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"
sqlite3 db.sqlite "PRAGMA journal_mode;"
sqlite3 db.sqlite "PRAGMA wal_checkpoint(PASSIVE);"

These are not direct CoW counters in the MMU sense, but they show the health of the version-preserving machinery and the cleanup work it induces.

Why posix_spawn() Exists Even Though fork() Has Copy-on-Write

If copy-on-write makes fork() cheap, it is reasonable to ask why modern runtimes and C libraries increasingly prefer posix_spawn() for launching subprocesses.

The short answer is that copy-on-write removes the cost of copying data pages, but it does not remove every other cost of cloning a large process. A big process can still have:

  • many VMAs to duplicate
  • large page-table trees to walk and copy
  • many file descriptors to audit or rearrange
  • multi-threaded state that makes the post-fork child path delicate
  • allocator metadata that is awkward to touch safely before execve()

posix_spawn() gives the implementation more freedom. On Linux, older glibc releases implemented it with fork(), while newer ones use a vfork-style clone() in which the child borrows the parent's address space until execve(), so the page tables are not duplicated at all and the child does less fragile work. For languages such as Python, Ruby, and Java, the distinction matters because the parent process can be large, multi-threaded, and loaded with runtime state that should not be copied or manipulated casually in the narrow interval before execve().

This is a useful reminder that copy-on-write optimises one part of process creation, the data-copy path, rather than erasing the full process-launch cost model.
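From a high-level runtime the spawn path looks like the sketch below. It uses Python's os.posix_spawn, which is available on Unix in Python 3.8 and later; spawn_and_wait is a hypothetical helper name, and the sketch assumes argv[0] is an absolute path to an executable.

```python
import os
import sys


def spawn_and_wait(argv):
    """Launch argv[0] via posix_spawn and return its exit status (-1 if it did not exit normally)."""
    # No Python code runs in the child between creation and exec, which is
    # exactly the fragile window that a manual fork() + exec() would expose.
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

For example, spawn_and_wait([sys.executable, "-c", "raise SystemExit(3)"]) launches a second interpreter and reports its exit status without the parent ever running fork() from Python code.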

Copy-on-Write Interacts With The Page Cache In Subtle Ways

Anonymous memory and file-backed memory behave differently once writes begin. A file-backed MAP_PRIVATE mapping starts life backed by the page cache. Reads share ordinary file pages. The first write does not mutate the page-cache copy that other processes may still use. Instead, Linux allocates a new anonymous page and remaps that process's virtual page to the private anonymous copy.

That means one virtual region can change identity over time:

  1. first it points at a file-backed page-cache page
  2. then one process writes
  3. after the fault it points at an anonymous private page
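The three steps above can be observed from user space. The sketch below relies on Python's mmap.ACCESS_COPY, which on Unix requests a MAP_PRIVATE mapping; private_write_demo is a hypothetical name and the file contents are arbitrary.

```python
import mmap
import os
import tempfile


def private_write_demo():
    """Write through a private mapping and return (bytes in the mapping, bytes in the file)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello world\n")
        path = f.name
    try:
        with open(path, "r+b") as f:
            # ACCESS_COPY gives a copy-on-write mapping of the file.
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
            m[0:5] = b"HELLO"            # first write: the page becomes a private copy
            in_mapping = bytes(m[0:5])
            m.close()
        with open(path, "rb") as f:
            in_file = f.read(5)          # the file itself was never modified
        return in_mapping, in_file
    finally:
        os.unlink(path)
```

The mapping shows the modified bytes while the file still holds the original ones, which is the divergence between the private anonymous copy and the file contents.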

This matters for tools and performance interpretation. Two processes reading the same mapped file can share physical pages nicely. Once one of them writes through a private mapping, some of those pages are no longer part of the shared page cache for that process. They have become anonymous memory with different reclaim and accounting behaviour.

It also matters for storage-heavy software that memory maps large files, changes small parts, and assumes the changes remain closely tied to file IO. Once private faults happen, the memory path and the file path diverge. Pages modified through a MAP_PRIVATE mapping are private anonymous copies that are never written back to the file. msync() does not carry them through, and under memory pressure they go to swap rather than to the file. Only a MAP_SHARED mapping pushes changes back to the file through msync() or normal writeback.

That distinction is one reason memory-mapped IO looks simple in source code but can be trickier in production. A read-mostly private mapping can be highly shareable. A write-heavy private mapping can quietly turn into lots of anonymous dirty memory that is charged to the process and can only be reclaimed through swap.

Why Databases On Snapshotting Filesystems Need Extra Care

Databases often already have their own version-preserving and crash-recovery machinery. PostgreSQL has WAL plus MVCC. InnoDB has redo plus undo plus its own page layout. Put one of these engines on top of a snapshotting CoW filesystem and you now have two different layers preserving history and redirecting writes.

That can work well, but it changes the write profile in ways operators have to respect.

Suppose PostgreSQL updates one 8 KiB heap page on a ZFS dataset. PostgreSQL may:

  • generate WAL for the change
  • dirty the corresponding data page
  • later write the updated page out

ZFS then may:

  • allocate a new block for the changed page rather than overwriting the old block
  • update indirect metadata blocks on the path to the root
  • preserve the old blocks for any snapshots that still reference them

The stack is durable, but the write amplification can be real. The database thinks in page updates plus WAL. The filesystem thinks in new blocks plus metadata-root updates. Neither layer is wrong. They just have different invariants and both get to charge a write tax.
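A rough estimate shows how the two taxes add up. Everything here is a labeled assumption rather than a measurement: the WAL is assumed to carry a full 8 KiB page image, the filesystem is assumed to rewrite one whole record per page update, and three 4 KiB metadata blocks are assumed on the path to the root. The 128 KiB default is the stock ZFS recordsize; stacked_write_bytes is a hypothetical helper.

```python
def stacked_write_bytes(
    page_bytes=8 * 1024,            # PostgreSQL heap page
    wal_bytes=8 * 1024,             # assumed: full-page image in the WAL
    fs_record_bytes=128 * 1024,     # filesystem record rewritten for the page
    metadata_blocks=3,              # assumed: indirect blocks up to the root
    metadata_block_bytes=4 * 1024,  # assumed size of each metadata block
):
    """Estimated physical bytes written for one logical page update."""
    return wal_bytes + fs_record_bytes + metadata_blocks * metadata_block_bytes


default_total = stacked_write_bytes()
tuned_total = stacked_write_bytes(fs_record_bytes=8 * 1024)

# Amplification relative to the single 8 KiB page the database changed.
default_amplification = default_total / (8 * 1024)
tuned_amplification = tuned_total / (8 * 1024)
```

Under these assumptions, matching the filesystem record size to the database page size cuts the estimate from roughly 18x to roughly 3.5x. Real numbers depend on compression, write batching, and checkpoint timing.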

This is why operators on Btrfs and ZFS pay attention to:

  • record or block sizes
  • snapshot churn
  • compression settings
  • database checkpoint and fsync cadence
  • whether the workload is append-heavy, update-heavy, or random-write-heavy

The same broad "preserve old versions until it is safe to retire them" idea is present in both layers. The combined cost can still be higher than people expect if they assume one layer's model fully replaces the other's.

Garbage Collection And Snapshot Retention Are The Price Of Old Versions

Copy-on-write does not eliminate cleanup. It postpones some of it and shifts it into reclaim work.

In Linux process memory, old private copies vanish when reference counts drop to zero. The cleanup path is straightforward. Once the parent exits or unmaps a region, and the child has already diverged, the old pages can be freed as soon as nobody references them.

In filesystems and databases, old versions can persist much longer:

  • a snapshot still references the old block
  • a reader transaction still needs the old tuple version
  • an undo record still anchors the historical view
  • an upper filesystem layer still hides a lower file via whiteouts and copy-up state

That means every CoW-capable design needs a retirement policy. In PostgreSQL it is vacuum. In ZFS and Btrfs it is snapshot deletion plus the space-reclamation machinery that follows. In overlay filesystems it is container teardown and layer garbage collection. In LMDB it is freeing pages no active reader still pins through an older root.

This is one of the places where the cost of old-version preservation becomes operationally obvious. Storage does not return simply because the new version exists. The system must also prove the old version is no longer reachable by any valid reader or snapshot.
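The reachability rule can be sketched with a toy version store. This is a simplification with hypothetical names: it pins individual versions with a reader count, whereas real engines typically track an oldest-reader horizon or snapshot reference instead.

```python
class VersionStore:
    """Toy store that keeps old versions until no reader pins them."""

    def __init__(self, initial):
        self.versions = {0: initial}  # version id -> state
        self.current = 0
        self.pins = {}                # version id -> number of active readers

    def begin_read(self):
        version = self.current
        self.pins[version] = self.pins.get(version, 0) + 1
        return version

    def end_read(self, version):
        self.pins[version] -= 1
        if self.pins[version] == 0:
            del self.pins[version]

    def commit(self, state):
        # A new version appears; nothing is freed at commit time.
        self.current += 1
        self.versions[self.current] = state

    def reclaim(self):
        """Free every non-current version that no reader pins; return the freed ids."""
        freed = [v for v in self.versions if v != self.current and v not in self.pins]
        for v in freed:
            del self.versions[v]
        return freed
```

A reader that began at version 0 keeps that version alive across any number of commits. Space returns only after end_read() and a later reclaim() pass, which is the pattern behind long-running transactions bloating a database.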

Cloud Snapshots Feel Instant Because They Are Pointer Tricks First

Cloud block-volume snapshots from providers such as AWS EBS, Azure Managed Disks, or similar services are often described as instant. They are not instant in the sense of physically copying every block at that moment. They are instant because the provider records a new logical point-in-time view and then lets later writes diverge from that baseline.

This is once again the same deferred-duplication bargain:

  • keep the existing blocks as the old view
  • let new writes allocate or reference new storage internally
  • track which later blocks belong to the live volume versus the snapshot chain

The user experiences a fast snapshot operation because the full copy is not done eagerly. The provider can then replicate or compact the changed blocks later.
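A toy block volume shows why the snapshot call itself is cheap. This is a sketch of the general pointer-map idea only, not a description of how any provider implements it; Volume and its methods are hypothetical.

```python
class Volume:
    """Toy volume: a logical-to-physical block map plus named snapshots of that map."""

    def __init__(self, nblocks):
        self.store = {}       # physical block id -> data
        self.next_id = 0
        self.live = [self._alloc(b"\0") for _ in range(nblocks)]
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def _alloc(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.store[block_id] = data
        return block_id

    def snapshot(self, name):
        # Copies pointers only; no data block is duplicated here.
        self.snapshots[name] = list(self.live)

    def write(self, lba, data):
        old = self.live[lba]
        if any(old in table for table in self.snapshots.values()):
            self.live[lba] = self._alloc(data)  # redirect: a snapshot still needs the old block
        else:
            self.store[old] = data              # private block: overwrite in place

    def read(self, lba, snapshot=None):
        table = self.live if snapshot is None else self.snapshots[snapshot]
        return self.store[table[lba]]
```

Taking a snapshot allocates no data blocks. The first write to a block the snapshot still references allocates exactly one new block, and later writes to that same block go in place again.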

CoW Changes Failure Domains As Well As Performance

One quiet advantage of copy-on-write designs is that they often preserve an older valid version while a newer one is still being assembled. In process memory this mostly matters for correctness during a page-fault handoff. In filesystems and databases it matters much more because it changes what a crash can destroy.

If a system overwrites the only copy of a block in place, a torn write or partial metadata update can corrupt the one structure every reader depends on. If the system writes a new version elsewhere and only later publishes a new root or pointer, the old version can survive the interruption intact. The publication step becomes the critical moment rather than the data overwrite itself.

This is one reason snapshotting filesystems, append-first database modes, and root-flip designs often pair naturally with integrity checks and transactional commits. Deferred duplication does not only save work up front. It can also preserve a previously valid view long enough that recovery has something trustworthy to fall back to.
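The same write-elsewhere-then-publish shape is available to ordinary application code through an atomic rename. The sketch below assumes a POSIX filesystem where rename within one directory is atomic; publish_atomically is a hypothetical helper, and it skips the directory fsync a stricter implementation would add.

```python
import os
import tempfile


def publish_atomically(path, data):
    """Write `data` to a temporary file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # new version is durable before it is visible
        os.replace(tmp_path, path)   # publication step: old or new, never a mix
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)      # abandon the half-built version
        raise
```

If the process dies before os.replace(), readers still find the previous file intact. The only moment that matters for consistency is the rename.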

The cloud snapshot case also reinforces how broad the pattern is. A local Linux fork(), a ZFS snapshot, and a cloud block snapshot all feel different in daily use, but the optimisation is recognisably related. Share a stable baseline now. Pay the full duplication cost only where later writes make sharing impossible.

The Mental Model That Usually Holds Up

A reliable mental model is:

  1. the system starts with one canonical version of some state
  2. another consumer, snapshot, or process wants to observe that state without paying for a full duplicate immediately
  3. the system shares the existing representation and records enough metadata to know that it is shared
  4. the first writer pays the tax by allocating a private replacement at whatever granularity the subsystem uses
  5. old readers keep seeing the old version until they no longer need it

That model applies to:

  • Linux anonymous pages after fork()
  • MAP_PRIVATE file mappings
  • OverlayFS copy-up
  • Btrfs and ZFS snapshots
  • SQLite WAL mode in a broad version-preserving sense
  • MVCC row or page versioning, depending on the engine

It does not mean the implementations are interchangeable. The hardware fault path of Linux CoW and the metadata-tree updates of ZFS are worlds apart. The model is about deferred duplication economics, not about one shared code path.
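The five-step model fits in a small reference-counted buffer. This sketch is an illustration of the economics only, with hypothetical names, and it mirrors none of the subsystems above in detail.

```python
class CowBuffer:
    """Buffer that shares its payload with clones until one of them writes."""

    def __init__(self, data):
        # [payload, share count]: the metadata that records the sharing.
        self._box = [bytearray(data), 1]

    def clone(self):
        other = CowBuffer.__new__(CowBuffer)
        other._box = self._box   # share the representation, no copy yet
        self._box[1] += 1
        return other

    def read(self):
        return bytes(self._box[0])

    def write(self, index, value):
        if self._box[1] > 1:
            # First write while shared: pay the tax with a private replacement.
            self._box[1] -= 1
            self._box = [bytearray(self._box[0]), 1]
        self._box[0][index] = value
```

A clone costs one pointer and a counter increment. The first write to a shared buffer copies the payload, and every later write by the same owner goes in place because the count is back to one.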

What Most People Get Wrong About Copy-on-Write

Three misunderstandings show up repeatedly.

The first is assuming CoW means "no copy". It does not. It means "copy later, only if needed". The copy may still be large and expensive. The whole design is a bet that many objects will never diverge, or that deferring the tax improves aggregate performance.

The second is assuming every system that preserves old versions is doing the same thing. A PostgreSQL tuple chain, an InnoDB undo log, a Linux CoW page fault, and a Btrfs snapshot all preserve history differently. The phrase is useful only if you also name the layer.

The third is assuming CoW is always beneficial. It is often excellent. It can also produce heavy first-write latency, fragmentation, delayed cleanup work, and unpleasant surprises when a supposedly shared structure becomes hot and mutable.

The Most Practical Way To Use The Idea

When you hear "copy-on-write" in design discussions, ask four precise questions:

  1. what object is being shared first, page, file, block, tuple, or tree path?
  2. how is the sharing tracked, refcounts, metadata roots, overlay entries, undo records?
  3. what exact event triggers divergence?
  4. what is the size of the first-write tax?

Those four questions almost always reveal whether the design fits the workload.

For process creation on Linux, the answers are favourable. Pages are small, divergence often never happens, and fork()+exec() benefits massively.

For container layers, the answers depend on file sizes and write patterns.

For snapshotting filesystems, the answers depend on fragmentation tolerance, snapshot count, and write locality.

For databases, the answers depend on concurrency model, cleanup policy, and crash-recovery design.

The Right Place To Leave It

Copy-on-write is not a single feature. It is a recurring engineering move. Share first, duplicate only when mutation proves that sharing has reached its limit.

On Linux that move appears in fork(), page tables, MAP_PRIVATE, KSM, and sometimes huge-page corner cases. In containers it appears as layer sharing and copy-up. In Btrfs and ZFS it appears as old roots and new blocks. In databases it appears as retained old versions, appended page images, undo chains, or fully new tree paths depending on the engine.

The common value is economy. The common risk is that the first write is where deferred work becomes visible.

Once you see both sides, the phrase stops sounding like a buzzword and starts describing a concrete trade. Keep old state shareable. Make the write path pay when divergence actually happens. That trade is one of the reasons modern systems can launch processes quickly, snapshot huge datasets cheaply, run containers densely, and let readers observe consistent old views while writers keep moving.

The details change by layer. The bargain stays the same.