05-04-2026

How Virtual Memory Actually Works

Every process on your machine believes it owns the whole address space. A Firefox tab in Berlin mapping a JavaScript heap at address 0x7f3c00000000 does not know, and does not need to know, that the IntelliJ instance running next to it also has something at that address. They cannot see each other's memory. They cannot even describe it, because in each of their private worlds the word "memory" points at entirely different physical bytes.

This illusion is called virtual memory, and it is one of the oldest and most load-bearing tricks in computing. Programs are written against a flat, private, enormous address space. The hardware translates every single load and store from that imaginary space into a real physical address in DRAM, transparently, at cache-line speed, billions of times per second. When there is no physical memory to back a page, the kernel fills one in on demand. When the same data is needed by two processes, the kernel quietly points them at the same physical page and marks it read-only. When the physical RAM fills up, the kernel pushes cold pages out to swap and pretends nothing happened.

Most of the time, virtual memory is invisible. You allocate with malloc, you get a pointer, you dereference it, the bytes are there. But the moment performance matters, or you start measuring latency tails, or you wonder why your 16 GB laptop just swapped to disk while using 4 GB, the abstraction starts leaking. This article explains what is actually happening under the pointer, with enough detail that the leaks start making sense.

Why Virtual Memory Exists At All

In the late 1950s, programs ran on physical addresses. A programmer on an IBM 704 in a Parisian lab would write code that directly referenced core memory at address 020100, and if two programs needed the same address at the same time, one of them lost. There was no isolation. There was no way to run a program larger than physical memory. There was no way to relocate a binary after loading. Every hardcoded address was a hand grenade with the pin pulled.

Virtual memory solves four problems at once, and this is why it became universal rather than optional.

Isolation. Two processes can dereference the same pointer value and land on different physical bytes. One process cannot read or write another's memory unless the kernel explicitly sets that up. This is the foundation of multi-user operating systems and every modern sandbox.

Relocation. The linker can pretend every binary loads at a fixed base address. The kernel and the dynamic loader then map that fixed virtual address to whatever physical frames happen to be free. Address Space Layout Randomisation (ASLR) takes this further by picking a random virtual base on every exec, which makes buffer overflow exploitation much harder.

Overcommit. A process can reserve a 1 TB virtual region even if the machine only has 16 GB of RAM. No physical memory is allocated until a page is actually touched. This is how things like sparse arrays, memory-mapped databases, and JVM heaps can exist on modest hardware.

Swapping and demand paging. If physical RAM is scarce, cold virtual pages can be written to disk and evicted from RAM. When the process touches them again, the kernel faults them back in. From the program's point of view, its entire working set is always in memory. From the kernel's point of view, only the currently-hot fraction is.

All four of these properties hang off one piece of hardware machinery: the Memory Management Unit (MMU), and the data structure it consults on every memory access, the page table.

The Virtual Address On x86_64

Pick an instruction at random from a running Linux process. Say it is a load from address 0x00007f3c45a72820. The CPU cannot issue this address directly to the memory controller. It must first translate it into a physical address, and it does this by walking a page table.

On x86_64, the hardware supports several paging modes, but the one every modern desktop and server Linux uses is 4-level paging with 4 KiB pages and 48-bit virtual addresses. A virtual address is carved up like this:

 63      48 47    39 38    30 29    21 20    12 11      0
+----------+--------+--------+--------+--------+----------+
| sign ext | PML4   | PDPT   | PD     | PT     | offset   |
|          | index  | index  | index  | index  | in page  |
+----------+--------+--------+--------+--------+----------+
  16 bits    9 bits   9 bits   9 bits   9 bits  12 bits

The top 16 bits are sign-extended copies of bit 47. If bit 47 is 0, the top bits must all be 0 (user space, 0x0000000000000000 to 0x00007fffffffffff). If bit 47 is 1, the top bits must all be 1 (kernel space, 0xffff800000000000 and up). Addresses with inconsistent top bits are "non-canonical" and generate a general protection fault. This splits the 2^48-byte virtual address space into two halves, user and kernel, with an enormous forbidden gulf between them. Userspace gets 128 TiB. Kernel space gets 128 TiB. The gulf exists because the hardware only implements 48 address bits. Insisting on canonical form stops software from stashing data in the unused top bits, which keeps those bits free for wider addressing later (5-level paging extends the usable width to 57 bits).
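The canonical-address rule is easy to check in software. The following is a minimal sketch, assuming 48-bit virtual addresses; the function name is_canonical is made up for illustration and is not a kernel or hardware API.

```python
def is_canonical(vaddr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = vaddr >> (va_bits - 1)               # bit 47 and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


print(is_canonical(0x00007FFFFFFFFFFF))  # highest user-half address -> True
print(is_canonical(0xFFFF800000000000))  # lowest kernel-half address -> True
print(is_canonical(0x0000800000000000))  # first address inside the gulf -> False
```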

The remaining 48 bits are an index into four levels of page tables, plus a 12-bit offset into the final 4 KiB physical page. Each page table is itself exactly one 4 KiB page, holding 512 entries of 8 bytes each. 9 bits of virtual address pick the entry, and each entry points at the next-level table or at the final data page.

To translate 0x00007f3c45a72820:

binary of 0x00007f3c45a72820 =
0000 0000 0000 0000 0111 1111 0011 1100
0100 0101 1010 0111 0010 1000 0010 0000
 
PML4 index (bits 47-39)  = 0 1111 1110    = 0x0FE = 254
PDPT index (bits 38-30)  = 0 1111 0001    = 0x0F1 = 241
PD   index (bits 29-21)  = 0 0010 1101    = 0x02D = 45
PT   index (bits 20-12)  = 0 0111 0010    = 0x072 = 114
offset     (bits 11-0)   = 1000 0010 0000 = 0x820 = 2080

The CPU starts from the physical address held in control register CR3. CR3 points at the top of the 4-level tree, the PML4 table for the current process. It reads entry 254 of that page, which gives it the physical address of a PDPT. It reads entry 241 of the PDPT, which gives it the physical address of a PD. It reads entry 45 of the PD, which gives it the physical address of a PT. It reads entry 114 of the PT, which gives the physical frame number of a 4 KiB page of actual data. Finally it adds offset 0x820 and issues that as the real physical memory access.
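The index extraction and the four dependent loads can be sketched in a few lines. This is a toy model, not how an MMU is programmed: split_vaddr and walk are illustrative names, and the page tables are modelled as a plain dictionary keyed by made-up physical addresses.

```python
def split_vaddr(vaddr: int) -> dict:
    """Carve a 48-bit virtual address into four 9-bit indices and a 12-bit offset."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,
        "pdpt": (vaddr >> 30) & 0x1FF,
        "pd": (vaddr >> 21) & 0x1FF,
        "pt": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }


def walk(tables: dict, cr3: int, vaddr: int) -> int:
    """Follow the tree from cr3; `tables` maps a table's address to {index: next address}."""
    idx = split_vaddr(vaddr)
    addr = cr3
    for level in ("pml4", "pdpt", "pd", "pt"):
        addr = tables[addr][idx[level]]   # one dependent load per level
    return addr + idx["offset"]           # addr is now the base of the data frame


if __name__ == "__main__":
    print(split_vaddr(0x00007F3C45A72820))
```

A missing key in this model corresponds to a non-present entry, which on real hardware would raise a page fault rather than a KeyError.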

Four dependent memory loads per translation. If every translation really took four DRAM accesses, even a simple loop would run at a crawl. Which is why the CPU caches them aggressively.

The Translation Lookaside Buffer

Every modern CPU has a small, fast, content-addressable cache of recently used page table entries called the Translation Lookaside Buffer. The TLB maps virtual page number directly to physical frame number, skipping the whole four-level walk.

A typical AMD Zen or Intel Raptor Lake core has multiple TLBs, organised in levels just like data caches.

L1 DTLB: around 64 to 96 entries, split by page size (4 KiB, 2 MiB, 1 GiB). Latency: 1 cycle.
L1 ITLB: similar size for instructions.
L2 TLB: shared between data and instructions, around 2,048 to 3,072 entries. Latency: a few cycles.

64 entries for 4 KiB pages is 256 KiB of reachable memory in the L1 DTLB. 2,048 entries is 8 MiB. Beyond that, the CPU starts walking page tables on TLB misses, and every miss is four cacheable reads. If the page tables themselves are hot in L1 or L2 data cache, the walk takes tens of cycles. If the page tables are cold and must be fetched from L3 or DRAM, a single TLB miss can cost hundreds of nanoseconds.
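The reach arithmetic is worth having as a function you can plug your own numbers into. The sketch below also includes a deliberately crude miss estimate; it assumes uniformly random accesses and a fully associative TLB, neither of which is true of real hardware, so treat it as an upper-level intuition rather than a prediction.

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory that can be translated without a page walk."""
    return entries * page_size


def miss_fraction(working_set: int, entries: int, page_size: int) -> float:
    """Crude estimate: share of uniformly random accesses that fall outside TLB reach."""
    return max(0.0, 1.0 - tlb_reach(entries, page_size) / working_set)


print(tlb_reach(64, 4 * KIB) // KIB, "KiB")     # L1 DTLB reach with 4 KiB pages
print(tlb_reach(2048, 4 * KIB) // MIB, "MiB")   # L2 TLB reach with 4 KiB pages
print(round(miss_fraction(8 * GIB, 2048, 4 * KIB), 3))
```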

This is why TLB behaviour dominates the performance of memory-heavy workloads. Random access over a few megabytes of working set can be fine. Random access over a few gigabytes of working set, with 4 KiB pages, blows out the TLB and adds a page walk to every single access. The solution is huge pages, which we will come back to.

On a context switch, things get awkward. The new process has a different virtual-to-physical mapping. The TLB entries for the old process are now wrong. Without help, the kernel would have to flush the entire TLB on every context switch by writing CR3, which would make context switches eye-wateringly expensive.

The help is PCID (Process Context Identifier), an x86_64 feature that Linux has used aggressively since around 2018. Each TLB entry is tagged with a 12-bit PCID that identifies which address space it belongs to. On context switch, the kernel writes CR3 with the new page table base and the new PCID, and the CPU simply ignores TLB entries with different PCIDs rather than flushing them. Returning to the old process later finds its TLB entries still present. PCID turned what used to be a hard cost into a very cheap one.

Walking the Tables In Anger

It helps to look at a page table entry directly. On x86_64, each 8-byte entry looks like this:

63        52 51                                12 11    9 8 7 6 5 4 3 2 1 0
+-----------+-----------------------------------+-------+-+-+-+-+-+-+-+-+-+
| reserved  | physical frame number             | avail |G|S|D|A|C|W|U|W|P|
| / NX bit  | (bits 51..12)                     |       | |/| | |D|T|/|/| |
|           |                                   |       | |P| | | | |S|R| |
+-----------+-----------------------------------+-------+-+-+-+-+-+-+-+-+-+

The lower 12 bits are flags. Bit 0 (P) is "present"; if it is clear, the whole entry is empty and touching the page raises a fault. Bit 1 (W) controls whether writes are allowed. Bit 2 (U/S) controls whether user mode can access it at all. Bit 5 (A, "accessed") is set by the CPU whenever the page is touched. Bit 6 (D, "dirty") is set on write. Bit 7 (PS) on non-leaf entries promotes them to huge page leaves. Bit 8 (G) marks the entry global, so TLB flushes skip it, used for kernel mappings. Bit 63 (NX, "no execute") prevents code execution from the page.

The middle bits hold a 40-bit physical frame number. Multiply it by 4 KiB and you get the physical address of either the next-level table or, at the leaf, the actual data page. Current x86_64 chips do not implement all 52 physical address bits that the format allows; many real machines have around 46 bits of physical addressing, which is enough for 64 TiB.
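Decoding an entry by hand is a good way to make the layout stick. The sketch below follows the bit positions described above; decode_pte and the flag abbreviations are illustrative names, and the sample entry is invented.

```python
FLAG_BITS = {
    "P": 0, "RW": 1, "US": 2, "PWT": 3, "PCD": 4,
    "A": 5, "D": 6, "PS": 7, "G": 8, "NX": 63,
}
FRAME_MASK = ((1 << 52) - 1) & ~0xFFF   # bits 51..12


def decode_pte(entry: int) -> dict:
    """Split a 64-bit page table entry into its set flags and its frame number."""
    flags = {name for name, bit in FLAG_BITS.items() if (entry >> bit) & 1}
    phys = entry & FRAME_MASK
    return {"flags": flags, "pfn": phys >> 12, "phys_addr": phys}


# Invented example: present, writable, user, accessed, dirty, no-execute data page.
sample = (1 << 63) | (0x12345 << 12) | 0b1100111
print(decode_pte(sample))
```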

Linux defines a struct page for every 4 KiB physical frame in RAM. It represents the frame's reference count, its LRU position, its page cache state, and a pointer back to whatever is using it. On a machine with 64 GiB of RAM, there are 16 million struct page objects, each around 64 bytes, consuming roughly 1 GiB just for the bookkeeping. This is why extremely memory-rich systems sometimes turn to alternatives like ZONE_DEVICE.

You can inspect the page tables of a live process via /proc/PID/pagemap and /proc/kpageflags. A short C snippet that resolves a virtual address to its physical frame looks like this:

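#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
 
/* Resolves a virtual address of THIS process; the address must be mapped here. */
int main(int argc, char **argv) {
    if (argc != 2) return 1;
    uint64_t vaddr = strtoull(argv[1], NULL, 0);
    uint64_t page_size = (uint64_t)sysconf(_SC_PAGESIZE);
    uint64_t vpn = vaddr / page_size;
 
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 2;
 
    /* pagemap holds one 64-bit entry per virtual page, indexed by page number. */
    uint64_t entry;
    ssize_t n = pread(fd, &entry, sizeof(entry), (off_t)(vpn * sizeof(entry)));
    close(fd);
    if (n != (ssize_t)sizeof(entry)) return 3;
 
    if (!(entry & (1ULL << 63))) {              /* bit 63: page present in RAM */
        printf("page not present\n");
        return 0;
    }
    uint64_t pfn = entry & ((1ULL << 55) - 1);  /* bits 0-54: frame number */
    if (pfn == 0) {                             /* zeroed for unprivileged readers */
        printf("pfn hidden (CAP_SYS_ADMIN required)\n");
        return 0;
    }
    uint64_t paddr = pfn * page_size + (vaddr & (page_size - 1));
    printf("virt 0x%" PRIx64 " -> phys 0x%" PRIx64 " (pfn 0x%" PRIx64 ")\n",
           vaddr, paddr, pfn);
    return 0;
}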

Since Linux 4.2, reading physical frame numbers from pagemap requires CAP_SYS_ADMIN; unprivileged readers see the frame number field as zero. Leaking physical addresses weakens address randomisation and opens up Rowhammer-like attacks. The restriction is a direct consequence of the memory model hardening that happened across Linux kernels around 2015 to 2018.

The Page Fault

Virtual memory is cheap because most of the time, the translation is already in the TLB or the page tables and the CPU just flows through. The interesting cases are when it does not. Any time a memory access cannot be completed, the CPU raises a page fault, saves the faulting address into register CR2, pushes an error code onto the stack, and jumps into the kernel's page fault handler at the address the IDT (Interrupt Descriptor Table) entry for vector 14 points at.
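The error code pushed by the CPU is a small bitfield, and decoding it shows what the kernel has to work with. This sketch covers only the five lowest bits of the x86 page fault error code; decode_pf_error is an illustrative name, not a kernel function.

```python
def decode_pf_error(code: int) -> dict:
    """Interpret the low bits of an x86 page fault error code."""
    return {
        "cause": "protection violation" if code & 0x01 else "page not present",
        "access": "write" if code & 0x02 else "read",
        "mode": "user" if code & 0x04 else "kernel",
        "reserved_bit_set": bool(code & 0x08),   # malformed page table entry
        "instruction_fetch": bool(code & 0x10),  # fault came from fetching code
    }


print(decode_pf_error(0x6))   # user-mode write to a page that is not present
print(decode_pf_error(0x7))   # user-mode write that violated page permissions
```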

On Linux, control lands in do_page_fault, which reads CR2 and the error code, figures out what was intended, and either fixes up the tables or decides the process must be killed. Broadly, faults come in three flavours.

Hard faults. The page is valid from the kernel's point of view, but its contents are not in RAM and must be fetched from storage. This happens when the page is in swap, or when the page belongs to a memory-mapped file whose contents have not been read yet. The kernel finds a free physical frame, populates it from disk, installs the mapping, and returns from the fault. The instruction retries transparently.

Soft faults. The fault can be resolved without any disk IO. Either the physical page already exists in memory but no page table entry maps it yet, or the page is anonymous memory that was reserved but never touched, in which case the kernel only has to hand over a zeroed frame (anonymous lazy allocation). The first case is common after fork, when the child shares most pages with the parent via copy-on-write, and for file pages already sitting in the page cache: the data is in RAM waiting, and the kernel only needs to create a PTE pointing at it. Soft faults are fast.

Invalid faults. The access is genuinely illegal. Writing to a read-only page, executing a no-execute page, dereferencing an address that was never mapped. These result in SIGSEGV for the process, or a kernel oops if the fault came from kernel code.

A typical fault path looks like this, simplified:

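/* Simplified sketch of the fault path, not the real kernel source. */
void do_page_fault(unsigned long addr, unsigned long error_code)
{
    unsigned int flags = 0;
    struct vm_area_struct *vma = find_vma(current->mm, addr);
 
    /* find_vma returns the first VMA ending above addr, so check the start too. */
    if (!vma || addr < vma->vm_start) {
        bad_area(addr);                 /* unmapped address: SIGSEGV */
        return;
    }
 
    if (error_code & X86_PF_WRITE) {
        if (!(vma->vm_flags & VM_WRITE)) {
            bad_area(addr);             /* write to a read-only mapping */
            return;
        }
        flags |= FAULT_FLAG_WRITE;
    }
 
    if (handle_mm_fault(vma, addr, flags) & VM_FAULT_OOM)
        out_of_memory();
}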

The logic hinges on vm_area_struct, or VMA. Every process has a set of VMAs describing its address space: one for each mmap, one for the heap, one for each stack, one for the vDSO. A VMA has a start address, an end address, a protection mask, and a handler object that knows how to populate pages on demand. When the fault handler looks up addr, it finds the VMA that covers that range, and asks the VMA's fault method to produce a page. For an anonymous VMA the method just allocates a zeroed frame. For a file-backed VMA it reads the relevant offset of the file.
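The VMA lookup can be mimicked from userspace, because /proc/PID/maps prints one line per VMA. The sketch below parses that text format and finds the region covering an address; Vma, parse_maps and find_vma are illustrative names, the sample lines are invented, and the linear scan stands in for the kernel's tree lookup.

```python
from typing import NamedTuple, Optional


class Vma(NamedTuple):
    start: int
    end: int      # exclusive, as in the kernel
    perms: str
    name: str


def parse_maps(text: str) -> list:
    """Parse the text of /proc/PID/maps into a list of Vma tuples."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        name = fields[5].strip() if len(fields) > 5 else ""
        vmas.append(Vma(start, end, fields[1], name))
    return vmas


def find_vma(vmas: list, addr: int) -> Optional[Vma]:
    """Return the VMA covering addr, or None (the SIGSEGV case)."""
    for vma in vmas:
        if vma.start <= addr < vma.end:
            return vma
    return None


SAMPLE = """\
55d0c0a00000-55d0c0a21000 r-xp 00000000 08:01 131 /usr/bin/example
7f3c45a00000-7f3c45c00000 rw-p 00000000 00:00 0
7ffd1c000000-7ffd1c021000 rw-p 00000000 00:00 0 [stack]
"""
print(find_vma(parse_maps(SAMPLE), 0x7F3C45A72820))
```

On a Linux machine, passing the contents of /proc/self/maps to parse_maps gives the real layout of the running interpreter.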

Viewed from userspace, you can watch faults happen with perf:

perf stat -e page-faults,minor-faults,major-faults ./your-program

On a typical startup-heavy workload you will see tens of thousands of minor faults (soft faults, no disk) and a few dozen major faults (hard faults that hit disk). If the major-fault count explodes, your machine is swapping.

Forking And Copy-On-Write

When a process calls fork, the kernel has to create a child process with its own address space that is initially identical to the parent's. Actually copying every physical page would be catastrophic, because fork is often immediately followed by exec, which throws all that work away. The trick Linux has used since the early days is copy-on-write (COW).

At fork time, the kernel duplicates the parent's page tables but marks every private writable page in both parent and child as read-only, and increments each shared physical page's reference count. The parent's TLB entries for those pages are flushed so that no stale writable translation survives. Neither process can write to any of the shared pages without faulting, but reads work fine.

The instant either process writes to a COW page, the CPU raises a page fault. The kernel's fault handler sees that the underlying page is a COW page with a refcount greater than 1, allocates a new physical frame, copies the old page's contents into it, decrements the original's refcount, updates the faulting process's page table to point at the new frame with write permission, and resumes. Only the pages that actually get written are ever copied.
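The bookkeeping is simple enough to model in a few dozen lines. What follows is a toy simulation of the refcount-and-copy logic described above, not kernel code: ToyMemory and ToyProcess are invented classes, "frames" are bytearrays, and the write method plays the role of the fault handler.

```python
class ToyMemory:
    """Physical memory: frame number -> [refcount, contents]."""

    def __init__(self):
        self.frames = {}
        self.next_frame = 0

    def alloc(self, data=b""):
        number = self.next_frame
        self.next_frame += 1
        self.frames[number] = [1, bytearray(data)]   # bytearray() copies the data
        return number


class ToyProcess:
    def __init__(self, mem):
        self.mem = mem
        self.page_table = {}   # virtual page number -> [frame number, writable]

    def map_new(self, vpn, data=b""):
        self.page_table[vpn] = [self.mem.alloc(data), True]

    def fork(self):
        child = ToyProcess(self.mem)
        for vpn, pte in self.page_table.items():
            pte[1] = False                            # parent loses write permission
            child.page_table[vpn] = [pte[0], False]   # child shares the same frame
            self.mem.frames[pte[0]][0] += 1
        return child

    def read(self, vpn):
        return bytes(self.mem.frames[self.page_table[vpn][0]][1])

    def write(self, vpn, data):
        pte = self.page_table[vpn]
        if not pte[1]:                                # "page fault" on a read-only PTE
            old = self.mem.frames[pte[0]]
            if old[0] > 1:                            # still shared: copy the page
                old[0] -= 1
                pte[0] = self.mem.alloc(old[1])
            pte[1] = True                             # sole owner now: allow writes
        self.mem.frames[pte[0]][1][:] = data


mem = ToyMemory()
parent = ToyProcess(mem)
parent.map_new(0, b"hello")
child = parent.fork()
child.write(0, b"world")
print(parent.read(0), child.read(0), len(mem.frames))
```

Note the last branch: once the other side has taken its own copy, the remaining owner's first write only flips the permission back and copies nothing.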

This has strange implications. A forked child that calls exec almost immediately pays very little memory cost, because almost no pages were ever written between fork and exec. But when a forked child stays alive while the parent keeps writing heavily (think a Redis BGSAVE, which forks to snapshot a 20 GB dataset), large fractions of the parent's memory can end up duplicated through COW faults. This is exactly why Redis operators watch the copy-on-write figures Redis reports (such as rdb_last_cow_size in INFO) carefully on busy instances: a fork that takes 200 ms to initiate can keep adding memory overhead for as long as the snapshot child is alive.

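COW also applies to anonymous pages on fork, to private mappings of shared library code (which is never written to at all in practice), and to the zero page. When you malloc a megabyte and never touch it, you get virtual memory but no physical memory. A large calloc typically behaves the same way: the allocator gets fresh pages from the kernel, which are guaranteed to read as zero, so nothing has to be cleared up front. Reading an untouched anonymous page maps the single shared zero page read-only, so a freshly read megabyte is 256 page table entries all pointing at the same physical frame; the first write to any of them faults in a real zeroed frame.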

Demand Paging And The Page Cache

Linux is relentlessly lazy about allocating physical memory. When you mmap a file, almost nothing happens. The kernel records the file and the range in a new VMA, installs no page table entries, and returns. When your program reads byte 0 of that region, the CPU raises a page fault. The fault handler walks to the filesystem layer, reads the file's first 4 KiB into a fresh physical frame, installs a PTE, and lets the instruction retry.

The physical frames holding file data are part of the page cache. Every file read, whether via read() or mmap, goes through it. If the same data is requested again, it is already in RAM and no disk IO is needed. If multiple processes map the same file, they all point at the same physical frames in the page cache. This is exactly how shared libraries end up being shared: every process that loads libc.so.6 gets its text pages mapped to the same handful of physical frames, read-only. On a workstation running ten applications, libc may physically exist only once in RAM while being visible in ten different virtual address spaces.

The page cache is also what free -h means by buff/cache. On a Linux machine that has been up for a while, the cache will expand to fill nearly all available RAM, which sometimes panics users coming from Windows. It should not panic anyone. The cache is reclaimable: if a new allocation needs memory, the kernel simply drops clean page cache entries. Dirty pages (modified file data that has not been flushed to disk yet) must be written back first, which is what the background writeback threads do.

Page reclaim has to choose between the page cache and anonymous memory. When RAM gets tight, Linux's page reclaim logic scans the LRU lists looking for pages to evict. Clean file-backed pages are the cheapest to evict because they can be reread from disk. Dirty file pages require writeback first. Anonymous pages (heap, stack) require being written to swap. The decision of which to prefer is controlled by /proc/sys/vm/swappiness: 0 tells the kernel to drop file cache and avoid swapping for as long as it can, 100 treats the two kinds of page as equally costly to bring back, and kernels since 5.8 accept values up to 200 to favour swapping anonymous memory out. Defaults vary by distribution, with 60 being traditional.

A short demonstration of demand paging:

dd if=/dev/urandom of=/tmp/big.bin bs=1M count=1024
sync
echo 3 > /proc/sys/vm/drop_caches   # clear the page cache
 
perf stat -e page-faults,minor-faults,major-faults \
  python3 -c 'import mmap
f = open("/tmp/big.bin", "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
total = 0
for i in range(0, len(m), 4096):
    total += m[i]
print(total)'

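The first run will show a large number of major faults, because the data has to come from disk. The count is usually well below the file size in pages (262,144 here), since kernel readahead pulls in many pages per fault. The second run, without dropping caches, shows almost only minor faults, because every page is already sitting in the page cache from the first run. Writing to drop_caches requires root, and if /tmp is a tmpfs on your distribution you should put the file on a disk-backed filesystem instead, because tmpfs contents live in RAM and are not evicted by drop_caches.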

Huge Pages

TLB reach is the dominant constraint on random-access throughput for large working sets. A process that walks a 32 GiB hash table with 4 KiB pages needs a separate TLB entry for every 4 KiB it touches, and if its working set is larger than 8 MiB (2,048 entries times 4 KiB), nearly every access will miss the L2 TLB and trigger a page walk. Page walks are cacheable, but the overhead is real and adds up.

The fix is to make pages bigger. x86_64 supports 2 MiB pages (at the PD level, by setting bit 7 and making the PD entry point directly at data instead of a PT) and 1 GiB pages (at the PDPT level). One 2 MiB entry in the L2 TLB covers 512 times as much memory as one 4 KiB entry. One 1 GiB entry covers 262,144 times as much. A modest 128-entry 2 MiB TLB covers 256 MiB without a single miss.
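To see what bigger pages buy for a concrete workload, it helps to count how many translations a working set needs at each page size. A small sketch, with entries_needed as an illustrative helper name:

```python
KIB, MIB, GIB = 1 << 10, 1 << 20, 1 << 30
PAGE_SIZES = {"4 KiB": 4 * KIB, "2 MiB": 2 * MIB, "1 GiB": GIB}


def entries_needed(working_set: int) -> dict:
    """Leaf translations required to cover working_set bytes at each page size."""
    return {name: -(-working_set // size) for name, size in PAGE_SIZES.items()}


# The 32 GiB hash table from the text.
for name, count in entries_needed(32 * GIB).items():
    print(f"{name:>6} pages: {count:>9,} translations")
```

The ceiling division matters for working sets that are not a multiple of the page size: a 3 MiB buffer still needs two 2 MiB translations.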

Linux supports huge pages through two mechanisms.

HugeTLB pages are a legacy interface. You reserve a fixed pool at boot with hugepagesz=2M hugepages=1024 on the kernel command line, or at runtime via /proc/sys/vm/nr_hugepages. Applications allocate from the pool with mmap(..., MAP_HUGETLB) or shmget(..., SHM_HUGETLB). Databases like PostgreSQL, Oracle, and SAP HANA explicitly request hugetlb because they know their memory requirements in advance and cannot tolerate fragmentation surprises. A Frankfurt-based bank running Oracle on Linux will often have 80% of RAM reserved as 1 GiB pages at boot.

Transparent Huge Pages (THP) are more ergonomic. The kernel automatically tries to back 2 MiB-aligned virtual regions with 2 MiB physical pages, and quietly promotes or demotes pages based on fragmentation and pressure. You enable it with /sys/kernel/mm/transparent_hugepage/enabled, which accepts always, madvise, or never. Most distributions ship with madvise: THP applies only to regions that called madvise(addr, len, MADV_HUGEPAGE). Some distributions use always, which gives invisible wins to any big allocation but can cause stalls under memory pressure when the kernel tries to compact memory to form contiguous 2 MiB regions. Some latency-sensitive systems set it to never to avoid exactly those stalls.
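Opting a region in under the madvise policy takes one call. The sketch below uses Python's mmap module, which exposes madvise from Python 3.8 on; map_with_thp_hint is an illustrative name, the MADV_HUGEPAGE constant only exists on Linux builds, and the hint is just a request that the kernel is free to ignore.

```python
import mmap


def map_with_thp_hint(size: int):
    """Create a private anonymous mapping and ask for THP backing.

    Returns (mapping, hinted); hinted is False when the platform or kernel
    does not support the hint.
    """
    m = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    hinted = False
    advice = getattr(mmap, "MADV_HUGEPAGE", None)   # absent outside Linux
    if advice is not None and hasattr(m, "madvise"):
        try:
            m.madvise(advice)
            hinted = True
        except OSError:
            pass   # kernel built without THP, or THP disabled for this mapping
    return m, hinted


region, hinted = map_with_thp_hint(8 * 1024 * 1024)
region[0] = 1   # the first touch is what actually faults memory in
print("hint accepted:", hinted)
region.close()
```

Whether the region really ends up on 2 MiB pages can be checked afterwards through the AnonHugePages counters in /proc.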

You can check THP effectiveness with:

cat /sys/kernel/mm/transparent_hugepage/enabled
grep AnonHugePages /proc/meminfo
grep -e HugePages_ -e Huge /proc/meminfo

And per-process:

grep -e AnonHugePages /proc/$PID/smaps | awk '{sum+=$2} END {print sum, "kB"}'

The TLB benefit of huge pages on large working sets is often measured in tens of percent of total throughput, which is why every major database engine on Linux supports them and most recommend enabling them.

mmap Versus read

A long-running debate in systems programming is whether to do file IO with read/write or with mmap. Virtual memory is what the debate is really about.

read() copies bytes from the kernel's page cache into a buffer in your address space. Every read that misses the cache is two passes over the data: one from disk into the page cache, one from the page cache into your buffer. For small reads this is fine. For large files the extra copy doubles the memory bandwidth spent.

mmap() maps the file's page cache directly into your address space. Your loads hit the same physical frames that the kernel holds for caching. There is no copy. For streaming reads this is a clear win. But mmap has its own costs: every touched page is a potential minor fault, every fault involves kernel work, and the whole mapping participates in the system's page reclaim policy. A process mapping a file larger than RAM will see the kernel pushing file pages in and out in response to memory pressure. You cannot prefetch with simple system calls the way you can with read; you must use madvise(MADV_WILLNEED) or readahead().
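Both access styles are easy to put side by side. This is a minimal sketch rather than a benchmark: checksum_read and checksum_mmap are illustrative names, the sample file is generated on the spot, and the MADV_WILLNEED hint is only issued where the platform provides it.

```python
import mmap
import os
import tempfile


def checksum_read(path: str, chunk: int = 1 << 20) -> int:
    """Sum all bytes using read(): each chunk is copied out of the page cache."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += sum(buf)
    return total


def checksum_mmap(path: str) -> int:
    """Sum all bytes through a mapping: loads hit the page cache frames directly."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        if hasattr(m, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
            m.madvise(mmap.MADV_WILLNEED)   # ask the kernel to start reading ahead
        with memoryview(m) as view:         # released before the mapping is closed
            return sum(view)


fd, path = tempfile.mkstemp()
os.write(fd, bytes(range(256)) * 4096)      # 1 MiB of sample data
os.close(fd)
print(checksum_read(path), checksum_mmap(path))
os.remove(path)
```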

Databases have historically done both. SQLite reads. MySQL InnoDB reads. PostgreSQL reads (but keeps its own buffer pool and leans on the kernel for everything else). LMDB is the famous mmap-only outlier. Cassandra and Lucene lean heavily on mmap for their read paths. There is no universally right answer: the choice depends on access patterns, working set sizes, and how much you trust the kernel's eviction policy versus your own.

Swapping And The Memory Pressure Mechanism

When free memory drops below a watermark, Linux starts reclaiming. The watermarks are defined per memory zone in /proc/zoneinfo (min, low, high). When free memory hits low, the background reclaim thread kswapd starts scanning LRU lists looking for pages to evict. When free memory hits min, reclaim becomes synchronous, and any allocation attempt will first do direct reclaim before returning.

The LRU machinery keeps two lists per node: an active list and an inactive list, each for anonymous and file pages. Pages start on the inactive list. If they are accessed again while on the inactive list, they are promoted to the active list. Pages on the active list that age out are demoted back. When the scanner needs victims, it pulls from the inactive lists. This two-list scheme is a crude approximation of "second chance" that tries to distinguish recently-used-twice from recently-used-once pages.
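The promotion and demotion rules fit in a short toy model. The class below is a sketch of the two-list idea only; TwoListLRU is an invented name, and the real kernel tracks references through the accessed bit and scans in batches rather than reacting to every access.

```python
from collections import OrderedDict


class TwoListLRU:
    """Toy active/inactive lists; the oldest page sits at the front of each."""

    def __init__(self):
        self.inactive = OrderedDict()
        self.active = OrderedDict()

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)    # still hot: refresh its position
        elif page in self.inactive:
            del self.inactive[page]          # second reference: promote
            self.active[page] = True
        else:
            self.inactive[page] = True       # first reference: start on inactive

    def age(self, count=1):
        """Demote the oldest active pages back to the inactive list."""
        for _ in range(min(count, len(self.active))):
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = True

    def evict(self):
        """Reclaim the oldest inactive page, or return None if there is none."""
        if not self.inactive:
            return None
        page, _ = self.inactive.popitem(last=False)
        return page


lru = TwoListLRU()
for page in ["a", "b", "c", "a"]:
    lru.touch(page)
print(lru.evict())   # "b": touched once and oldest on the inactive list
```

A page read once by a large sequential scan never reaches the active list, which is the property the two-list design is after.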

Swap is the place anonymous pages go when they are evicted. A swap partition or swap file is registered with swapon. When an anonymous page is evicted, it is written to a slot in the swap area, and its page table entry is replaced with a special swap entry that encodes the swap slot number. On the next access to that virtual address, the page fault handler sees the swap entry, reads the slot back from disk into a fresh frame, installs a regular PTE, and resumes the process.

Swap was traditionally a sign of imminent disaster: once you started swapping, you were thrashing. Modern machines with NVMe drives (we covered their internals in the SSD article) have made swap surprisingly usable again, because random 4 KiB reads from an NVMe SSD complete in tens of microseconds rather than the tens of milliseconds of a spinning disk. On a machine with fast storage, a modest amount of swap gives the kernel room to evict truly cold pages and keep active pages in RAM, often improving rather than hurting performance.

zswap and zram are memory-efficient alternatives to on-disk swap. zswap compresses anonymous pages and keeps them in a memory pool until pressure forces them to disk. zram exposes compressed RAM as a block device, which is then used as the swap device. Chromebooks, Raspberry Pi OS, and Fedora on laptops all default to some combination these days, because compressing pages is cheaper than moving them to disk, and the compression ratios on typical heap memory are 2x to 4x.

Dirty Pages And Writeback

File-backed pages in the page cache can be clean (matches what is on disk) or dirty (modified but not yet flushed). Clean pages are easy to evict: drop them, the next access rereads from disk. Dirty pages must be written back first, because evicting them would lose the writes.

Linux runs a set of writeback workqueues that periodically flush dirty pages. Two knobs in /proc/sys/vm control the policy: dirty_background_ratio (the percentage of memory at which background writeback begins) and dirty_ratio (the percentage at which writing processes are throttled and forced to do writeback synchronously). Defaults are often 10% and 20% respectively, measured against available (free plus reclaimable) memory rather than total RAM. On a mostly idle 64 GiB workstation this means the kernel can accumulate roughly 12 GiB of dirty pages before any writing thread starts noticing, which on a slow disk can produce long, surprising stalls.
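Turning the percentages into bytes makes the scale obvious. A small sketch, assuming the ratios apply to a figure you supply for available memory; dirty_thresholds is an illustrative helper, not a kernel interface.

```python
GIB = 1 << 30


def dirty_thresholds(available_bytes: int, background_ratio: int = 10,
                     dirty_ratio: int = 20) -> tuple:
    """Return (background writeback start, writer throttling point) in bytes."""
    background = available_bytes * background_ratio // 100
    throttle = available_bytes * dirty_ratio // 100
    return background, throttle


for ratios in [(10, 20), (2, 5)]:
    background, throttle = dirty_thresholds(64 * GIB, *ratios)
    print(ratios, f"{background / GIB:.1f} GiB", f"{throttle / GIB:.1f} GiB")
```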

High-throughput services running on spinning disks learned long ago to lower these ratios aggressively, to smooth out the write latency tail rather than letting writeback crash through in multi-gigabyte bursts. A Stockholm fintech running Kafka brokers on bulky HDDs typically sets both ratios to single-digit percentages and relies on the background flusher running almost continuously.

The OOM Killer

What if there is nothing left to evict? If every anonymous page is swapped, every file page is either dirty or pinned, and a new allocation arrives that cannot be satisfied, the kernel has one last move. It invokes the Out-Of-Memory killer.

The OOM killer scans all processes, computes a score for each, and kills the highest-scoring one. The score is based on the process's memory footprint (resident set size plus its swap and page table usage), shifted by an adjustment factor (/proc/PID/oom_score_adj) that administrators and container runtimes can tweak. Killing the scoring winner releases its memory and the allocation that triggered the scan can proceed.

OOM kills are visible in dmesg with a characteristic "Out of memory: Killed process" message plus a dump of memory zones and process RSS. They are the reason container orchestrators like Kubernetes set cgroup memory limits: killing a runaway pod is infinitely preferable to letting it force the OOM killer to pick an arbitrary innocent process on the node.

You can tune OOM behaviour per process via oom_score_adj, or change the overcommit policy for the whole system with /proc/sys/vm/overcommit_memory:

0 (heuristic): allow overcommit up to an estimated safe level.
1 (always overcommit): never refuse malloc based on available memory. Often recommended for systems that fork large processes, Redis being the classic example.
2 (strict): refuse allocations beyond swap + overcommit_ratio% of RAM. This turns OOM into allocation failures, which is what servers with careful memory budgets want.

Databases and JVMs often set strict overcommit so allocation failures become explicit errors rather than surprise OOM kills later.
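The strict-mode limit is plain arithmetic. The sketch below follows the formula above (swap plus overcommit_ratio percent of RAM), with RAM reserved for hugetlb pools excluded; commit_limit and would_allow are illustrative names, and the default ratio of 50 is the usual kernel default.

```python
GIB = 1 << 30


def commit_limit(ram_bytes: int, swap_bytes: int, overcommit_ratio: int = 50,
                 hugetlb_bytes: int = 0) -> int:
    """Commit limit used when vm.overcommit_memory is 2."""
    return (ram_bytes - hugetlb_bytes) * overcommit_ratio // 100 + swap_bytes


def would_allow(committed: int, request: int, limit: int) -> bool:
    """Strict mode refuses any allocation that pushes committed memory past the limit."""
    return committed + request <= limit


limit = commit_limit(16 * GIB, 4 * GIB)
print(limit // GIB, "GiB")                      # 12 GiB on a 16 GiB box with 4 GiB swap
print(would_allow(11 * GIB, 2 * GIB, limit))    # refused, although RAM may still be free
```

On a live system the same figures appear as CommitLimit and Committed_AS in /proc/meminfo.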

Kernel Address Space And KPTI

For most of Linux's history, the kernel mapped itself into the top half of every process's page table. User code at 0x7fff... and kernel code at 0xffff800... shared the same CR3, with the kernel pages marked user-inaccessible via the U/S bit. This is fast: a system call is just a flag flip, no CR3 write, no TLB flush, no cold cache. The cost of a syscall was under a hundred nanoseconds.

Then Meltdown happened.

In January 2018, researchers disclosed that on nearly every Intel x86_64 CPU of the previous decade, a userspace speculative execution window could be used to read kernel memory through a cache side channel, despite the U/S bit forbidding it. The bit was checked, but not before the speculative load had already pulled the secret byte into a cacheline in a fashion that observably influenced subsequent timing. In practical terms, any user process could read kernel memory (and therefore other processes' memory through the kernel's direct map) at multi-kilobyte-per-second rates.

The emergency mitigation was Kernel Page Table Isolation (KPTI). With KPTI, the kernel maintains two different page tables per process. One for user mode contains only a minimal trampoline section of the kernel needed for entering the kernel on syscall or interrupt. One for kernel mode contains the full kernel mapping. On every transition between user and kernel, CR3 is switched, which means writing a new top-level page table base and losing most TLB entries that belonged to the old context.

KPTI restored the memory isolation that the U/S bit used to provide. The cost was an immediate 5% to 30% slowdown on syscall-heavy workloads, depending on the generation of the CPU. On Meltdown-vulnerable Intel chips with good PCID support (PCID together with the INVPCID instruction, which arrived with Haswell), the impact is smaller because PCID-based TLB tagging lets the CPU avoid full TLB flushes on CR3 switches. On older vulnerable chips where the kernel cannot lean on PCID this way, the slowdown was significant and very visible in workloads like heavy database inserts.

AMD Zen chips were never vulnerable to Meltdown and do not need KPTI. Modern Intel chips (Tiger Lake and later) have hardware fixes and also skip KPTI. On older Intel, KPTI is mandatory, and you can see it in dmesg: "Kernel/User page tables isolation: enabled". Watching KPTI appear in distribution kernels in January 2018 was a master class in how tightly coupled operating system performance and CPU architecture are.

Cgroup Memory Limits

On a busy Linux server, "how much memory does this machine have" is rarely the right question. The right question is "how much memory does this container have". The answer comes from cgroups.

The memory controller (memcg) in cgroup v2 lets you cap a group of processes at a fixed amount of RAM, or give them a soft target, or put a high-water mark at which reclaim kicks in without triggering OOM. The three knobs per cgroup are memory.max (hard limit, hit it and the OOM killer runs inside the group), memory.high (throttle limit, the kernel starts aggressive reclaim and slows the group down when you exceed it), and memory.low (best-effort protection: as long as the group's usage stays below this value, its pages are shielded from reclaim triggered by other groups).

Inside a memcg, the kernel maintains independent LRU lists and independent reclaim state. A process running in a 1 GiB cgroup on a 512 GiB machine feels memory pressure as soon as its group approaches 1 GiB, no matter how much of the machine is free. If it forks a child that copies-on-write too much, the child is charged to the same cgroup, so only that cgroup feels the OOM pain, and the orchestrator reschedules. This is the fundamental building block that lets Kubernetes pack dozens of workloads onto a single node without one noisy neighbour dragging the rest down.
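How the three knobs interact can be sketched as a small classifier. This is a simplified model under stated assumptions, not the kernel's logic: the function names and the returned labels are hypothetical, and real reclaim decisions depend on much more than four numbers. The one format detail taken from cgroup v2 is that limit files contain either a byte count or the literal string "max".

```python
from typing import Optional


def parse_limit(raw: str) -> Optional[int]:
    """cgroup v2 limit files hold a byte count or the literal 'max' (no limit)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def memcg_state(current: int, low: int,
                high: Optional[int], hard: Optional[int]) -> str:
    """Rough label for where a group sits relative to its knobs (illustrative)."""
    if hard is not None and current >= hard:
        return "at-max"       # memory.max reached: reclaim, then OOM kill in the group
    if high is not None and current > high:
        return "throttled"    # above memory.high: aggressive reclaim and throttling
    if low > 0 and current <= low:
        return "protected"    # within memory.low: shielded from outside reclaim
    return "normal"
```

For example, with memory.max set to 1 GiB and memory.high left at "max", a group using 512 MiB is "normal" and a group using the full 1 GiB is "at-max".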

Inspecting a running memcg is straightforward:

cat /sys/fs/cgroup/mygroup/memory.current    # bytes in use
cat /sys/fs/cgroup/mygroup/memory.max        # hard limit
cat /sys/fs/cgroup/mygroup/memory.events     # oom, oom_kill, high, max, low counts
cat /sys/fs/cgroup/mygroup/memory.stat       # detailed counters per memory type

memory.stat breaks usage down into anon, file, kernel, percpu, sock, shmem, anon_thp, and many others. It is the cgroup analogue of /proc/meminfo and usually the first place to look when a container is mysteriously getting OOM-killed while still appearing "mostly empty" from inside.
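Both memory.stat and memory.events use the same flat "key value" layout, one pair per line, so a single parser covers them. The following is a minimal sketch; the helper names and the sample numbers are made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def parse_flat_keyed(text: str) -> Dict[str, int]:
    """Parse 'key value' lines as found in memory.stat and memory.events."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def top_consumers(stats: Dict[str, int],
                  keys: Sequence[str] = ("anon", "file", "kernel", "shmem", "sock", "percpu"),
                  n: int = 3) -> List[Tuple[str, int]]:
    """Return the n largest of the chosen top-level counters, biggest first."""
    present = [(k, stats[k]) for k in keys if k in stats]
    return sorted(present, key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical memory.stat excerpt (values in bytes).
SAMPLE_STAT = """\
anon 524288000
file 104857600
kernel 20971520
shmem 0
"""

if __name__ == "__main__":
    for name, value in top_consumers(parse_flat_keyed(SAMPLE_STAT)):
        print(f"{name:8s} {value / (1 << 20):8.1f} MiB")
```

Feeding it the real contents of /sys/fs/cgroup/mygroup/memory.stat quickly shows whether a container's usage is dominated by anonymous memory, page cache, or kernel allocations.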

ASLR And The vDSO

Two small but important features make the virtual layout of a process unpredictable and fast at the same time.

Address Space Layout Randomisation shuffles where things live in virtual memory on every new process. The stack, the heap (brk), the mmap region where libc.so.6, ld-linux-x86-64.so.2, and anonymous mappings go, and with PIE binaries the text segment itself: all of them start at a randomised offset chosen at exec time. The randomness comes from a kernel-side random number generator seeded at boot. On x86_64 Linux, mmap regions get 28 bits of entropy by default (the vm.mmap_rnd_bits sysctl, which can be raised to 32), which is enough that guessing any particular pointer's address is hopeless.

ASLR is not a silver bullet: information leaks can disclose one pointer, and from one pointer most of the rest of the address space falls out. But it turned dozens of classes of buffer overflow bugs from instant exploits into "need a separate info leak first", which is a substantial improvement. You can inspect ASLR at runtime via /proc/sys/kernel/randomize_va_space, which should be 2 on any modern kernel.
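To put the entropy figure in perspective, the arithmetic can be written out. This is a back-of-the-envelope sketch assuming 4 KiB pages and page-granular randomisation; the function names are ours, and the dictionary paraphrases the documented meanings of the three randomize_va_space values.

```python
PAGE_SIZE = 4096  # assuming 4 KiB base pages


def mmap_slide_range(entropy_bits: int, page_size: int = PAGE_SIZE) -> int:
    """Bytes spanned by all possible page-aligned random offsets."""
    return (1 << entropy_bits) * page_size


def guess_probability(entropy_bits: int) -> float:
    """Chance that a single blind guess hits the right offset."""
    return 1.0 / (1 << entropy_bits)


# Meaning of /proc/sys/kernel/randomize_va_space (paraphrased).
RANDOMIZE_VA_SPACE = {
    0: "randomisation disabled",
    1: "stack, mmap base and vDSO randomised",
    2: "as 1, plus the brk heap",
}

if __name__ == "__main__":
    span = mmap_slide_range(28)
    print(f"28 bits -> {span // (1 << 40)} TiB of possible base offsets")
    print(f"single-guess probability: {guess_probability(28):.2e}")
```

With 28 bits and 4 KiB pages, the mmap base can land anywhere in a 1 TiB window, which is why a blind guess is hopeless and an information leak is needed first.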

The vDSO (virtual Dynamic Shared Object) is a small shared library that the kernel maps into every process at a random address. It contains fast-path implementations of a handful of system calls: gettimeofday, clock_gettime, getcpu, and time. These calls do not have to trap into the kernel; they read a data page that the kernel keeps up to date and compute the answer in userspace.

Without the vDSO, a well-written C program calling clock_gettime millions of times per second would trap into the kernel millions of times per second. With the vDSO, each call is a few tens of nanoseconds of userspace arithmetic. You can see the vDSO in /proc/$PID/maps as a line ending in [vdso], next to its companion data mapping [vvar], which holds the values the kernel keeps up to date. It is one of the quieter, nicer pieces of Linux engineering.
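Locating those two mappings programmatically is a matter of scanning the maps text for the bracketed names. The sketch below works on a string so it can be tried anywhere; the sample addresses are invented, and the function name is hypothetical.

```python
from typing import Dict, Sequence, Tuple


def special_mappings(maps_text: str,
                     names: Sequence[str] = ("[vdso]", "[vvar]")) -> Dict[str, Tuple[int, int]]:
    """Return {name: (start_address, size_in_bytes)} for the named mappings."""
    found: Dict[str, Tuple[int, int]] = {}
    for line in maps_text.splitlines():
        fields = line.split()
        # maps columns: range, perms, offset, dev, inode, optional pathname
        if len(fields) >= 6 and fields[5] in names:
            start_hex, end_hex = fields[0].split("-")
            start, end = int(start_hex, 16), int(end_hex, 16)
            found[fields[5]] = (start, end - start)
    return found


# Invented excerpt in the /proc/<pid>/maps format.
SAMPLE_MAPS = """\
7ffd5a3ee000-7ffd5a3f2000 r--p 00000000 00:00 0                          [vvar]
7ffd5a3f2000-7ffd5a3f4000 r-xp 00000000 00:00 0                          [vdso]
"""

if __name__ == "__main__":
    for name, (start, size) in special_mappings(SAMPLE_MAPS).items():
        print(f"{name} at {start:#x}, {size // 1024} KiB")
```

Running it against the real /proc/self/maps of two different processes shows the vDSO at two different addresses, which is ASLR at work.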

SMEP, SMAP, And Userfaultfd

Two CPU features called SMEP (Supervisor Mode Execution Prevention) and SMAP (Supervisor Mode Access Prevention) tighten kernel-user isolation further than the ancient U/S bit. SMEP forbids the kernel from executing code that lives on user pages; without it, a kernel bug that jumped to a user-controlled address could run attacker-supplied code in kernel mode. SMAP forbids the kernel from reading or writing user pages except inside specific accessor functions (copy_from_user, copy_to_user) that bracket access with the stac/clac instructions. Both are on by default on any CPU that supports them.

Userfaultfd is the other direction: it lets userspace handle its own page faults. A process creates a userfaultfd, registers a virtual address range with it, and when another thread in that process (or another process sharing the memory) touches an unmapped page in the range, the kernel sends a fault notification to the userfaultfd instead of handling it internally. The handler can then populate the page by any means it likes (reading from a network, decompressing, copying from another region) and tell the kernel to resume the faulting thread.

This mechanism powers post-copy live migration in QEMU/KVM, lazy restore in CRIU, and a handful of advanced userspace page managers. It has also been misused: a number of Linux kernel exploits published in the years after its introduction relied on userfaultfd to stall a kernel thread mid-race in exactly the right place. Since Linux 5.11, the vm.unprivileged_userfaultfd sysctl defaults to 0, so unprivileged processes can only handle faults that originate in user mode.

Memory Zones And NUMA

A modern server with two Intel Sapphire Rapids sockets has memory attached to each socket's memory controller. A load that goes to the local socket's DRAM takes perhaps 80 nanoseconds. A load that has to cross the CPU interconnect to reach the other socket's DRAM takes 140 nanoseconds or more. This is NUMA: Non-Uniform Memory Access.

Linux models each socket's memory as a NUMA node, and each node is in turn divided into zones. Each node has its own free lists, its own LRU lists, its own kswapd. The scheduler tries to place processes near their memory, and the allocator tries to allocate pages on the node where the allocating thread is currently running. You can inspect the layout with:

numactl --hardware
cat /proc/$PID/numa_maps

On a dual-socket workstation in an Amsterdam rendering house, a Blender render farm node might be pinned to a single NUMA node with numactl --cpunodebind=0 --membind=0 blender ... to guarantee memory locality. On a database server, the DBA will tune the BIOS interleaving settings and the Linux numad daemon to avoid cross-socket traffic.
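Whether pinning actually worked can be checked from numa_maps, where each mapping line carries fields of the form N0=..., N1=... giving the number of pages resident on each node. The sketch below totals them; the sample lines are invented, and the function names are ours.

```python
import re
from collections import Counter
from typing import Dict

_NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_text: str) -> Dict[int, int]:
    """Sum the N<node>=<pages> fields across all mappings."""
    totals: Counter = Counter()
    for line in numa_maps_text.splitlines():
        for field in line.split():
            match = _NODE_FIELD.match(field)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


def remote_fraction(totals: Dict[int, int], local_node: int) -> float:
    """Share of resident pages that live on a node other than local_node."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - totals.get(local_node, 0) / total


# Invented excerpt in the /proc/<pid>/numa_maps format.
SAMPLE_NUMA_MAPS = """\
7f3c00000000 default anon=1024 dirty=1024 N0=1000 N1=24 kernelpagesize_kB=4
7f3c10000000 default file=/usr/lib/libc.so.6 mapped=500 N0=500 kernelpagesize_kB=4
"""

if __name__ == "__main__":
    totals = pages_per_node(SAMPLE_NUMA_MAPS)
    print(totals, f"remote: {remote_fraction(totals, 0):.1%}")
```

A process bound with --membind=0 should show a remote fraction at or near zero; a noticeably larger value suggests the binding was applied after memory had already been touched elsewhere.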

Even on a single-socket machine, the page allocator carves RAM into zones: DMA (legacy 16 MiB at the bottom, for ancient ISA devices), DMA32 (bottom 4 GiB, for 32-bit devices), Normal (the bulk of RAM), and on huge machines sometimes Movable. Each zone has independent watermarks and reclaim logic. You can see them in /proc/zoneinfo, and on a healthy system, most of the interesting action is in Normal.
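The zone boundaries are just thresholds on the physical address, which a few lines make concrete. This sketch covers only the three address-defined zones named above for x86_64 and deliberately ignores Movable, which is a policy zone rather than an address range; the function name is hypothetical.

```python
MIB = 1 << 20
GIB = 1 << 30

# Exclusive upper bounds for the address-defined x86_64 zones.
ZONE_LIMITS = [("DMA", 16 * MIB), ("DMA32", 4 * GIB)]


def zone_for(phys_addr: int) -> str:
    """Name the zone a physical address would fall into (Movable ignored)."""
    for name, limit in ZONE_LIMITS:
        if phys_addr < limit:
            return name
    return "Normal"


if __name__ == "__main__":
    for addr in (0x1000, 64 * MIB, 8 * GIB):
        print(f"{addr:#014x} -> {zone_for(addr)}")
```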

What Your Pointers Actually Are

Coming full circle: when a C program on Linux does

char *buf = malloc(64 * 1024);
buf[0] = 'A';

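the malloc call goes to glibc's ptmalloc, which may or may not ask the kernel for memory. If the existing heap has no room, ptmalloc calls brk to grow the heap VMA, or mmap to create a fresh anonymous VMA. For a 64 KiB request in the main arena that normally means brk, because glibc's default mmap threshold is 128 KiB; larger requests, and allocations in secondary arenas, end up in mmap-backed memory. Either way, no physical memory is allocated for the buffer's pages yet. The pointer is a virtual address inside a VMA with no backing.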

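When you write the first byte, the CPU tries to translate the virtual address. Take an address in the mmap region as the example, where PML4 entry 254 covers 0x7f0000000000 to 0x7f7fffffffff. That entry is already populated, because the kernel filled it in when earlier mappings in the region (libc, for instance) were first touched; the PDPT exists; the PD entry is populated because something nearby in the same 2 MiB region has already been faulted in; but the PT entry for this specific page is absent. The CPU raises a page fault. The kernel allocates a physical frame (maybe from the per-CPU pageset cache, which keeps hot frames warm), zeroes it, installs a PT entry pointing at it with write permission, and returns. No TLB flush is needed, because x86 does not cache not-present translations. The faulting store instruction re-executes, the page walk now succeeds and the translation is inserted into the TLB, and the byte lands in the cache hierarchy on its way to DRAM.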
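The indices the walk uses are plain bit fields of the virtual address: four 9-bit table indices and a 12-bit page offset. A minimal sketch, assuming four-level paging with 4 KiB pages (the function name is ours):

```python
from typing import Dict


def split_virtual_address(va: int) -> Dict[str, int]:
    """Split a 48-bit x86_64 virtual address into page-walk indices."""
    return {
        "pml4": (va >> 39) & 0x1FF,   # bits 47..39
        "pdpt": (va >> 30) & 0x1FF,   # bits 38..30
        "pd": (va >> 21) & 0x1FF,     # bits 29..21
        "pt": (va >> 12) & 0x1FF,     # bits 20..12
        "offset": va & 0xFFF,         # bits 11..0
    }


if __name__ == "__main__":
    for name, value in split_virtual_address(0x7F3C00001234).items():
        print(f"{name:6s} {value}")
```

For 0x7f3c00001234 this yields PML4 index 254, PDPT index 240, PD index 0, PT index 1, and offset 0x234, which is why neighbouring pages in the same mapping share every level of the tree except the last.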

Every subsequent store to another byte of the same 4 KiB page uses that PTE and sees no fault. Every store crossing into a new 4 KiB page raises another fault.

If you then do

char *buf2 = malloc(64 * 1024);
memset(buf2, 'B', 64 * 1024);
free(buf);

the memset causes 16 more faults (64 KiB divided by 4 KiB), one per new page, give or take one depending on how the buffer is aligned. The free may or may not return memory to the kernel. Modern ptmalloc uses an arena-per-thread design and only releases memory back to the kernel when certain thresholds are crossed: by shrinking the brk heap, by munmap for mmap-backed chunks, or by madvise(MADV_DONTNEED). Smaller frees are held in freelists for reuse, which keeps allocation latency low but can create the illusion of memory leaks: RSS goes up, rarely comes down, even though the heap is mostly empty.
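The fault count is simply the number of distinct pages a byte range overlaps. A small sketch of that arithmetic, assuming 4 KiB pages and that none of the pages has been touched before (the function name is ours):

```python
def pages_touched(start: int, length: int, page_size: int = 4096) -> int:
    """Number of distinct pages overlapped by [start, start + length)."""
    if length <= 0:
        return 0
    first = start // page_size
    last = (start + length - 1) // page_size
    return last - first + 1


if __name__ == "__main__":
    print(pages_touched(0x7F3C00000000, 64 * 1024))   # page-aligned buffer
    print(pages_touched(0x7F3C00000010, 64 * 1024))   # buffer offset by 16 bytes
```

A page-aligned 64 KiB buffer overlaps 16 pages; the same buffer shifted by 16 bytes straddles 17, so an unaligned allocation can cost one extra first-touch fault.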

MADV_DONTNEED tells the kernel that a range is not needed right now, and it should throw away the physical frames backing it. The next access will fault fresh zeros in. Some garbage collectors (notably Go's) use this aggressively: their heap is large but the OS-visible RSS can shrink when the GC decides it no longer needs certain spans. Other allocators use MADV_FREE, which is similar but lazier: the kernel may reclaim the pages only under pressure and otherwise keep the contents available.
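The zero-fill behaviour of MADV_DONTNEED can be observed directly. This sketch assumes Linux and Python 3.8 or newer (where mmap objects have a madvise method); on other platforms it returns None rather than guessing. It uses a private anonymous mapping on purpose, because a shared mapping would keep its contents. The function name is ours.

```python
import mmap
import sys
from typing import Optional


def dontneed_zeroes_page() -> Optional[bool]:
    """True if a private anonymous page reads back as zero after MADV_DONTNEED.

    Returns None where the demonstration does not apply.
    """
    if not sys.platform.startswith("linux"):
        return None
    if not hasattr(mmap, "MADV_DONTNEED") or not hasattr(mmap.mmap, "madvise"):
        return None
    region = mmap.mmap(-1, mmap.PAGESIZE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        region[0] = ord("B")                   # first touch: frame allocated
        region.madvise(mmap.MADV_DONTNEED)     # frame handed back to the kernel
        return region[0] == 0                  # next touch faults in a zero page
    finally:
        region.close()


if __name__ == "__main__":
    print(dontneed_zeroes_page())
```

The virtual address stays valid throughout; only the physical frame behind it is dropped and later replaced.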

Watching Virtual Memory Move

Every ingredient we have discussed is visible on a live system if you know where to look. A few commands worth remembering.

# Entire address space layout of a process
cat /proc/$PID/maps            # textual VMA list
cat /proc/$PID/smaps           # VMAs with per-mapping memory stats
pmap -x $PID                   # pretty-printed version

# Page-level state (binary files, one 64-bit entry per page; dump them, do not cat)
sudo xxd /proc/$PID/pagemap | head   # virt -> physical mapping, swap info (PFNs need root)
sudo xxd /proc/kpageflags | head     # flags for each physical frame (root only)

# System-wide
cat /proc/meminfo              # free, cached, buffers, huge pages, swap
cat /proc/vmstat               # fault counts, reclaim counters, compaction
cat /proc/zoneinfo             # zones, watermarks, free lists

# Performance
perf stat -e dTLB-load-misses,page-faults ./app
perf record -e page-faults ./app && perf report
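The pagemap file is the least self-explanatory of these: it holds one 64-bit entry per virtual page, indexed by virtual page number. The sketch below decodes a single entry using the bit layout from the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 61 file-backed or shared, bit 55 soft-dirty, bits 0 to 54 the frame number). The function names are ours, and the sample entry is constructed by hand rather than read from a live system.

```python
import struct
from typing import Dict, Optional, Union

PFN_MASK = (1 << 55) - 1  # bits 0..54


def pagemap_offset(vaddr: int, page_size: int = 4096) -> int:
    """Byte offset in /proc/<pid>/pagemap of the entry for vaddr."""
    return (vaddr // page_size) * 8


def decode_pagemap_entry(raw: bytes) -> Dict[str, Union[bool, Optional[int]]]:
    """Decode one 8-byte pagemap entry (native byte order)."""
    (entry,) = struct.unpack("=Q", raw)
    present = bool((entry >> 63) & 1)
    return {
        "present": present,
        "swapped": bool((entry >> 62) & 1),
        "file_or_shared": bool((entry >> 61) & 1),
        "soft_dirty": bool((entry >> 55) & 1),
        # The kernel reports PFN 0 to readers without sufficient privilege.
        "pfn": (entry & PFN_MASK) if present else None,
    }


if __name__ == "__main__":
    sample = struct.pack("=Q", (1 << 63) | 0x12345)  # hand-built: present, PFN 0x12345
    print(decode_pagemap_entry(sample))
```

To use it on a live process, seek to pagemap_offset(address) in the process's pagemap file and read 8 bytes.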

smaps in particular is a goldmine. Every VMA appears with fields for Size, Rss, Pss (proportional set size, which splits shared pages among sharers), Private_Clean, Shared_Clean, Referenced, Anonymous, AnonHugePages (anonymous memory backed by THP), ShmemPmdMapped (shared memory mapped with huge pages), Swap, SwapPss, Locked, THPeligible. If you are chasing down a memory issue, there is probably a question you can answer by reading smaps and adding up a field.
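"Adding up a field" is a few lines of text processing. The sketch below sums one smaps field across all VMAs; it runs on an invented excerpt so it works anywhere, and the function name is ours. Values in smaps are reported in kB.

```python
def sum_smaps_field(smaps_text: str, field: str) -> int:
    """Total of one smaps field (in kB) across every VMA in the text."""
    prefix = field + ":"          # the colon keeps "Pss" from matching "Pss_Anon"
    total = 0
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            total += int(line.split()[1])
    return total


# Invented excerpt in the /proc/<pid>/smaps format.
SAMPLE_SMAPS = """\
55d0c0a00000-55d0c0a21000 rw-p 00000000 00:00 0                          [heap]
Size:                132 kB
Rss:                  64 kB
Pss:                  64 kB
7f3c00000000-7f3c00200000 r--p 00000000 08:01 1234                       /usr/lib/libc.so.6
Size:               2048 kB
Rss:                1024 kB
Pss:                 256 kB
"""

if __name__ == "__main__":
    print("Rss kB:", sum_smaps_field(SAMPLE_SMAPS, "Rss"))
    print("Pss kB:", sum_smaps_field(SAMPLE_SMAPS, "Pss"))
```

In the sample, Rss totals 1088 kB while Pss totals only 320 kB: the libc mapping is shared, so this process is charged a quarter of it.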

The Mental Model To Keep

A handful of ideas are enough to reason about almost any virtual memory puzzle.

Every load and store goes through a four-level page table (five on the newest hardware), cached in a TLB. If the TLB is hot, translation is effectively free. If not, you pay for a walk of up to four memory references. Huge pages stretch the TLB by making each entry cover more memory.
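The effect of huge pages on TLB coverage is easy to quantify. In the sketch below the entry count of 1536 is an assumption chosen for illustration, not a figure for any particular CPU, and the function names are ours.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_size


def walk_levels(page_size: int) -> int:
    """Table levels read on a miss under four-level paging."""
    return {4 * KIB: 4, 2 * MIB: 3, 1 * GIB: 2}[page_size]


if __name__ == "__main__":
    entries = 1536  # assumed TLB size, for illustration only
    for size in (4 * KIB, 2 * MIB):
        reach = tlb_reach(entries, size)
        print(f"{size // KIB:>5} KiB pages: reach {reach / MIB:8.0f} MiB, "
              f"{walk_levels(size)} levels per walk")
```

With the assumed 1536 entries, 4 KiB pages cover 6 MiB of address space while 2 MiB pages cover 3 GiB, and each miss on a 2 MiB page reads one table level fewer.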

Memory exists in three populations: anonymous pages (heap, stack, mmap(MAP_ANONYMOUS)), file-backed pages (mmap of files, plus the entire page cache), and kernel pages. The first two compete for RAM and are managed through LRU lists. The kernel decides who wins based on swappiness, pressure, and access patterns.

The kernel is lazy. Nothing is allocated until it is touched. Nothing is copied until it is written. Nothing is evicted until it is necessary.

Faults are the moments the kernel pays for laziness. Minor faults are cheap. Major faults hit disk and are slow, though NVMe has made them less painful than they used to be.
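A process can read its own fault counters through getrusage, which makes the cost of first touch visible without any external tool. This sketch assumes a Unix-like system (the resource module is not available on Windows); the helper names are ours, and the exact counts vary by kernel and Python build, so none are claimed here.

```python
import mmap
import resource
from typing import Callable, Tuple


def count_faults(action: Callable[[], None]) -> Tuple[int, int]:
    """Run action and return (minor_faults, major_faults) it incurred."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    action()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_minflt - before.ru_minflt,
            after.ru_majflt - before.ru_majflt)


def touch_fresh_pages(n_pages: int = 256) -> None:
    """Map anonymous memory and write one byte into each page."""
    region = mmap.mmap(-1, n_pages * mmap.PAGESIZE)
    try:
        for i in range(n_pages):
            region[i * mmap.PAGESIZE] = 1   # first write to each page
    finally:
        region.close()


if __name__ == "__main__":
    minor, major = count_faults(touch_fresh_pages)
    print(f"minor faults: {minor}, major faults: {major}")
```

On a typical Linux system the minor count should come out close to the number of pages touched and the major count at zero, since nothing here has to come from disk.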

Everything you think of as "memory" is a deal between user code, the kernel, and the MMU, mediated by a tree of tables. Once that clicks, the behaviour of tools like free, perf, and pmap stops looking arbitrary, and the gap between "my program allocated this much" and "the kernel says RSS is this much" becomes easier to explain.

The core mechanism is not complicated. A tree of 512-entry tables maps virtual addresses to physical pages. A TLB caches the hot translations. A fault handler fills in what is missing. Most of the rest is policy layered on top: when to share pages, when to swap them out, when to promote huge pages, and how aggressively to reclaim memory under pressure. That is the model worth keeping in your head.