
How the Linux Kernel Handles Memory Allocation


The Linux kernel allocates memory all the time. Network stacks need packet buffers. Filesystems need dentries, inodes, and page-cache pages. The scheduler needs task structures and runqueue state. Device drivers need DMA-capable regions. Virtual memory code needs page tables, reverse-mapping structures, and metadata. None of these consumers want exactly the same thing. Some need physically contiguous pages. Some only need one virtually contiguous range. Some cannot sleep. Some can reclaim. Some need tiny hot objects with low overhead. Some need large buffers that would be fragile if they required physical contiguity.

This is why kernel memory allocation looks complicated from the outside. You see:

  • alloc_pages
  • __get_free_pages
  • kmalloc
  • kzalloc
  • kmem_cache_alloc
  • vmalloc
  • GFP flags
  • reclaim
  • compaction
  • OOM reports

and it can feel like a pile of overlapping APIs. It is not. It is a layered system built to satisfy different constraints without collapsing performance or reliability.

The clean way to understand Linux kernel allocation is to move from bottom to top:

  1. physical memory appears as page frames
  2. the buddy allocator manages free blocks of contiguous pages
  3. slab or SLUB allocators turn pages into caches of small objects
  4. kmalloc routes small general-purpose allocations into those caches
  5. vmalloc trades physical contiguity for virtual contiguity
  6. reclaim and compaction try to recover or reshape memory under pressure
  7. the OOM killer steps in when the kernel cannot make enough progress

By the end of this article you should be able to read a page-allocation warning, look at /proc/buddyinfo or /proc/slabinfo, and understand what kind of scarcity the machine is really facing.

Stage 0: Firmware Memory Maps Become Kernel Page Frames

Linux begins with whatever firmware says RAM looks like. Early in boot, firmware or the hypervisor provides a memory map. Linux parses it and identifies:

  • usable RAM
  • reserved regions
  • ACPI and firmware areas
  • initramfs location
  • kernel image placement
  • device-specific reservations

The kernel then builds metadata for physical pages, commonly represented by struct page. Conceptually, RAM becomes a large indexed set of page frames:

PFN 0
PFN 1
PFN 2
...

PFN means page frame number. Each frame has metadata that can describe:

  • whether the page is free or allocated
  • which zone it belongs to
  • reference counts
  • dirty or writeback state
  • LRU linkage
  • mapping information
  • many other flags
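The relationship between a physical address and its PFN is plain shift arithmetic. A minimal sketch, assuming 4 KiB pages (a page shift of 12, which is common but not universal); the function names are illustrative and are not kernel APIs:

```python
PAGE_SHIFT = 12              # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def phys_to_pfn(phys_addr: int) -> int:
    """Return the page frame number that contains a physical address."""
    return phys_addr >> PAGE_SHIFT


def pfn_to_phys(pfn: int) -> int:
    """Return the physical address of the first byte of a page frame."""
    return pfn << PAGE_SHIFT
```

Every address inside one frame maps to the same PFN, which is why per-page metadata can be indexed by PFN alone.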

Nothing higher-level works without this substrate. Filesystem caches, network buffers, slab objects, anonymous memory, and page tables all ultimately depend on these physical page frames.

Stage 1: Zones Matter Before You Ever Ask for a Page

Linux does not treat all RAM as one undifferentiated pool. It divides memory into zones because some consumers have constraints that others do not. Common zones include:

  • ZONE_DMA
  • ZONE_DMA32
  • ZONE_NORMAL
  • ZONE_MOVABLE

The exact makeup depends on architecture and machine size, but the idea is stable. Some devices can only DMA into lower address ranges. Some memory is better reserved for movable pages to make compaction easier. Some allocations must stay out of certain areas.

This matters because "there is free RAM" is often not enough. The allocator may need:

  • memory from the right zone
  • a block of the right order
  • pages with the right migration properties

An allocation can fail while other zones still have space. That is one reason page-allocation warnings log zone and GFP information. The shortage is often contextual, not absolute.

You can inspect a deeper zone view with:

cat /proc/zoneinfo

That file is long, but it is the right place to look when the machine appears to have memory while a specific class of allocations still struggles.

Stage 2: The Buddy Allocator Is the Core Page Allocator

Linux's primary page allocator is the buddy allocator. It manages free memory in blocks whose sizes are powers of two. The unit is pages, not bytes.

Orders work like this:

  • order 0 = 1 page
  • order 1 = 2 pages
  • order 2 = 4 pages
  • order 3 = 8 pages
  • order 4 = 16 pages

and so on.

If a request needs four contiguous pages, the allocator wants an order-2 block. If such a block is not available, it looks for a larger one and splits it into smaller buddies until the desired order appears.
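The order arithmetic can be sketched in a few lines. This assumes 4 KiB pages, and the helper names are made up for illustration:

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def order_for_pages(n_pages: int) -> int:
    """Smallest order whose block holds at least n_pages pages."""
    if n_pages < 1:
        raise ValueError("need at least one page")
    return (n_pages - 1).bit_length()


def order_for_bytes(size: int) -> int:
    """Smallest order whose block holds at least `size` bytes."""
    pages = -(-size // PAGE_SIZE)  # ceiling division
    return order_for_pages(pages)


def block_bytes(order: int) -> int:
    """Size in bytes of one block of the given order."""
    return PAGE_SIZE << order
```

A request for three pages rounds up to order 2, so one of the four pages in that block is not needed by the caller. Rounding to powers of two is the price of the simple split and merge rules.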

When memory is freed, the allocator checks whether the block's matching buddy is also free. If both are free and the same order, they can merge into the next larger order. This merge can repeat upward.

That split and coalesce behaviour is the heart of the design. It gives Linux:

  • fast access to contiguous page blocks
  • efficient reuse
  • a simple model for rebuilding larger extents when fragmentation allows it

The buddy allocator is not a general allocator for arbitrary small objects. It is a page-run allocator. Everything above it either uses pages directly or builds finer-grained allocation mechanisms on top.

Stage 3: What Actually Happens on Allocation

Suppose the kernel wants an order-2 allocation, four contiguous pages.

If an order-2 block exists on the free list, great. The allocator removes it and returns it.

If not, it looks for an order-3 block. If it finds one, it:

  1. removes the order-3 block
  2. splits it into two order-2 buddies
  3. returns one buddy
  4. places the other back on the order-2 free list

If no order-3 block exists, it tries order 4, then order 5, and continues upward until it either finds a larger block to split or exhausts options.

Freeing does the reverse. If an order-2 block is freed, the allocator checks whether its matching order-2 buddy is also free. If yes, they merge into order 3. Then the allocator checks whether that merged order-3 block's buddy is free, and so on.

This is one reason the allocator is easy to visualise and reason about. Memory pressure and fragmentation become questions about free blocks at different orders.
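The split and merge steps above can be sketched as a toy model. This is a simplified illustration, not the kernel's implementation: it ignores zones, migratetypes, and locking, and it assumes memory starts as a single free block of the maximum order at PFN 0.

```python
class ToyBuddy:
    """Toy buddy allocator over PFNs 0 .. 2**max_order - 1."""

    def __init__(self, max_order: int):
        self.max_order = max_order
        # free_lists[k] holds the starting PFNs of free order-k blocks.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)

    def alloc(self, order: int):
        """Return the starting PFN of an order-`order` block, or None."""
        for k in range(order, self.max_order + 1):
            if not self.free_lists[k]:
                continue
            pfn = min(self.free_lists[k])
            self.free_lists[k].remove(pfn)
            # Split downward, putting each upper half back on its free list.
            while k > order:
                k -= 1
                self.free_lists[k].add(pfn + (1 << k))
            return pfn
        return None

    def free(self, pfn: int, order: int) -> None:
        """Free a block and merge with its buddy for as long as possible."""
        while order < self.max_order:
            buddy = pfn ^ (1 << order)  # the buddy differs in exactly one bit
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            pfn = min(pfn, buddy)
            order += 1
        self.free_lists[order].add(pfn)

    def counts(self):
        """Free block count per order, in the spirit of /proc/buddyinfo."""
        return [len(s) for s in self.free_lists]
```

Allocating order 2 from a lone order-4 block leaves one free order-2 block and one free order-3 block behind. Freeing the allocation merges everything back into the original order-4 block. In this sketch the caller passes the order to free; the kernel tracks it in page metadata instead.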

Stage 4: /proc/buddyinfo Shows the Real Shape of Fragmentation

One of the best files for understanding page fragmentation is:

cat /proc/buddyinfo

Sample output looks conceptually like this:

Node 0, zone   Normal  1245 892 401 210 98 41 17 8 2 0 0

Each column is the count of free blocks at a given order, starting with order 0 on the left and rising by one per column (here up to order 10). If you see lots of order-0 and order-1 free blocks but almost nothing at higher orders, the machine is fragmented. There may be plenty of free pages in total, but few large contiguous extents remain.

This is exactly the pattern that causes pain for:

  • large network buffers
  • huge pages
  • CMA consumers
  • drivers that need physically contiguous memory

A machine can have gigabytes free and still fail an order-9 allocation (512 contiguous pages, or 2 MiB with 4 KiB pages). That is not a contradiction. It is a contiguity problem.
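A small parser makes the buddyinfo columns easier to reason about. This sketch assumes the line layout shown in the sample above; the function names are illustrative:

```python
def parse_buddyinfo_line(line: str):
    """Split one /proc/buddyinfo line into (node, zone, free block counts)."""
    head, rest = line.split("zone", 1)
    node = int(head.replace("Node", "").replace(",", "").strip())
    fields = rest.split()
    zone = fields[0]
    free_blocks = [int(x) for x in fields[1:]]  # index == order
    return node, zone, free_blocks


def free_pages(free_blocks) -> int:
    """Total free pages: each order-k block contributes 2**k pages."""
    return sum(count << order for order, count in enumerate(free_blocks))


def blocks_at_or_above(free_blocks, min_order: int) -> int:
    """Number of free blocks large enough for an order-min_order request."""
    return sum(free_blocks[min_order:])
```

Applied to the sample line, the zone holds 11817 free pages (about 46 MiB with 4 KiB pages) but not a single free block of order 9 or above. Without reclaim or compaction, an order-9 request from this zone would fail.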

Stage 5: GFP Flags Tell the Allocator What It Is Allowed to Do

Kernel allocation APIs almost always include GFP flags, short for get-free-pages flags. These are crucial because they describe what the allocator may do on the caller's behalf.

Common flags include:

  • GFP_KERNEL
  • GFP_ATOMIC
  • GFP_NOWAIT
  • GFP_DMA32
  • __GFP_ZERO
  • __GFP_RECLAIM

The high-level meaning:

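  • GFP_KERNEL: normal allocation in process context; may sleep, reclaim, and start I/O
  • GFP_ATOMIC: must not sleep; may dip into emergency reserves
  • GFP_NOWAIT: must not sleep or enter direct reclaim; fails rather than waits, though it can still wake kswapd
  • GFP_DMA32: restrict the allocation to memory a 32-bit DMA address can reach (ZONE_DMA32)
  • __GFP_ZERO: zero the memory before returning it
  • __GFP_RECLAIM: permit both direct reclaim and waking kswapd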

The critical point is that allocations are not just about size. They are also about context. A network softirq path and a normal process syscall can ask for the same number of bytes and get different outcomes because one may sleep and reclaim while the other cannot.

This is why page-allocation warnings often print both order and GFP mask. Those two details tell you the shape of the request:

  • how much contiguous memory it wanted
  • what the allocator was permitted to do to satisfy it
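One way to keep the context rules straight is a small lookup table. The following is a deliberately simplified model, not the kernel's flag encoding: the class, the table, and the step names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GfpModel:
    direct_reclaim: bool      # caller may sleep and reclaim on its own behalf
    wakes_kswapd: bool        # background reclaim may be woken
    dips_into_reserves: bool  # may use part of the emergency reserves


# Simplified view of three common masks.
GFP_MODEL = {
    "GFP_KERNEL": GfpModel(direct_reclaim=True, wakes_kswapd=True, dips_into_reserves=False),
    "GFP_NOWAIT": GfpModel(direct_reclaim=False, wakes_kswapd=True, dips_into_reserves=False),
    "GFP_ATOMIC": GfpModel(direct_reclaim=False, wakes_kswapd=True, dips_into_reserves=True),
}


def allowed_slowpath_steps(gfp_name: str):
    """List what the allocator may try once the fast path has failed."""
    model = GFP_MODEL[gfp_name]
    steps = []
    if model.wakes_kswapd:
        steps.append("wake kswapd")
    if model.dips_into_reserves:
        steps.append("use reserves")
    if model.direct_reclaim:
        steps += ["direct reclaim", "compaction"]
    return steps
```

The same order-3 request therefore has very different chances depending on the mask. With GFP_KERNEL the allocator can reclaim and compact on the caller's behalf. With GFP_ATOMIC it can only wake kswapd and lean on reserves.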

Stage 6: alloc_pages and __get_free_pages Sit Close to the Buddy Layer

Some subsystems want pages directly. They use interfaces like:

struct page *page = alloc_pages(GFP_KERNEL, order);
unsigned long addr = __get_free_pages(GFP_KERNEL, order);

These calls are close to the buddy allocator. They make sense when the caller naturally thinks in page units, for example:

  • page cache management
  • page-table creation
  • networking page fragments
  • low-level driver buffers

The tradeoff is that the caller must care about page orders, contiguity, and page-based lifetime rules. These are not the APIs most kernel code wants for every small structure.

Stage 7: Small Kernel Objects Need a Different Strategy

Many kernel objects are much smaller than a page:

  • dentries
  • inodes
  • task_struct-related pieces
  • tiny driver-private state objects
  • timer structures
  • lock structures

If every 64-byte object consumed its own 4 KiB page, memory efficiency would be awful. The kernel needs a way to obtain pages from the buddy allocator and carve them into many equal-sized objects.

This is where the slab-family allocators come in.

Linux has used several implementations over time:

  • SLAB
  • SLUB
  • SLOB

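Modern general-purpose kernels use SLUB; SLOB was removed in Linux 6.4 and SLAB in 6.8, leaving SLUB as the only implementation in current mainline. The conceptual model remains close to the classic slab idea: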

  • obtain one or more pages from the buddy allocator
  • divide those pages into objects of one size class
  • keep track of which objects are free
  • keep frequently used caches warm for performance

This solves two problems at once:

  • memory efficiency for small objects
  • speed, because hot object types can be recycled without going back to the buddy allocator every time
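The carving idea can be sketched as a toy cache. This is an illustration of the concept only, assuming one 4 KiB page per slab and no per-object metadata; real SLUB is considerably more involved.

```python
class ToySlabCache:
    """Toy object cache: each slab is one page cut into equal-sized objects."""

    def __init__(self, object_size: int, page_size: int = 4096):
        self.object_size = object_size
        self.objects_per_slab = page_size // object_size
        self.slabs = []        # slabs[i] is the list of free object indexes
        self.pages_taken = 0   # pages requested from the page allocator

    def alloc(self):
        """Return (slab_id, object_index), reusing free objects first."""
        for slab_id, free in enumerate(self.slabs):
            if free:
                return slab_id, free.pop()
        # Every slab is full: take one more page and carve it up.
        self.pages_taken += 1
        free = list(range(self.objects_per_slab))
        self.slabs.append(free)
        return len(self.slabs) - 1, free.pop()

    def free(self, slab_id: int, index: int) -> None:
        """Return an object to its slab so it can be recycled."""
        self.slabs[slab_id].append(index)
```

With 64-byte objects, the first 64 allocations cost a single page. A freed object is recycled by the next allocation without touching the page allocator at all.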

Stage 8: kmalloc Is a Front Door Into Size-Class Caches

For many callers, the main allocation interface is kmalloc.

void *p = kmalloc(192, GFP_KERNEL);

kmalloc looks like the kernel version of malloc, but internally it usually routes the request into a size-class cache. Typical cache names include:

  • kmalloc-8
  • kmalloc-16
  • kmalloc-32
  • kmalloc-64
  • kmalloc-128
  • and upward through larger classes

So a 33-byte request often lands in kmalloc-64. A 192-byte request typically lands in kmalloc-192, because most configurations also provide the non-power-of-two classes kmalloc-96 and kmalloc-192. The memory still ultimately comes from pages, but the caller does not have to think about splitting pages into small pieces manually.
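The rounding behaviour is easy to sketch. The class list below is an assumption: it reflects a commonly seen set of sizes, but the real set depends on architecture and kernel configuration.

```python
import bisect

# Assumed size classes in bytes; real kernels may differ.
KMALLOC_CLASSES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]


def kmalloc_class(size: int):
    """Return the smallest assumed size class that fits `size`, or None."""
    if size <= 0:
        raise ValueError("size must be positive")
    i = bisect.bisect_left(KMALLOC_CLASSES, size)
    if i == len(KMALLOC_CLASSES):
        return None  # too large for this table; real kernels use another path
    return KMALLOC_CLASSES[i]


def slack_bytes(size: int) -> int:
    """Bytes lost to rounding when `size` is served from its class."""
    cls = kmalloc_class(size)
    if cls is None:
        raise ValueError("size exceeds the assumed class table")
    return cls - size
```

A 33-byte request wastes 31 bytes in a 64-byte object. That slack is one reason hot object types often get a dedicated cache with exact sizing.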

This layering is worth stating explicitly:

kmalloc(size)
  -> size-class cache
  -> slab or SLUB objects carved from pages
  -> pages from buddy allocator
  -> physical RAM page frames

kzalloc is kmalloc plus zeroing. kcalloc allocates a zeroed array and checks the count-times-size multiplication for overflow before allocating.

Stage 9: Dedicated Caches Exist for Important Object Types

The generic kmalloc caches are not the whole story. Many hot kernel object types use dedicated caches created with interfaces like kmem_cache_create. That allows:

  • exact object sizing
  • alignment control
  • constructors
  • accounting per object type
  • debugging options

Examples include caches for:

  • dentries
  • inode structures
  • task-related objects
  • filesystem-specific metadata

This is why /proc/slabinfo is so informative. It shows you not only generic kmalloc cache usage but also which specific kernel object populations are large at the moment.

You can inspect it with the following (on most systems the file is readable only by root):

head /proc/slabinfo

and often more usefully:

grep -E 'dentry|inode|kmalloc' /proc/slabinfo

Large slab use does not automatically mean a leak. It may mean the kernel is caching useful objects that can shrink under pressure.
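To turn a slabinfo line into a memory figure, multiply slabs by pages per slab. The sketch below assumes the version 2.1 column layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then tunables and slabdata) and 4 KiB pages. The sample line in the usage note is made up for illustration.

```python
def parse_slabinfo_line(line: str) -> dict:
    """Parse one data line of /proc/slabinfo (assumed 2.1 layout)."""
    left, _, rest = line.partition(":")
    name, active_objs, num_objs, objsize, objperslab, pagesperslab = left.split()
    active_slabs, num_slabs, _shared = rest.split("slabdata")[1].split()
    return {
        "name": name,
        "active_objs": int(active_objs),
        "num_objs": int(num_objs),
        "objsize": int(objsize),
        "objperslab": int(objperslab),
        "pagesperslab": int(pagesperslab),
        "active_slabs": int(active_slabs),
        "num_slabs": int(num_slabs),
    }


def footprint_bytes(entry: dict, page_size: int = 4096) -> int:
    """Memory held by the cache, counted in whole slab pages."""
    return entry["num_slabs"] * entry["pagesperslab"] * page_size


def utilisation(entry: dict) -> float:
    """Fraction of allocated object slots that are actually in use."""
    return entry["active_objs"] / entry["num_objs"] if entry["num_objs"] else 0.0
```

For a hypothetical line such as `dentry 1000 1050 192 21 1 : tunables 0 0 0 : slabdata 50 50 0`, the cache holds 50 pages (204800 bytes) and about 95% of its slots are in use.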

Stage 10: SLUB Uses Per-CPU Paths to Stay Fast

A heavily loaded system cannot afford to take a global lock on every tiny allocation. Modern SLUB designs avoid that in the common case by using per-CPU state. The rough pattern is:

  • each CPU has quick access to some free objects in hot caches
  • only when local state empties or overflows does the allocator interact with more shared structures

This is a major reason small kernel allocations can be fast under load. The kernel avoids unnecessary cross-CPU contention in the common case.

The cost is more complexity and sometimes some free memory sitting in per-CPU caches rather than one central pool. That is usually a good trade because the whole machine would otherwise bottleneck on allocator locks.

Stage 11: vmalloc Solves a Different Problem

Some callers need a large contiguous virtual range in kernel address space but do not need physical contiguity. vmalloc exists for that case.

vmalloc:

  • allocates pages individually, which may be physically scattered
  • creates kernel page-table mappings so they appear contiguous in virtual address space

This is useful for:

  • larger buffers
  • data structures where physical adjacency is irrelevant
  • cases where forcing physical contiguity would be too fragile

The tradeoffs:

  • more setup cost
  • more page-table overhead
  • worse TLB behaviour than a physically contiguous mapping can have
  • unsuitable for DMA workloads that require physical contiguity

This is one of the most important distinctions in the kernel allocator story:

  • kmalloc tends to imply physically contiguous memory, at least within the size and implementation limits involved
  • vmalloc gives virtual contiguity only

Confusing these leads to bad design decisions and bad debugging assumptions.

Stage 12: vmalloc Success Does Not Prove kmalloc Would Succeed

A common operational mistake is to reason only in total bytes. Suppose a subsystem needs a 2 MiB buffer.

  • vmalloc(2 MiB) may succeed because the kernel can gather many separate pages and map them into one virtual range.
  • kmalloc(2 MiB) may fail because the machine does not have a single sufficiently contiguous physical block left.

This is not inconsistent. These APIs ask for different guarantees.
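The difference in guarantees can be shown with a toy free-page map. This is a sketch only: it ignores the buddy allocator's alignment rules and models memory as a list of booleans, where True means the page is free.

```python
def find_contiguous(free_map, n: int):
    """Return the first PFN of a run of n free pages, or None."""
    run_start, run_len = None, 0
    for pfn, is_free in enumerate(free_map):
        if not is_free:
            run_len = 0
            continue
        if run_len == 0:
            run_start = pfn
        run_len += 1
        if run_len == n:
            return run_start
    return None


def gather_scattered(free_map, n: int):
    """Return any n free PFNs, adjacent or not, or None if too few exist."""
    pfns = [pfn for pfn, is_free in enumerate(free_map) if is_free][:n]
    return pfns if len(pfns) == n else None
```

With every second page in use, half of memory is free, yet no two free pages are adjacent. A physically contiguous request for two pages fails, while a request that only needs four pages from anywhere succeeds.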

When someone says "the machine has free RAM, why did the allocation fail?", the right follow-up questions are:

  • did it need physical contiguity
  • what order was requested
  • which GFP flags were used
  • in which zone did it need pages
  • could the caller sleep and reclaim

The byte count alone almost never answers the real question.

Stage 13: Fragmentation Hurts High Orders First

Fragmentation means free memory exists, but it is split into small pieces. Under fragmentation:

  • order-0 allocations may keep succeeding
  • high-order allocations start failing
  • compaction may run more often
  • latency can spike

This happens because larger buddies are harder to preserve over time. Long-lived allocations, pinned pages, and mixed mobility patterns break memory into smaller islands. A machine with plenty of free pages can still lack a free order-9 or order-10 block.

Common sources of fragmentation pressure include:

  • long-lived page pinning
  • DMA-heavy workloads
  • transparent huge page demands
  • complex device and memory hotplug environments
  • workloads that mix movable and unmovable pages densely

This is why /proc/buddyinfo is more honest than a generic "free memory" graph when the question is high-order allocation health.

Stage 14: Reclaim Tries to Recover Memory Before Failure

When memory pressure rises, Linux does not immediately admit defeat. It first tries reclaim. Reclaim is the process of finding memory that can be freed or repurposed, such as:

  • clean page-cache pages that can simply be dropped
  • dirty cache pages that can be written back and then freed
  • anonymous pages that can be swapped
  • shrinkable kernel caches

Two important forms exist:

  • background reclaim, commonly through kswapd
  • direct reclaim, where the allocating task itself gets dragged into reclaim work

Direct reclaim is one reason systems under memory pressure can feel sluggish. A task that wanted memory for its real work now spends time scanning LRUs, writing pages back, or waiting on reclaim-related activity.

Useful observation tools:

vmstat 1

Fields such as si and so (memory swapped in and out per second), along with the other VM counters, help show whether the machine is reclaiming and swapping aggressively.
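For finer detail than vmstat's columns, the cumulative counters in /proc/vmstat can be sampled twice and subtracted. The counter names below are ones found on recent kernels, but the exact set varies by version, so treat the list as an assumption.

```python
# Assumed counter names; availability depends on kernel version.
RECLAIM_COUNTERS = (
    "pgscan_kswapd", "pgscan_direct",
    "pgsteal_kswapd", "pgsteal_direct",
    "pswpin", "pswpout",
)


def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines as found in /proc/vmstat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def reclaim_delta(before: dict, after: dict) -> dict:
    """Change in each reclaim counter between two snapshots."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in RECLAIM_COUNTERS}
```

A rising pgscan_direct between two snapshots means allocating tasks are doing reclaim work themselves, which is usually the more painful case. Growth only in pgscan_kswapd points to background reclaim keeping up.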

Stage 15: Slab Shrinkers Pull Back Cached Kernel Objects

Because the kernel caches lots of small objects, reclaim also needs a way to reduce slab-backed populations. Subsystems provide shrinkers so the VM can ask:

  • can you give back some cached dentries
  • can you reduce inode cache pressure
  • can you drop reclaimable metadata objects

This is one reason a huge dentry cache is not automatically bad. The right question is not "is this cache large". The right question is "does it shrink when the machine needs the memory".

This is also why total slab growth can be healthy during active workloads. Caches speed the system up. The VM layer only pushes them down when there is a better use for the memory.

Stage 16: Compaction Tries to Rebuild Larger Extents

Reclaim frees pages. Compaction moves pages around to create larger contiguous free blocks.

Suppose a high-order allocation wants 512 contiguous pages. The machine may have enough total free memory, but in little islands. Compaction:

  • isolates movable pages
  • migrates them elsewhere
  • tries to pack free space together
  • rebuilds larger contiguous buddies

This is expensive work. It requires:

  • identifying which pages are movable
  • copying content
  • updating references and mappings
  • avoiding pinned or unmovable pages

Compaction can help transparent huge pages, large page-cache needs, and some driver or allocator demands. It cannot perform miracles if too much memory is pinned or fragmented in hostile ways.

If compaction repeatedly fails, large-order allocations stay unreliable even while order-0 pages remain available.
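A toy model shows both what compaction achieves and what stops it. In this sketch a zone is a string of page states, F for free, M for movable, and U for unmovable or pinned. One scanner walks up from the bottom looking for movable pages while another walks down from the top looking for free pages. It is a loose illustration of the two-scanner idea, not the kernel's algorithm.

```python
def compact(pages: str) -> str:
    """Migrate movable pages toward the top so free space gathers below."""
    mem = list(pages)
    lo, hi = 0, len(mem) - 1
    while True:
        while lo < hi and mem[lo] != "M":   # next movable page from the bottom
            lo += 1
        while lo < hi and mem[hi] != "F":   # next free page from the top
            hi -= 1
        if lo >= hi:
            break
        mem[lo], mem[hi] = "F", "M"         # "migrate" the page
        lo += 1
        hi -= 1
    return "".join(mem)


def largest_free_run(pages: str) -> int:
    """Length of the longest run of contiguous free pages."""
    best = run = 0
    for state in pages:
        run = run + 1 if state == "F" else 0
        best = max(best, run)
    return best
```

Compacting `MFMFUFMF` gives `FFFFUMMM`, and the largest free run grows from one page to four. Compacting `MFUFMFUF` gives `FFUFFMUM`, whose largest run is only two pages, because the unmovable pages stay where they are and split the free space.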

Stage 17: Transparent Huge Pages Depend on the Same Story

Transparent Huge Pages are a useful example because they connect virtual memory performance to physical allocation realities. A huge page needs a larger contiguous backing extent: on x86-64 with 4 KiB base pages, a 2 MiB huge page is an order-9 block. If the machine is fragmented, THP allocation may fail or require more compaction effort.

This is one reason allocator health affects performance beyond simple "out of memory" events. Fragmentation can change:

  • TLB behaviour
  • CPU overhead in memory-heavy workloads
  • latency due to compaction
  • success rate of large mappings

The allocator is not just a correctness mechanism. It shapes runtime performance.

Stage 18: DMA and CMA Make Physical Layout Matter More

Some drivers need memory with extra constraints:

  • physically contiguous
  • DMA-addressable in a specific range
  • with alignment rules

This is where zones and features like CMA (the Contiguous Memory Allocator, which reserves regions for large contiguous requests) become especially relevant. A driver may fail not because memory is exhausted globally, but because the kind of memory it needs has become hard to satisfy.

This is common in:

  • graphics
  • cameras
  • media pipelines
  • embedded systems
  • some network devices

The allocator problem then becomes a topology problem, not just a capacity problem.

Stage 19: What Page Allocation Failures Are Really Telling You

A page allocation failure log might mention:

  • order
  • GFP mask
  • zone information
  • reclaim attempts
  • compaction state

A stylised example:

page allocation failure: order:9, mode:0x140c0c0(GFP_KERNEL|__GFP_COMP), nodemask=(null)

This message is already telling a story:

  • order 9 means a large contiguous request
  • the GFP mask tells you what the allocator could attempt
  • the failure may be fragmentation, not absolute exhaustion

Reading these logs well is mostly about internalising the allocator layers described in this article. Once you know the meaning of order, contiguity, zones, reclaim, and compaction, the warning stops looking opaque.
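Pulling the order and GFP names out of such a line is a small regular-expression job. This sketch is written against the stylised example above; real messages vary between kernel versions, so the pattern is an assumption rather than a stable format.

```python
import re

# Pattern matched against the stylised example line only.
_FAILURE_RE = re.compile(
    r"page allocation failure: order:(\d+), mode:(0x[0-9a-fA-F]+)\(([^)]*)\)"
)


def parse_alloc_failure(line: str):
    """Extract order, mode, and flag names from a failure line, or None."""
    match = _FAILURE_RE.search(line)
    if match is None:
        return None
    order = int(match.group(1))
    return {
        "order": order,
        "pages": 1 << order,                 # contiguous pages requested
        "mode": int(match.group(2), 16),
        "flags": match.group(3).split("|"),
    }
```

For the example line this yields order 9, 512 contiguous pages, and the flag names GFP_KERNEL and __GFP_COMP. That is enough to tell you the caller could sleep and reclaim, so fragmentation is the more likely culprit than context restrictions.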

Stage 20: The OOM Killer Is the End of a Long Escalation Ladder

The Out Of Memory killer is not the first response to pressure. Linux usually reaches it only after:

  • ordinary allocation paths fail
  • reclaim cannot free enough useful memory
  • compaction cannot create the necessary extents
  • progress is no longer likely without killing something

The OOM killer then chooses a victim using heuristics that take into account things like:

  • memory footprint
  • oom_score_adj
  • cgroup boundaries
  • how much killing a task is likely to free

On cgroup-constrained systems, OOM can be local to the cgroup. A container can be killed for exceeding its memory limit while the host still has plenty of memory overall. This is a major source of confusion in orchestrated environments.

The key point is simple. OOM means the kernel could not make enough forward progress through the available recovery mechanisms. It does not simply mean "RAM reached zero".
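Victim selection can be approximated with a toy score. The sketch below is loosely modelled on the kernel's badness heuristic (resident, swapped, and page-table pages, shifted by oom_score_adj as a share of total memory), but the details differ between kernel versions and the function names are made up.

```python
def toy_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Return a toy OOM score, or None if the task is exempt."""
    if oom_score_adj == -1000:
        return None  # -1000 marks a task as not killable by the OOM killer
    points = rss_pages + swap_pages + pgtable_pages
    points += oom_score_adj * (total_pages // 1000)
    return points


def pick_victim(tasks: dict, total_pages: int):
    """tasks maps name -> (rss, swap, pgtable, oom_score_adj)."""
    scored = {}
    for name, (rss, swap, pgtable, adj) in tasks.items():
        score = toy_badness(rss, swap, pgtable, adj, total_pages)
        if score is not None:
            scored[name] = score
    return max(scored, key=scored.get) if scored else None
```

With one million total pages, a task holding 200000 pages at adjustment 0 scores 200000, while a task holding only 50000 pages at adjustment 500 scores 550000 and is chosen first. Footprint matters, but policy can outweigh it.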

Stage 21: The Triggering Task Is Not Always the Victim

Another common misunderstanding is to assume the process that asked for memory must be the one that dies. Not necessarily. The kernel chooses a victim according to its heuristics and policy constraints. The process that triggered the final failed allocation may survive while another process is sacrificed because it is a better target for memory recovery.

This is why OOM logs should be read carefully rather than emotionally. The line that names the killed task is only part of the story. The lines describing allocation context and the machine's memory state often tell you more about the true cause.

Stage 22: Memory Cgroups Change the Arena

On systems using memory cgroups, allocation and OOM behaviour can be scoped to a container or service boundary. This means:

  • reclaim can happen within the cgroup
  • limits can trigger local OOM before host exhaustion
  • one workload can fail while another remains healthy

This is essential for modern multi-tenant systems, but it also means "the machine had free RAM" is an incomplete statement. The relevant arena may have been only the service's memory cgroup.

In Kubernetes or other container environments, this is often the reason an application is OOM-killed while node-level graphs still look comfortable.
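On cgroup v2 systems, a group's own files say whether the limit was the cause. The sketch below parses the text of `memory.events` and compares `memory.current` against `memory.max`. The sample values in the usage note are invented, and where the group lives under /sys/fs/cgroup depends on the system.

```python
def parse_memory_events(text: str) -> dict:
    """Parse cgroup v2 memory.events content ('key count' per line)."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def was_oom_killed(events: dict) -> bool:
    """True if at least one task in the cgroup was killed by the OOM killer."""
    return events.get("oom_kill", 0) > 0


def headroom_bytes(current_text: str, max_text: str):
    """Bytes left before the limit, or None when memory.max is 'max'."""
    limit = max_text.strip()
    if limit == "max":
        return None  # no limit configured
    return int(limit) - int(current_text.strip())
```

If a group's events read `oom 1` and `oom_kill 1` while the host's buddyinfo looks healthy, the kill was a policy decision inside that group's budget, not host exhaustion.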

Stage 23: A Concrete Example, High-Rate Network Receive

Consider a busy network driver under load.

  1. The driver needs buffers for received packets.
  2. Some of those buffers may come from page-based allocation paths with strict context limits.
  3. Small metadata objects come from slab caches.
  4. Under pressure, reclaim begins dropping cache and shrinking slabs.
  5. If a higher-order allocation is required and fragmentation is bad, compaction may run or fail.
  6. Packet drops rise even before the system is globally out of memory.

This example shows why allocator behaviour is visible in workload performance. Memory management is not just a background bookkeeping task. It shapes throughput, latency, and failure modes directly.

Stage 24: A Concrete Example, Filesystem Metadata Growth

Now consider a machine scanning millions of files.

The filesystem and VFS accumulate:

  • dentries
  • inodes
  • page-cache pages

Performance improves because lookups hit warm caches. Memory use rises because those caches are doing real work. Under pressure:

  • reclaim can drop page cache
  • shrinkers can reduce reclaimable dentry and inode caches

If those caches refuse to shrink when they should, that starts to look like a leak. If they shrink appropriately, it is healthy caching behaviour.

A static "slab usage is high" alert is often low value. Without pressure context, it says little.

Stage 25: Observability Tools That Actually Help

Different tools answer different allocator questions.

Contiguity and fragmentation

cat /proc/buddyinfo

Slab and object cache populations

cat /proc/slabinfo

Zone-level watermarks and detail

cat /proc/zoneinfo

VM behaviour over time

vmstat 1

OOM and allocation warnings

dmesg | grep -Ei 'oom|page allocation failure'

Good debugging means picking the tool that matches the layer:

  • buddy-level problem, use buddyinfo
  • slab growth question, use slabinfo
  • pressure and reclaim question, use vmstat
  • victim and failure context, use kernel logs

Stage 26: The Whole Stack in One View

The Linux kernel memory allocator stack is easiest to remember as layers:

physical RAM
  -> page frames with metadata
  -> zones
  -> buddy allocator for page runs
  -> slab or SLUB caches for small objects
  -> kmalloc and dedicated object caches
  -> vmalloc for virtually contiguous mappings
  -> reclaim, compaction, and OOM when pressure rises

Each layer exists because the lower layer alone would be too blunt:

  • buddy allocation is good for page runs, poor for many tiny objects
  • slab is good for tiny objects, poor for arbitrary large virtual buffers
  • vmalloc is good for virtual contiguity, poor for DMA or some hot paths

Linux uses several allocators because the kernel has several different memory problems.

Stage 27: Per-CPU Page Lists Keep the Fast Path Cheap

The allocator story is not only about logical layers. It is also about avoiding contention. Linux uses per-CPU page lists so common small page traffic does not always pound on shared zone locks.

The idea is simple:

  • CPUs hold small local stocks of pages
  • hot order-0 traffic is often satisfied locally
  • refills and drains eventually touch the wider buddy state

This matters because a modern multi-core machine can generate huge allocator traffic. Without per-CPU fast paths, ordinary workloads would spend far more time contending on global allocator structures.
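The effect of batching is easy to see in a toy model that counts how often the shared zone state is touched. The classes and the batch size of 32 are assumptions for illustration, not kernel values.

```python
class ToyZone:
    """Shared pool of free pages; every access stands for taking the zone lock."""

    def __init__(self, free_pages: int):
        self.free_pages = free_pages
        self.lock_acquisitions = 0

    def take_batch(self, n: int) -> int:
        self.lock_acquisitions += 1
        got = min(n, self.free_pages)
        self.free_pages -= got
        return got


class ToyPerCpuList:
    """Per-CPU stock of pages, refilled from the zone one batch at a time."""

    def __init__(self, zone: ToyZone, batch: int = 32):
        self.zone = zone
        self.batch = batch
        self.stock = 0

    def alloc_page(self) -> bool:
        if self.stock == 0:
            self.stock = self.zone.take_batch(self.batch)
            if self.stock == 0:
                return False  # the zone itself is empty
        self.stock -= 1
        return True
```

One hundred single-page allocations touch the shared zone only four times instead of one hundred. The 28 pages left in the local stock are the memory that sits in per-CPU lists rather than in the buddy free lists.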

Stage 28: The Slowpath Is Where Pressure Becomes Visible

A successful fast allocation may look trivial. The allocator slowpath is where the kernel starts working much harder. If a request cannot be served immediately, Linux may:

  • search other free lists
  • consider fallback migratetypes
  • wake or depend on kswapd
  • enter direct reclaim
  • attempt compaction
  • use reserves depending on flags
  • fail and emit diagnostics

This is why allocator behaviour can change suddenly under pressure. The same request size can go from cheap to expensive because the machine crossed from a free-list hit into reclaim and compaction work.

For latency-sensitive code paths, that transition matters nearly as much as outright failure.

Stage 29: NUMA Means Not All Successful Allocations Are Equal

On NUMA systems, locality matters. Memory attached to one node is cheaper for CPUs near that node than memory attached elsewhere. Linux therefore tries to allocate with node locality in mind.

This introduces another dimension beyond:

  • zone
  • order
  • GFP constraints

Now you also care about:

  • preferred node
  • fallback node
  • memory policy

A NUMA-local shortage can hurt performance even when total host memory is comfortable. That is another reason "the machine has free RAM" is a weak statement in performance debugging.
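Node fallback can be sketched as a nearest-first search. This is a strong simplification: real placement also involves zonelists, memory policies, and cpusets. The distance table is a hypothetical one in the style firmware reports, where 10 means local.

```python
def pick_node(preferred: int, distances, free_pages, needed: int):
    """Pick the closest node to `preferred` that has enough free pages."""
    candidates = sorted(
        range(len(free_pages)),
        key=lambda node: distances[preferred][node],
    )
    for node in candidates:
        if free_pages[node] >= needed:
            return node
    return None  # no node can satisfy the request


def is_remote(preferred: int, chosen) -> bool:
    """True when the allocation succeeded but landed on another node."""
    return chosen is not None and chosen != preferred
```

On a two-node machine with distances `[[10, 21], [21, 10]]`, a task preferring node 0 still gets its page when node 0 is empty, but from node 1. The allocation succeeds and every later access pays the remote-memory cost.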

Stage 30: Anonymous Memory and Page Cache Behave Differently Under Reclaim

Reclaim is not one homogeneous action. The VM often has to balance two broad memory populations:

  • anonymous pages belonging to processes
  • file-backed page-cache pages

Clean file-backed cache may be cheap to drop. Anonymous memory may need swap. Dirty file-backed pages may need writeback before they become reclaimable.

This means reclaim behaviour depends strongly on workload shape. A build host with a huge filesystem cache behaves differently from a JVM-heavy service whose working set is mostly anonymous heap.

Both are "using memory". The cost of reclaiming that memory is not the same.

Stage 31: Writeback Couples Memory Pressure to Storage Pressure

Dirty cache pages are a bridge between memory management and storage performance. If reclaim wants those pages back, writeback may have to happen first.

That creates a common production pattern:

  • memory pressure rises
  • reclaim starts chasing dirty pages
  • writeback activity rises
  • storage latency increases
  • allocations become slower because memory recovery is now partly gated by I/O

This is one reason memory incidents can show up as storage incidents. The allocator and the block layer are often part of the same chain once dirty data is involved.

Stage 32: Pinned Pages Make Compaction Harder

Some pages are easy to move. Others are not. Long-lived pinned pages from direct I/O, device mappings, RDMA, or some media paths restrict what compaction can accomplish.

This matters because the machine may still have a lot of free and movable memory overall, yet a handful of stubborn pinned regions can prevent the formation of large clean extents where they are needed.

The operational consequence is subtle. A workload can cause allocator pain not because it is using extraordinary amounts of memory, but because it is using memory in ways that reduce mobility and compaction success.

Stage 33: SLUB Debugging Changes Allocator Behaviour on Purpose

Debug kernels often enable allocator features such as:

  • red zones
  • poisoning
  • stack tracking
  • stricter sanity checks

These help catch:

  • use-after-free
  • double-free
  • object overruns
  • freelist corruption

They also change performance and memory overhead. A debug kernel is not allocator-neutral. This matters if you are comparing behaviour between debug and production kernels. The allocator itself is behaving differently by design.

Stage 34: vmalloc Uses Virtual Space, Not Magic

vmalloc can feel like an easy escape from physical-contiguity problems, but it spends kernel virtual address space and page tables to buy that flexibility. On 64-bit systems this is often fine, but it is still a real resource trade.

Very heavy vmalloc use can mean:

  • more mapping overhead
  • more TLB pressure
  • slower setup and teardown

The right choice between kmalloc and vmalloc is therefore a real design decision, not just a convenience preference.

Stage 35: Huge Pages Turn Fragmentation into a Performance Topic

Huge pages and Transparent Huge Pages connect allocator health directly to CPU performance. If larger extents can be formed, TLB pressure can drop and some workloads speed up. If the machine is fragmented, huge-page formation becomes harder and compaction pressure can rise.

This creates a familiar tuning tension:

  • compact more aggressively and pay latency
  • back off and accept smaller pages

The best answer depends on the workload. Throughput-heavy analytics and latency-sensitive online services often want different tradeoffs.

Stage 36: Memory Hotplug and Movable Memory Influence Policy

Linux memory policy is shaped not only by immediate allocation speed but also by system-management features such as memory hotplug. Movable memory zones help preserve the ability to rearrange or remove memory blocks later.

This may feel distant from ordinary server work, but it explains why migration types and page mobility matter so much. The allocator is designed for more than one machine shape and more than one operational model.

Stage 37: Leak Hunting Requires Knowing Which Layer Holds Memory

When someone suspects a kernel memory leak, the first task is to identify the layer:

  • page allocator consumption
  • slab cache growth
  • page cache growth
  • memcg-local pressure
  • long-lived mapped or pinned pages

Tools like kmemleak can help, but interpretation still requires allocator literacy. A large dentry cache under a filesystem-heavy workload is not automatically a leak. A large kmalloc cache may or may not be. The pressure response tells you more than the steady-state size.

Stage 38: Reading an Incident the Right Way

A disciplined investigation order helps:

  1. Ask whether the problem is capacity, contiguity, locality, or policy.
  2. Inspect buddyinfo, slabinfo, and zoneinfo.
  3. Look for reclaim, compaction, and OOM evidence in logs.
  4. Check whether cgroup limits changed the effective arena.
  5. Ask whether pinned pages or DMA constraints made the request harder than the byte count suggests.

This keeps allocator debugging grounded in the actual failure mode instead of collapsing everything into "low memory".

Stage 39: Compaction Is About Migration, Not Compression

The name can mislead people. Compaction does not compress page contents. It moves movable pages so that free space lines up into larger contiguous extents.

The success of compaction therefore depends on:

  • how many pages are actually movable
  • whether pinned pages block useful migration
  • whether the target order is realistic under current fragmentation
  • whether the machine can afford the migration cost

This matters because compaction can be expensive and still fail. The machine may have enough free memory overall and still not be able to assemble the specific higher-order extent a caller wanted.
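The gap between "enough free memory" and "enough contiguous memory" can be read directly from /proc/buddyinfo, where each column is the number of free blocks at one order. The sketch below parses a line in that format and compares total free pages with the number of blocks available at or above a target order. The sample line is invented to show a fragmented zone.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, order 1, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: Node <n>, zone <name> <count0> <count1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(x) for x in parts[4:]]
    return zones


def free_pages(counts):
    """Total free pages: a block of order k contains 2**k pages."""
    return sum(count << order for order, count in enumerate(counts))


def blocks_at_or_above(counts, order):
    """Free blocks large enough to satisfy a request of the given order."""
    return sum(counts[order:])


# Invented sample: plenty of small blocks, nothing at order 6 or above.
SAMPLE_BUDDYINFO = "Node 0, zone   Normal   9000 4000 1200 300 40 3 0 0 0 0 0"

counts = parse_buddyinfo(SAMPLE_BUDDYINFO)[(0, "Normal")]
print("free pages:", free_pages(counts))
print("blocks usable for order 4:", blocks_at_or_above(counts, 4))
print("blocks usable for order 6:", blocks_at_or_above(counts, 6))
```

In this sample the zone has tens of thousands of free pages yet no block that could satisfy an order-6 request without compaction succeeding first.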

Stage 40: Watermarks and kswapd Start Pressure Handling Before Obvious Crisis

Linux does not wait for zero free pages before background reclaim starts. Each zone has min, low, and high watermarks. When free pages in a zone fall below the low watermark, kswapd wakes and reclaims in the background until free pages climb back above the high watermark, which keeps the system away from allocator cliffs.

This explains a common operational pattern:

  • the host is not OOM
  • no obvious allocation failures are visible
  • latency is already worse because reclaim is running

By the time direct reclaim or OOM becomes visible, the VM has often been working for a while to avoid that outcome.
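Watermarks are visible per zone in /proc/zoneinfo. The sketch below pulls the free page count and the min, low, and high values out of an excerpt and labels where the zone sits relative to them. The excerpt is invented, the exact layout of /proc/zoneinfo varies between kernel versions, and the state labels are this example's own names rather than kernel terminology.

```python
def parse_zone_watermarks(text):
    """Return {(node, zone): {"free": .., "min": .., "low": .., "high": ..}}."""
    zones = {}
    current = None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "Node" and parts[2] == "zone":
            current = zones.setdefault((int(parts[1].rstrip(",")), parts[3]), {})
        elif current is None:
            continue
        elif len(parts) == 3 and parts[:2] == ["pages", "free"]:
            current["free"] = int(parts[2])
        elif len(parts) == 2 and parts[0] in ("min", "low", "high"):
            # Keep the first occurrence; later per-CPU lines use other labels.
            current.setdefault(parts[0], int(parts[1]))
    return zones


def pressure_state(zone):
    """Label the zone by the highest watermark it is currently below."""
    if zone["free"] < zone["min"]:
        return "below_min"    # callers are likely to enter direct reclaim
    if zone["free"] < zone["low"]:
        return "below_low"    # kswapd is expected to be reclaiming
    if zone["free"] < zone["high"]:
        return "below_high"   # kswapd may still be working toward high
    return "above_high"


# Invented excerpt in the general shape of /proc/zoneinfo.
SAMPLE_ZONEINFO = """\
Node 0, zone   Normal
  pages free     2300
        boost    0
        min      2000
        low      2500
        high     3000
        spanned  1048576
"""

for key, zone in parse_zone_watermarks(SAMPLE_ZONEINFO).items():
    print(key, zone, pressure_state(zone))
```

A zone that spends long periods in the "below_low" state matches the pattern above: no failures, no OOM, but reclaim already costing latency.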

Stage 41: memcg Reclaim Means Shared Hosts Need Two Views

On cgroup-managed systems you often need both:

  • a host-wide allocator view
  • a per-cgroup pressure view

The host may still have comfortable free memory while one service is reclaiming hard or approaching OOM inside its own memory budget. This is especially common on orchestrated systems where the effective arena is the service boundary rather than the whole node.

This is why node-level free-memory graphs are often insufficient for modern Linux memory debugging.
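For the per-cgroup view, cgroup v2 exposes memory.current, memory.max, and memory.events in each group's directory. The sketch below combines the three into one summary. The function name, the output keys, and the sample values are assumptions made for this example. On a real host the inputs would come from the service's own cgroup directory.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines, the format used by memory.events."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            out[parts[0]] = int(parts[1])
    return out


def cgroup_view(current_text, max_text, events_text):
    """Summarise one cgroup's memory usage against its own limit."""
    usage = int(current_text.strip())
    limit_raw = max_text.strip()
    limit = None if limit_raw == "max" else int(limit_raw)  # "max" means no limit
    events = parse_flat_keyed(events_text)
    return {
        "usage_bytes": usage,
        "limit_bytes": limit,
        "usage_ratio": None if limit is None else usage / limit,
        "hit_limit_count": events.get("max", 0),
        "oom_kill_count": events.get("oom_kill", 0),
    }


# Invented sample values for a service with a 2 GiB limit.
view = cgroup_view(
    "1073741824\n",
    "2147483648\n",
    "low 0\nhigh 0\nmax 37\noom 1\noom_kill 1\n",
)
print(view)
```

A non-zero and growing hit_limit_count on a host with plenty of free memory is the "two views" situation in concrete form.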

Stage 42: /proc/slabinfo Only Becomes Meaningful When Tied to Pressure

/proc/slabinfo can look alarming because many caches are large even on healthy systems. The useful questions are:

  • which caches are large
  • which should be reclaimable
  • whether they shrink under pressure
  • whether their growth matches the workload

A large dentry cache after a heavy filesystem walk can be healthy. A large cache that stays pinned despite pressure may be suspicious. Pressure response is often more informative than absolute size.
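Pressure response can be measured by comparing two /proc/slabinfo snapshots, one taken before pressure and one after. The sketch below estimates each cache's footprint as num_objs × objsize, which ignores per-slab overhead, and reports how much each cache shrank. The snapshot lines are invented, and reading the real file usually requires root.

```python
def parse_slabinfo(text):
    """Return {cache name: approximate bytes}, using num_objs * objsize."""
    caches = {}
    for line in text.splitlines():
        if line.startswith("slabinfo") or line.startswith("#"):
            continue  # skip the version line and the column header
        parts = line.split()
        if len(parts) < 6:
            continue
        # Columns: name active_objs num_objs objsize objperslab pagesperslab ...
        caches[parts[0]] = int(parts[2]) * int(parts[3])
    return caches


def shrink_fraction(before, after):
    """Fraction of each cache's footprint that disappeared between snapshots."""
    return {
        name: 1 - after.get(name, 0) / size
        for name, size in before.items()
        if size > 0
    }


# Invented snapshots: dentry shrinks under pressure, kmalloc-512 does not.
BEFORE = """\
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
dentry       850000 900000 192 21 1 : tunables 0 0 0 : slabdata 42858 42858 0
kmalloc-512  399000 400000 512 16 2 : tunables 0 0 0 : slabdata 25000 25000 0
"""
AFTER = """\
slabinfo - version: 2.1
dentry       280000 300000 192 21 1 : tunables 0 0 0 : slabdata 14286 14286 0
kmalloc-512  399500 400000 512 16 2 : tunables 0 0 0 : slabdata 25000 25000 0
"""

report = shrink_fraction(parse_slabinfo(BEFORE), parse_slabinfo(AFTER))
for name, fraction in sorted(report.items(), key=lambda item: item[1]):
    print(f"{name:14s} shrank by {fraction:.0%}")
```

A cache that stays near zero shrinkage across repeated pressure events is the one worth a closer look.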

Stage 43: A Better Classification for Real Incidents

When the allocator causes trouble in production, classify the event before diving into logs:

  • not enough total memory
  • enough memory, wrong zone
  • enough memory, not enough contiguity
  • enough memory, wrong locality
  • enough memory, but policy such as memcg prevents access

This small classification step usually cuts the debugging space down dramatically. It tells you whether to think in terms of capacity planning, fragmentation, NUMA placement, DMA constraints, or service-level limits.

Stage 44: Watermark Boosting and Reserve Logic Exist to Protect Forward Progress

The allocator is not only trying to satisfy every request fairly. It is also trying to stop the machine from collapsing into thrash. Zone watermarks and reserve rules exist so that some requests can still succeed when memory is tight.

In practice Linux distinguishes between:

  • ordinary free memory that many callers may consume
  • protected reserve space that should not vanish under normal pressure
  • emergency access paths for callers with stricter context needs

For that reason two allocations of the same size can behave differently near the edge. A sleeping GFP_KERNEL caller may be expected to reclaim or wait. An atomic or interrupt-context caller cannot sleep, so it may instead be allowed to dip into reserves that ordinary callers cannot touch.

When a kernel log shows low free pages but not a full machine-wide OOM, reserves are often part of the explanation. Linux is trying to preserve just enough allocator headroom that the system can still:

  • complete I/O
  • run reclaim
  • process interrupts
  • execute the code needed to escape pressure

Without reserves, the machine would hit dead ends more often under bursty pressure.
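The effect of reserves can be illustrated with a deliberately simplified model of a watermark check. In this sketch an ordinary request must leave the zone at or above its min watermark, while a request allowed to use reserves only has to stay above half of it. The halved floor is an illustrative assumption. Real kernels apply version-specific reductions and also account for lowmem reserves and allocation order.

```python
def watermark_ok(free_pages, request_pages, min_watermark, can_use_reserves):
    """Toy check: would the zone stay at or above its effective floor?

    Assumption for illustration: reserve users get a floor of min/2.
    """
    floor = min_watermark // 2 if can_use_reserves else min_watermark
    return free_pages - request_pages >= floor


# Same zone state and same request size, two different calling contexts.
free, request, min_wm = 2100, 200, 2000

print("ordinary caller passes fast path:", watermark_ok(free, request, min_wm, False))
print("reserve user passes fast path:   ", watermark_ok(free, request, min_wm, True))
```

The ordinary caller fails this check and would have to reclaim or wait, while the reserve user succeeds immediately, which is the asymmetry described above.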

Stage 45: Direct Reclaim Changes Application Latency Even When Allocation Eventually Succeeds

An allocation failure is easy to notice. Direct reclaim is more dangerous because it can turn an otherwise healthy-looking application into a latency disaster while still allowing allocations to succeed.

The pattern is:

  1. a task asks for memory
  2. the fast path cannot satisfy it
  3. the task enters reclaim itself
  4. the allocation later succeeds
  5. the task reports terrible latency because it spent time doing VM work instead of its real job

This is one reason application teams sometimes insist they have an I/O problem, a scheduler problem, or a lock problem when the root cause is actually memory pressure. The thread was not "stuck" in application code. It was conscripted into reclaim.

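vmstat 1 often exposes this through:

  • rising si and so (swap-in and swap-out), if reclaim is pushing anonymous pages to swap
  • reclaim activity, visible in /proc/vmstat counters such as pgscan_direct and pgsteal_direct
  • growing CPU time in system mode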

If tail latency is the symptom, allocator success or failure is not enough. You need to know whether success came cheaply or only after the task paid reclaim costs on the critical path.
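Whether tasks paid for reclaim themselves can be estimated from /proc/vmstat by diffing two snapshots. The sketch below sums the allocstall_* counters and compares direct scanning with kswapd scanning. Counter names vary somewhat between kernel versions, so the names used here are assumptions to check against the target host, and the snapshot values are invented.

```python
def parse_vmstat(text):
    """Parse 'name value' lines as found in /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def reclaim_delta(before, after):
    """Change in direct-reclaim and kswapd activity between two snapshots."""
    def total(snapshot, prefix):
        return sum(v for k, v in snapshot.items() if k.startswith(prefix))

    def diff(name):
        return after.get(name, 0) - before.get(name, 0)

    return {
        "direct_reclaim_entries": total(after, "allocstall") - total(before, "allocstall"),
        "pages_scanned_direct": diff("pgscan_direct"),
        "pages_scanned_kswapd": diff("pgscan_kswapd"),
    }


# Invented snapshots taken some seconds apart.
BEFORE = "allocstall_normal 100\nallocstall_movable 20\npgscan_direct 5000\npgscan_kswapd 90000\n"
AFTER = "allocstall_normal 160\nallocstall_movable 25\npgscan_direct 9000\npgscan_kswapd 120000\n"

print(reclaim_delta(parse_vmstat(BEFORE), parse_vmstat(AFTER)))
```

A rising direct_reclaim_entries value during a latency spike suggests that application threads were doing reclaim work on their own critical path.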

Stage 46: Page Cache Growth Is Usually a Performance Feature Until Pressure Tests It

Linux uses free memory aggressively for page cache because unused RAM is a wasted opportunity. This creates a recurring confusion in operational reviews. Someone sees page cache growth and concludes memory is leaking. Often the kernel is simply caching useful file data because the machine has room.

The better questions are:

  • does the cache correspond to real file activity
  • does it shrink when pressure arrives
  • are dirty pages being written back at a healthy rate
  • is reclaim cost acceptable for the workload

Healthy page cache growth usually means later file reads become cheaper. The bad case is not "cache exists". The bad case is "reclaim cannot recover enough of it cheaply when another consumer needs memory".

This is why production memory analysis needs workload context. A CI runner, database host, and media-processing box can all show high memory use for totally different good reasons. Absolute used-memory numbers tell you less than the reclaim behaviour of the dominant page populations.

Stage 47: Slab Growth Should Be Read as Object Population, Not as Anonymous Heap

Slab usage alarms often sound dramatic because the total can be large. The right interpretation is usually "which kernel objects are numerous right now" rather than "the kernel heap is out of control".

Large slab populations often correspond to:

  • path lookups and large dentry populations
  • many live sockets
  • inode-heavy scans
  • networking metadata during traffic bursts
  • filesystem-specific metadata for active datasets

That framing changes the next step. If dentry and inode caches are huge, ask whether the machine recently walked massive directory trees. If socket caches are huge, ask whether connection counts spiked. If a kmalloc-* cache is huge and does not shrink, then leak suspicion becomes more reasonable.

The allocator view therefore connects directly to subsystem shape. Slab memory is rarely random. It is usually the memory signature of some active kernel object population.

Stage 48: High-Order Failures Often Mean the Wrong Allocation Strategy, Not Just a Sick Machine

When high-order allocations fail, the machine may truly be fragmented. It is also possible the caller is asking for a brittle form of memory too often.

That leads to a useful engineering question: does this code really need physically contiguous memory at this size?

Sometimes the right fix is not:

  • tune compaction harder
  • grow reserves
  • wait for a newer kernel

Sometimes the right fix is:

  • use page lists instead of one large contiguous run
  • switch to vmalloc
  • redesign the buffer layout
  • reduce burst size so high-order requests are rarer

This is worth stating because allocator incidents are not always solved inside the VM. Bad allocation strategy in one subsystem can manufacture allocator pain for the whole host.

If a workload repeatedly needs order-8 memory (256 physically contiguous pages, or 1 MiB with 4 KiB pages) on a long-lived fragmented host, that is a design smell even if Linux is doing its job correctly.
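It helps to make the order arithmetic concrete, because a buffer size that looks modest in bytes can map to a surprisingly high order. The sketch below assumes 4 KiB base pages and computes the block size for an order and the smallest order that covers a given byte count, similar in spirit to the kernel's get_order() helper.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def order_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one block of the given order (2**order pages)."""
    return (1 << order) * page_size


def order_for_size(size, page_size=PAGE_SIZE):
    """Smallest order whose block is at least `size` bytes."""
    pages = -(-size // page_size)           # ceiling division
    return max(pages - 1, 0).bit_length()   # round page count up to a power of two


for size in (4096, 4097, 64 * 1024, 1024 * 1024):
    order = order_for_size(size)
    print(f"{size:>8d} bytes -> order {order} ({order_bytes(order)} bytes of contiguous pages)")
```

Note how a request one byte over a page boundary doubles the contiguous block it needs.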

Stage 49: NUMA Reclaim and Placement Can Turn a Capacity Problem Into a Topology Problem

On NUMA machines, one node can become hot while others remain comfortable. The allocator then starts making tradeoffs:

  • try harder for local memory
  • fall back to remote memory
  • reclaim locally first
  • preserve policy constraints where possible

That changes performance even when allocations still succeed. A workload can gradually drift from:

  • local pages with good latency
  • to remote pages with worse latency
  • to local reclaim and stalls

From a host-wide dashboard this can look like the machine still has room. From the workload's point of view, its preferred node is already under pressure. Serious NUMA debugging therefore often needs:

  • per-node free memory
  • per-node reclaim activity
  • CPU placement and task locality

The allocator is not just answering "is there memory". It is answering "is there suitable memory near the CPUs that need it".
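Per-node free memory is available under /sys/devices/system/node/nodeN/meminfo, where each line is prefixed with the node number. The sketch below parses that format and computes the free fraction per node. The two-node sample is invented to show a host that looks comfortable in aggregate while node 0 is tight.

```python
def parse_node_meminfo(text):
    """Return {node: {field: value}} from 'Node <n> <Field>: <value> kB' lines."""
    nodes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "Node":
            fields = nodes.setdefault(int(parts[1]), {})
            fields[parts[2].rstrip(":")] = int(parts[3])
    return nodes


def free_fraction(nodes):
    """Fraction of each node's memory that is currently free."""
    return {n: f["MemFree"] / f["MemTotal"] for n, f in nodes.items()}


# Invented sample: concatenated per-node meminfo for a two-node host.
SAMPLE_NODES = """\
Node 0 MemTotal:       1000000 kB
Node 0 MemFree:          50000 kB
Node 1 MemTotal:       1000000 kB
Node 1 MemFree:         500000 kB
"""

nodes = parse_node_meminfo(SAMPLE_NODES)
for node, fraction in free_fraction(nodes).items():
    print(f"node {node}: {fraction:.0%} free")

total_free = sum(f["MemFree"] for f in nodes.values())
total = sum(f["MemTotal"] for f in nodes.values())
print(f"host-wide: {total_free / total:.1%} free")
```

The host-wide figure hides the fact that a workload bound to node 0 is already close to reclaim or remote fallback.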

Stage 50: The Best Mental Check Before Reading Logs

Before opening logs, run one short internal checklist:

  1. Did the request need pages or objects?
  2. Did it need physical contiguity or only virtual contiguity?
  3. Could the caller sleep and reclaim?
  4. Was the shortage host-wide, zone-local, cgroup-local, or node-local?
  5. Was the pain from failure, from direct reclaim, or from compaction cost?

That checklist makes kernel allocator output much easier to interpret because each question maps to a real layer:

  • pages and orders mean buddy-level issues
  • objects and caches mean slab-level issues
  • sleep rules mean GFP and context issues
  • scope means zone, memcg, or NUMA issues
  • latency versus outright failure tells you which recovery path was active

Once those questions are clear, allocator logs stop feeling like a wall of VM jargon. They become evidence about one specific class of pressure in one specific part of the memory stack.

This is also why good allocator debugging feels much less mystical than it first appears. The kernel is usually telling you the truth. The hard part is only knowing which layer of the truth you are looking at.

Stage 51: Allocation Context Is Often More Important Than Allocation Size

Kernel memory problems are frequently misread because people focus on the requested byte count first. In practice the surrounding context often matters more:

  • whether the caller could sleep
  • whether the caller was in interrupt or softirq context
  • whether the request needed DMA reachability
  • whether the request needed physical contiguity
  • whether the request happened inside a memory-cgroup limit

For that reason a small allocation can still be urgent and a larger allocation can still be easy. A tiny GFP_ATOMIC request on a pressured host may be much harder to satisfy safely than a bigger GFP_KERNEL request from a sleeping process that can reclaim and wait.

This point helps explain many allocator warnings that look disproportionate at first glance. The kernel is rarely saying "this many bytes is impossible". It is saying "this kind of memory, in this context, under these rules, is no longer easy to provide".

This is also why allocator debugging improves once you stop asking only "how much memory was requested" and start asking "what promises did the allocator have to keep while satisfying it". The byte count is only one field in the story.

Stage 52: Allocator Warnings Are Compressed Incident Reports

A page-allocation warning or OOM log often looks noisy because it is emitted by low-level code. It is still one of the most information-dense incident artifacts in the system. In a small space it usually tells you:

  • allocation order
  • GFP context
  • zone and node pressure
  • reclaim state
  • which task triggered the event

That makes allocator logs worth reading carefully instead of skimming for only the victim name. They are not generic distress signals. They are short incident reports written from the VM's point of view.
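The first line of a page-allocation warning already carries the order and GFP context, and it can be extracted mechanically. The sketch below uses a regular expression against a line in the commonly seen format. The sample line, including the mode value, is illustrative rather than copied from a real incident, and the exact wording can differ between kernel versions.

```python
import re

# Matches: "<comm>: page allocation failure: order:<n>, mode:<hex>(<flags>)"
ALLOC_FAILURE = re.compile(
    r"(?P<comm>\S+): page allocation failure: order:(?P<order>\d+), "
    r"mode:(?P<mode>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\)"
)


def parse_alloc_failure(line):
    """Extract task name, order, and GFP flags from a warning line, or None."""
    match = ALLOC_FAILURE.search(line)
    if match is None:
        return None
    order = int(match.group("order"))
    return {
        "comm": match.group("comm"),
        "order": order,
        "pages": 1 << order,
        "mode": int(match.group("mode"), 16),
        "flags": match.group("flags").split("|"),
    }


# Illustrative line; the values are made up for this example.
SAMPLE_LINE = (
    "[ 1234.567890] kworker/0:1: page allocation failure: order:4, "
    "mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0"
)

print(parse_alloc_failure(SAMPLE_LINE))
```

Here the request was for 16 contiguous pages from a context that could sleep, which already points toward contiguity rather than raw capacity.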

The Core Practical Model

If you keep only a small set of ideas from this topic, keep these:

  1. Physical RAM becomes page frames, grouped into zones.
  2. The buddy allocator manages contiguous page blocks by splitting and coalescing power-of-two buddies.
  3. kmalloc usually rides on slab or SLUB size-class caches backed by pages.
  4. vmalloc gives virtual contiguity, not physical contiguity.
  5. Reclaim tries to free memory, compaction tries to reshape it, and the OOM killer arrives only after those routes stop making enough progress.

Once that model feels natural, allocator logs stop looking mysterious. A high-order failure, a huge slab cache, a compaction storm, and a cgroup-local OOM all become different expressions of the same layered system.

The companion lab focuses on the buddy layer because it is the part people most often misunderstand. You can allocate blocks, watch higher-order free regions split into smaller buddies, then free them and see coalescing rebuild larger orders. That visual makes /proc/buddyinfo much easier to interpret when you encounter the real thing on a live machine.
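The same split-and-coalesce behaviour can be reproduced in a few lines. What follows is a toy model, not the kernel's implementation: it manages a single region of 2^max_order pages, has no zones, migration types, or per-CPU lists, and uses page indexes as addresses. The class and method names are this example's own.

```python
class BuddyAllocator:
    """Toy buddy allocator over 2**max_order pages."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[k] holds the start page of every free block of order k.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)

    def alloc(self, order):
        """Return the start page of a block of the given order, or None."""
        for k in range(order, self.max_order + 1):
            if not self.free_lists[k]:
                continue
            start = min(self.free_lists[k])
            self.free_lists[k].remove(start)
            # Split down to the requested order, freeing each upper half.
            while k > order:
                k -= 1
                self.free_lists[k].add(start + (1 << k))
            return start
        return None

    def free(self, start, order):
        """Free a block and merge it with its buddy as long as the buddy is free."""
        while order < self.max_order:
            buddy = start ^ (1 << order)  # buddy differs only in bit `order`
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)

    def counts(self):
        """Free block count per order, like one row of /proc/buddyinfo."""
        return [len(blocks) for blocks in self.free_lists]


buddy = BuddyAllocator(max_order=4)   # 16 pages, one free order-4 block
print("initial:        ", buddy.counts())
page = buddy.alloc(0)                 # forces splits at orders 4, 3, 2, 1
print("after alloc(0): ", buddy.counts())
buddy.free(page, 0)                   # coalesces all the way back up
print("after free:     ", buddy.counts())
```

One order-0 allocation leaves a free block at every lower order, and freeing it restores the single top-order block, which is the pattern to look for when reading /proc/buddyinfo columns.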