23-04-2026

How CPU Cache Coherence Actually Works

One CPU core writes x = 1. Another core loops until it sees x == 1. If both cores have private L1 caches, how does the second core know the first one changed the value? If both cores read the same cache line and then one core writes a different word inside that line, why does performance collapse even though the variables are logically unrelated? Why does a program that is perfectly correct on one core become intermittently broken or painfully slow on many cores?

The short answer is cache coherence. Every modern multi-core CPU has a hardware protocol that keeps private caches from drifting into contradictory views of shared memory. The long answer is that this protocol works at cache-line granularity, interacts with speculative execution and store buffers, does not by itself solve memory-ordering problems, and becomes expensive when software bounces ownership of a line between cores too often.

Programmers often mix up three separate ideas:

caching, which is about making memory fast by keeping local copies near a core
coherence, which is about making those local copies agree on the latest value of each cache line
consistency, which is about the order in which loads and stores become visible across cores

If you keep those separate, multi-core behaviour becomes much easier to reason about. Coherence tells you whether a write eventually invalidates stale copies. Consistency tells you whether two different writes can be observed in surprising orders. Barriers and atomics sit at that second boundary, not the first.

This article starts with the basic problem and builds up MESI, invalidation traffic, false sharing, store buffers, fences, and the practical implications for real software. The aim is to leave with a working model for why apparently harmless shared-memory code can stall, race, or both, and what the hardware is doing while it happens.

Why Private Caches Create A Correctness Problem

Imagine a two-core CPU with an L1 data cache on each core. Main memory contains:

address 0x1000 = 0

Core 0 loads from 0x1000. The line is brought into Core 0's L1. Core 1 also loads from 0x1000. Now the same cache line exists in both L1 caches.

So far everything is easy. Both cores have read-only copies of the same line and both see value 0.

Now Core 0 stores value 1 to 0x1000. If nothing else happened, Core 0's L1 would hold 1 while Core 1's L1 would still hold 0. The machine would now have two incompatible truths for the same address.

That cannot be allowed. A shared-memory machine needs a rule that keeps caches from diverging permanently. The classic coherence guarantee is:

writes to a single location eventually become visible to other cores
all cores observe the writes to a single cache line in the same order
stale copies are invalidated or updated so no core can keep reading an old private value forever

The hardware enforces this at the granularity of a cache line, usually 64 bytes on x86 and many ARM systems. Not at the granularity of a variable, not at the granularity of a page, and not at the granularity of a language object.

That line-sized unit is one of the most important practical facts in multi-core programming. If two threads update two separate integers that happen to live in the same 64-byte line, the hardware treats them as one coherence object. This is the root of false sharing.
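As a rough illustration of line granularity, the sketch below maps byte addresses to line numbers. The 64-byte line size is an assumption carried over from the text, and `line_index`, `line_offset`, `same_line` and `check_examples` are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <cstdint>

// Assumed line size; 64 bytes is typical on x86 and many ARM parts, not universal.
constexpr std::uint64_t kLineBytes = 64;

// Which line an address falls in: divide away the low 6 bits.
constexpr std::uint64_t line_index(std::uint64_t addr) { return addr / kLineBytes; }

// Byte position of the address inside its line.
constexpr std::uint64_t line_offset(std::uint64_t addr) { return addr % kLineBytes; }

// Two addresses form one coherence object when their line indices match.
constexpr bool same_line(std::uint64_t a, std::uint64_t b) {
    return line_index(a) == line_index(b);
}

// Spot checks built around the 0x1000 address used earlier in the text.
inline void check_examples() {
    assert(same_line(0x1000, 0x1008));   // different 8-byte words, same line
    assert(!same_line(0x1000, 0x1040));  // 64 bytes later starts the next line
}
```

Two integers at 0x1000 and 0x1008 are independent variables to the programmer and a single unit to the coherence hardware.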

Coherence Is About Single-Line Agreement, Not Global Ordering

Before getting into MESI, it helps to pin down what coherence does not promise.

Suppose Core 0 writes:

data = 42;
ready = 1;

Core 1 spins on ready and then reads data.

Coherence ensures that both cores agree on the current value of the ready line and the data line individually. It does not automatically guarantee that once Core 1 sees ready == 1, it must also already see the new value of data, unless the architecture and the program use the right ordering rules.

That is a memory-consistency problem, not a basic coherence problem.

This distinction matters because beginners often hear "the hardware keeps caches coherent" and conclude that all shared-memory communication is therefore safe. It is not. Coherence gives you one necessary property: for any single line, stale copies do not persist forever. Correct concurrent programs also need ordering guarantees across multiple lines and across multiple instructions.

MESI is about the first problem. Fences, acquire/release operations, and architecture-specific ordering rules address the second.

The Simplest Useful Model, One Cache Line And The MESI States

The classic protocol taught first is MESI:

M, Modified
E, Exclusive
S, Shared
I, Invalid

These are the per-core states for one cache line.

Modified means this cache has the only valid copy and it differs from memory. The line is dirty. If another core wants it, this core must supply or write back the latest data.

Exclusive means this cache has the only valid copy, but it matches memory. No other cache currently holds it. If the core writes the line, it can move to Modified locally without asking anyone else because nobody else has a copy to invalidate.

Shared means this line may exist in multiple caches and matches memory. Reads are fine. A write cannot happen silently because all other shared copies would become stale.

Invalid means this cache does not have a usable copy.

The state machine is easiest to understand through a timeline.

Read miss with no other sharers

Core 0 reads address A
its L1 misses
the coherence fabric checks whether any other cache has the line
nobody does
the line is fetched from a lower cache or memory into Core 0 in Exclusive

Second reader arrives

Core 1 reads A
its L1 misses
the fabric sees Core 0 already has the line in Exclusive
both caches transition to Shared
both can now read it

One core writes

Core 0 wants to write A
the line is currently Shared
Core 0 issues an ownership request, often called a Read For Ownership or RFO
all other sharers must invalidate their copies
once acknowledgements come back, Core 0 transitions to Modified
Core 1's copy becomes Invalid

At that point Core 0 can write locally.

This invalidation step is the core of coherence traffic in write-heavy multi-core code.

Read For Ownership Is The Real Price Of Writing Shared Data

People often say "a store is cheap because it hits in L1". That is only true for data the core already owns in a writeable state. If the line is Shared or Invalid, the core first has to acquire ownership.

That ownership request can involve:

checking whether some other core has a dirty copy
invalidating all shared copies
waiting for acknowledgements
possibly receiving the latest data from another cache rather than from memory

This is why writing to a line other cores recently touched can be much slower than writing to a private line in a tight loop.

The line is not merely "in cache". It must be in the right coherence state.

On x86 a store retires into the store buffer whether or not the core owns the line. If the line is already in Modified, the buffered store drains into L1 almost immediately. If it needs an RFO, the entry waits until the core wins exclusive ownership, and a backlog of such entries can fill the store buffer and stall the pipeline. The latency depends on topology, nearby contention, and whether the latest copy lives in another core's cache, the shared last-level cache, or memory.

This is the hardware mechanism behind lock contention and false sharing. Ownership of the line keeps bouncing.

A Concrete Two-Core MESI Walkthrough

Take one 64-byte line containing a simple integer counter.

Step 1, Core 0 reads `counter`

Initial state:

memory line L = 0
Core 0 L1: Invalid
Core 1 L1: Invalid

After Core 0's read:

Core 0 L1: Exclusive, value 0
Core 1 L1: Invalid

Step 2, Core 1 reads `counter`

Now both caches hold clean copies:

Core 0 L1: Shared, value 0
Core 1 L1: Shared, value 0

Step 3, Core 0 increments `counter`

Core 0 needs write ownership:

Core 0 sends RFO for line L
Core 1 invalidates its Shared copy
Core 0 transitions to Modified
Core 0 writes value 1

State now:

Core 0 L1: Modified, value 1
Core 1 L1: Invalid
memory: still 0, stale until writeback

Step 4, Core 1 reads `counter`

Core 1 misses. The latest value is in Core 0's Modified line, not in memory. The coherence system arranges a transfer:

Core 0 supplies or writes back line L
Core 0 becomes Shared
Core 1 becomes Shared
both now see value 1

That is coherence in action. Memory may lag temporarily while the current truth sits in one dirty cache. The hardware still preserves a single coherent view for all cores.
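The four steps above can be replayed in a toy model. This is a minimal sketch, not how any real cache controller is built: `LineModel` and its members are hypothetical names, the whole line is reduced to one `int`, and a downgrade from Modified writes back to memory instantly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum class Mesi { Modified, Exclusive, Shared, Invalid };

// One cache line tracked across `Cores` private caches.
// Toy model only: no timing, no queues, no partial-line writes.
template <std::size_t Cores>
struct LineModel {
    std::array<Mesi, Cores> state;
    std::array<int, Cores> cached{};  // each cache's copy of the value
    int memory = 0;
    int invalidations = 0;            // copies invalidated by ownership requests

    LineModel() { state.fill(Mesi::Invalid); }

    int read(std::size_t core) {
        assert(core < Cores);
        if (state[core] != Mesi::Invalid) return cached[core];  // local hit
        bool others = false;
        for (std::size_t c = 0; c < Cores; ++c) {
            if (c == core || state[c] == Mesi::Invalid) continue;
            others = true;
            if (state[c] == Mesi::Modified) memory = cached[c];  // dirty owner writes back
            state[c] = Mesi::Shared;                             // M or E downgrades to S
        }
        cached[core] = memory;
        state[core] = others ? Mesi::Shared : Mesi::Exclusive;
        return cached[core];
    }

    void write(std::size_t core, int value) {
        assert(core < Cores);
        if (state[core] == Mesi::Shared || state[core] == Mesi::Invalid) {
            // Read For Ownership: every other valid copy is invalidated.
            for (std::size_t c = 0; c < Cores; ++c) {
                if (c == core || state[c] == Mesi::Invalid) continue;
                if (state[c] == Mesi::Modified) memory = cached[c];
                state[c] = Mesi::Invalid;
                ++invalidations;
            }
        }
        // From Exclusive or Modified the write is silent: nobody else holds a copy.
        cached[core] = value;
        state[core] = Mesi::Modified;
    }
};
```

Calling `read(0)`, `read(1)`, `write(0, 1)`, `read(1)` on a `LineModel<2>` walks through Exclusive, Shared/Shared, Modified/Invalid and back to Shared/Shared, with `memory` staying at 0 until the final read forces the writeback.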

Real Machines Add More States Than MESI

MESI is the basic classroom model. Real processors often extend it.

Intel has documented MESIF-style protocols and AMD MOESI-style protocols, although the exact implementations vary by generation.

MOESI adds:

O, Owned

Owned means one cache holds the dirty authoritative copy, but other caches may also hold Shared copies. This can reduce unnecessary writebacks to memory because the owner can supply data directly to requesters while memory remains stale.

MESIF adds:

F, Forward

Forward designates which shared cache should respond to future reads, reducing duplicate replies when several sharers exist.

You do not need to memorise every variant to write software, but it is useful to know that the real hardware often optimises around the basic MESI picture. The core truths remain:

only one writer owns the line at a time
readers can share clean copies
stale copies must be invalidated or made subordinate to one up-to-date owner

Snooping Versus Directory Tracking

How does the machine know which other cores have a line? There are two broad styles.

Snooping systems

Older and smaller systems often use a broadcast-style snoop mechanism. When one core wants ownership, the request is visible to all relevant caches. Each cache checks whether it has the line and responds accordingly.

This is simple and works well for modest core counts and shared buses or rings. It becomes less attractive as the system grows because broadcast traffic scales poorly.

Directory-based systems

Larger systems often track sharers in some form of directory. Instead of asking every cache, the coherence fabric consults metadata telling it which cores or slices currently hold the line. Invalidations then target only those participants.

Modern many-core CPUs and multi-socket systems rely on increasingly directory-like designs because pure broadcast does not scale cleanly.

Software usually sees none of this directly, but the topology matters for performance. A line bouncing between two hyperactive threads on adjacent cores is one thing. The same line bouncing across sockets with a directory lookup and inter-socket invalidation traffic is much slower.
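The difference between the two styles can be sketched as a message count. The model below is an assumption-heavy toy: `DirectoryEntry`, `acquire_for_write` and `snoop_probes` are hypothetical names, the 64-core count is arbitrary, and real directories are often inexact or hierarchical.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

constexpr std::size_t kCores = 64;  // assumed core count for this toy model

// Hypothetical directory entry for one line: one presence bit per core.
struct DirectoryEntry {
    std::bitset<kCores> sharers;

    void add_reader(std::size_t core) {
        assert(core < kCores);
        sharers.set(core);
    }

    // Directory-style write: message only the cores whose bit is set.
    // Returns how many invalidations were sent.
    std::size_t acquire_for_write(std::size_t writer) {
        assert(writer < kCores);
        std::bitset<kCores> targets = sharers;
        targets.reset(writer);  // the writer does not invalidate itself
        sharers.reset();
        sharers.set(writer);    // the writer is now the only holder
        return targets.count();
    }
};

// Pure broadcast snooping keeps no sharer list, so every other cache is probed.
constexpr std::size_t snoop_probes() { return kCores - 1; }
```

With three sharers, a directory write sends two invalidations, while a broadcast design would probe all 63 other caches regardless of how many actually hold the line.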

The Shared Last-Level Cache Does Not Remove The Need For Coherence

Some programmers initially assume that if cores share an L3 cache, coherence must become trivial. It does not.

Private L1 and often private L2 caches still hold local copies. The shared L3 may act as a directory, a victim cache, or a backing store for clean lines, but each core still needs coherent rules for:

whether its local copy is valid
whether another core owns a dirty version
when invalidations must arrive
when the core can write without further permission

The shared L3 helps coordination and reduces some traffic, but the private-cache coherence problem remains.

False Sharing, The Performance Disaster That Looks Innocent In Source Code

False sharing happens when two threads write different variables that happen to sit in the same cache line.

Consider:

struct Counters {
    uint64_t requests;
    uint64_t errors;
};

Thread 0 updates requests. Thread 1 updates errors. The variables are logically unrelated, but if they live in the same 64-byte line the coherence hardware cannot treat them independently. The entire line must move into Modified state on whichever core writes next.

The resulting pattern looks like:

Core 0 owns line L, writes requests
Core 1 wants to write errors, sends RFO, invalidates Core 0
Core 1 owns line L, writes errors
Core 0 wants to write requests again, sends RFO, invalidates Core 1
repeat thousands or millions of times

No data race is needed. The program can be perfectly correct and still crawl because ownership of one line keeps ping-ponging between cores.

This is one of the most important practical consequences of coherence being line-granular rather than variable-granular.

A typical mitigation

struct alignas(64) PaddedCounter {
    uint64_t value;
    char pad[56];
};

Or:

struct Counters {
    alignas(64) uint64_t requests;
    alignas(64) uint64_t errors;
};

Padding pushes the counters onto different lines so each core can keep its own hot line in Modified state without fighting the other core.

This costs memory, but the performance improvement can be enormous on write-heavy paths.
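One way to keep such a layout from regressing silently is to check it at compile time. The sketch below assumes a 64-byte line; `Packed`, `Separated` and `on_different_lines` are illustrative names rather than anything from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;  // assumed line size

struct Packed {     // both fields sit in the first 16 bytes of the struct
    std::uint64_t requests;
    std::uint64_t errors;
};

struct Separated {  // each field starts its own 64-byte line
    alignas(kLine) std::uint64_t requests;
    alignas(kLine) std::uint64_t errors;
};

// Offsets are relative to the start of the struct, so the first check describes
// a Packed object that itself starts on a line boundary.
static_assert(offsetof(Packed, errors) / kLine == offsetof(Packed, requests) / kLine,
              "packed fields share a line");
static_assert(offsetof(Separated, errors) - offsetof(Separated, requests) == kLine,
              "separated fields are one full line apart");
static_assert(sizeof(Separated) == 2 * kLine, "the memory cost: 128 bytes instead of 16");

// Runtime check on real objects: do two fields live on different lines?
inline bool on_different_lines(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    const auto la = reinterpret_cast<std::uintptr_t>(a) / kLine;
    const auto lb = reinterpret_cast<std::uintptr_t>(b) / kLine;
    return la != lb;
}
```

The `static_assert` lines fail the build if someone later removes the alignment, which is usually cheaper than rediscovering the slowdown in production.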

A European Example, Telemetry Counters In A Paris Edge Service

Imagine a packet-processing service in Paris handling requests at a major CDN edge. Each worker thread updates per-thread counters for packets, drops, and retries. An engineer decides to keep these counters in one compact array because it looks cache-friendly.

The source code is simple:

counters[thread_id].packets++;
counters[thread_id].drops++;

If each thread's structure is small and the array elements pack tightly, several threads' counters may share the same line. Throughput falls even though each thread mostly touches its own slot, because the line containing adjacent counters bounces between cores. Hardware performance counters show heavy HITM traffic (loads that hit a line modified in another core), and the service spends time arbitrating ownership rather than forwarding packets.

The fix is not an algorithmic rewrite. It is a layout change: pad or align each thread's hot counters to a full line.

This kind of issue is common enough that serious high-throughput software engineers think about cache-line layout as part of API design, not as an afterthought.

Coherence Traffic Also Explains Why Spinlocks Can Melt Down

A simple spinlock usually involves one shared word:

while (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 1) {
    // spin
}

Every contending core repeatedly reads and tries to write the same lock line. The winner takes ownership and writes 1. Losers keep rereading or retrying. When the lock is released, another core acquires ownership. Under contention the lock line becomes the hottest coherence object in the system.

This is why scalable locking schemes try to reduce global bouncing:

ticket locks make ordering fair but still create a hot line
MCS and CLH locks reduce contention on one shared line by giving waiters local spinning locations
sharded counters and per-CPU structures avoid central shared words where possible

The problem is not abstract "contention" alone. It is the concrete cost of moving a cache line's ownership across cores again and again.
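A common first step short of a queue lock is test-and-test-and-set: waiters spin on plain loads, which can be served from a Shared copy, and only attempt the write when the lock looks free. The sketch below is a minimal illustration under that assumption; `TtasLock` and `count_with_lock` are hypothetical names, and the `yield` call is there only to keep the demo well-behaved on machines with few cores.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

class TtasLock {
    std::atomic<bool> locked_{false};

public:
    void lock() {
        for (;;) {
            // The exchange is a write, so it needs exclusive ownership of the line.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // While the lock is held, wait with loads only: no ownership requests.
            while (locked_.load(std::memory_order_relaxed)) std::this_thread::yield();
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Hypothetical driver: `threads` workers each add `iters` to a plain counter.
inline std::uint64_t count_with_lock(std::size_t threads, std::uint64_t iters) {
    assert(threads > 0);
    TtasLock lock;
    std::uint64_t total = 0;  // protected by the lock, deliberately not atomic
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

This reduces ownership requests while the lock is held, but every release still triggers a burst of contenders on the same line, which is the problem queue locks address.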

Store Buffers Mean A Core Can "Write" Before The Rest Of The Machine Sees It

Modern cores do not wait for every store to become globally visible before moving on. Stores usually enter a store buffer. From the writing core's point of view the value may already appear committed, because loads from the same core can often forward from the buffered store. Other cores may not see the store yet.

This is crucial for performance. If every store had to complete full coherence negotiation and drain to the cache hierarchy before retirement, cores would stall constantly.

It is also crucial for understanding memory ordering. A core can have:

store A sitting in the store buffer
store B issued afterwards
some loads and branches executing speculatively

The architecture defines rules for what other cores may observe and when fences or atomic operations force ordering constraints.

Coherence still ensures eventual agreement on each line. Store buffers explain why visibility is not instantaneous and why one core can temporarily reason with a fresher private view than the rest of the machine.
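The classic way to observe store buffering is the two-thread litmus test sketched below: each thread stores to one variable and then loads the other. With relaxed operations the C++ memory model allows both loads to return 0; with sequentially consistent operations it forbids that outcome. `both_zero_count` is a hypothetical helper, and because it starts fresh threads for every trial the race window is narrow, so a count of zero in the relaxed case proves nothing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering litmus test `trials` times and counts how often
// both threads read 0. Only relaxed and seq_cst are valid for both a store
// and a load, so other orders are rejected.
inline int both_zero_count(int trials, std::memory_order order) {
    assert(order == std::memory_order_relaxed || order == std::memory_order_seq_cst);
    int both_zero = 0;
    for (int i = 0; i < trials; ++i) {
        std::atomic<int> x{0}, y{0};
        int r0 = -1, r1 = -1;
        std::thread t0([&] {
            x.store(1, order);   // may sit in the store buffer...
            r0 = y.load(order);  // ...while this load already executes
        });
        std::thread t1([&] {
            y.store(1, order);
            r1 = x.load(order);
        });
        t0.join();
        t1.join();
        if (r0 == 0 && r1 == 0) ++both_zero;
    }
    return both_zero;
}
```

With `std::memory_order_seq_cst` the result is always 0. With `std::memory_order_relaxed` a non-zero count is permitted, on x86 as well as on weaker architectures, even though each line remains perfectly coherent throughout.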

Invalidation Queues And Why Visibility Has Latency

Receiving an invalidation is also not magic. A core may need to:

notice the coherence request
invalidate or downgrade a line in its local cache
confirm completion

Real implementations use queues and internal buffering because the coherence protocol is part of a high-frequency out-of-order machine, not a one-step cartoon. This introduces latency. A write may need to wait for invalidation acknowledgements before the writer fully owns the line. That waiting time is part of the cost of sharing.

Software sees this as:

variable lock acquisition latency
throughput collapse under false sharing
performance cliffs when threads move across sockets

The machine is still coherent. It just pays a real finite protocol cost to stay that way.

Coherence And Consistency Meet At Fences And Atomics

Once the hardware can keep single lines coherent, software still needs ways to communicate ordered facts across lines and across time.

The classic message-passing pattern:

data = 42;
flag.store(1, std::memory_order_release);

Reader:

while (flag.load(std::memory_order_acquire) == 0) { }
use(data);

The release on the writer side and the acquire on the reader side create an ordering edge. If the reader sees flag == 1, it must also see the preceding write to data.

Without the right ordering primitives, coherence alone may leave room for surprising observations on weaker architectures. ARM and RISC-V are more relaxed than x86, so correct portable code uses language-level atomics rather than relying on x86 habits.

This is where multi-core programming becomes genuinely hard. The hardware will eventually make each line coherent. The programmer still has to express which writes must be visible before which flags and which loads may or may not move.
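For reference, here is the message-passing pattern as a complete function. It is a minimal sketch: `publish_and_read` is a hypothetical wrapper, and the payload is deliberately a plain `int` to show that the release/acquire pair on the flag is what makes reading it safe.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the value the reader observed after seeing the flag.
inline int publish_and_read() {
    int data = 0;  // plain, non-atomic payload
    std::atomic<int> flag{0};
    int seen = -1;

    std::thread reader([&] {
        // Acquire: later reads in this thread cannot move before this load.
        while (flag.load(std::memory_order_acquire) == 0) std::this_thread::yield();
        seen = data;
    });
    std::thread writer([&] {
        data = 42;
        // Release: the write to data is ordered before the flag becomes visible.
        flag.store(1, std::memory_order_release);
    });

    writer.join();
    reader.join();
    assert(seen != -1);  // the reader ran to completion
    return seen;
}
```

Replacing both orders with `std::memory_order_relaxed` would turn the access to `data` into a data race under the C++ memory model, regardless of whether a particular x86 build happens to behave.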

x86 Is Fairly Strong, But Not Strong Enough To Ignore Ordering

x86 uses Total Store Order, or TSO, which is stronger than ARM's default model. Many simple patterns that fail on ARM work on x86 by accident or by architectural guarantee. Stores become visible to other cores in program order. Loads are not reordered with older loads, as they can be on weaker models. This makes x86 concurrency feel forgiving.

It is still a mistake to ignore ordering formally.

First, compilers can reorder ordinary non-atomic operations even when the hardware would not.

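Second, store buffers still allow a later load to complete before an earlier store to a different address becomes visible to other cores. Patterns that depend on store-then-load ordering, such as Dekker-style mutual exclusion, therefore still need a fence or a locked instruction on x86.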

Third, portable code cannot assume x86 forever. A lock-free queue correct only on x86 is not correct in the broader language memory-model sense.

The right discipline is to use the language's atomic operations and think in acquire, release, and sequentially consistent edges unless you are doing deep architecture-specific work.

Memory Barriers Are Not About "Refreshing The Cache"

A common wrong mental model is that a fence "flushes caches" or "forces the CPU to update RAM". That is not usually what it means.

A memory barrier primarily constrains ordering:

before which later operations may proceed
when prior stores must become visible relative to later loads or stores
how speculation or buffering must be limited

The exact semantics depend on architecture and barrier type:

load fence
store fence
full fence
acquire
release
sequentially consistent

On x86 many atomic RMW operations such as lock xadd already imply strong ordering. On ARM explicit barriers like dmb ish are common. At the language level, std::atomic_thread_fence and per-operation memory orders expose these rules more portably.

Thinking of a fence as "refresh cache" leads to bad code. Thinking of it as "publish prior work before allowing later observation" is much closer.
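To make the "ordering, not flushing" reading concrete, the sketch below publishes a value using standalone fences around relaxed flag operations. `publish_with_fences` is a hypothetical name. Nothing here writes a cache line back to RAM; the fences only constrain which reorderings are allowed.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Publication using std::atomic_thread_fence instead of per-operation orders.
inline int publish_with_fences() {
    int data = 0;
    std::atomic<int> flag{0};
    int seen = -1;

    std::thread reader([&] {
        while (flag.load(std::memory_order_relaxed) == 0) std::this_thread::yield();
        // Acquire fence: the read of data below cannot move above this point.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = data;
    });
    std::thread writer([&] {
        data = 42;
        // Release fence: the write to data above cannot move below this point.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(1, std::memory_order_relaxed);
    });

    writer.join();
    reader.join();
    assert(seen != -1);  // the reader ran to completion
    return seen;
}
```

In most code the per-operation form (`store` with release, `load` with acquire) is easier to read; the fence form is shown only because it separates the ordering constraint from the memory access itself.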

Read-Modify-Write Atomics Are Coherence Hotspots By Design

An operation like:

atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);

still needs exclusive ownership of the line containing counter. Even with relaxed ordering, the core must acquire the line in a writeable state and perform the read-modify-write atomically.

If many threads hammer one atomic counter, the bottleneck is often not the arithmetic. It is the coherence protocol serialising ownership of that line.

This is why scalable systems use:

per-thread counters
per-CPU counters
batching before publishing
reduction phases rather than one central atomic in the hot path

The atomic is correct. The line is still one coherence object.
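The alternatives listed above can be combined in a few lines. The following is a sketch under stated assumptions: `Shard` and `sharded_count` are hypothetical names, the 64-byte alignment assumes the line size discussed earlier, and heap allocation of over-aligned types relies on C++17.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct alignas(64) Shard {  // one line per shard; 64 bytes is an assumption
    std::atomic<std::uint64_t> value{0};
};

// Each thread counts privately, publishes once per `batch` increments to its
// own shard, and the caller sums the shards after all threads have joined.
inline std::uint64_t sharded_count(std::size_t threads, std::uint64_t iters,
                                   std::uint64_t batch) {
    assert(threads > 0 && batch > 0);
    std::vector<Shard> shards(threads);
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            std::uint64_t local = 0;  // private to this thread, no coherence traffic
            for (std::uint64_t i = 0; i < iters; ++i) {
                if (++local == batch) {  // publish once per batch
                    shards[t].value.fetch_add(local, std::memory_order_relaxed);
                    local = 0;
                }
            }
            shards[t].value.fetch_add(local, std::memory_order_relaxed);  // flush remainder
        });
    }
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;  // reduction phase
    for (auto& s : shards) total += s.value.load(std::memory_order_relaxed);
    return total;
}
```

The trade-off is that the total is only exact once the writers have finished or flushed. A reader that sums the shards mid-run sees a slightly stale value, which is usually acceptable for telemetry and not for anything like a reference count.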

Hyper-Threading Does Not Remove Coherence Costs

Two logical threads on one physical core share that core's L1, so passing a line between siblings involves no ownership transfer between caches. This does not remove the coherence problem for the rest of the machine.

Once threads run on different physical cores, and especially different sockets, line migration becomes much more expensive. Schedulers therefore matter. Pinning two heavily communicating threads close together can reduce coherence latency. Letting them float unpredictably across the machine can make line bouncing worse.

This is one reason performance engineers use affinity controls:

taskset -c 2,3 ./worker-benchmark
numactl --cpunodebind=0 --membind=0 ./worker-benchmark

These tools do not change the coherence protocol. They change how far the line has to travel and which caches tend to own it.

Multi-Socket Systems Make Ownership Transfer More Expensive Still

On a dual-socket server, a modified line owned by a core on socket 0 may need to move or be observed by a core on socket 1. Now coherence traffic crosses the inter-socket fabric. Latency rises. Bandwidth is more constrained. Directory lookups may involve remote slices or home agents depending on the design.

For software this means:

lock contention across sockets is worse than within one socket
false sharing across sockets is especially toxic
sharding by NUMA node is often worth the complexity

A database server in Amsterdam with worker threads and memory localised per socket can outperform a naive global-queue design by a wide margin partly because it reduces remote ownership transfers, not only because it improves DRAM locality.

Prefetching And Speculation Do Not Bypass Coherence

Out-of-order execution, hardware prefetchers, and speculative loads make cores faster, but they still have to obey coherence rules. A speculative load may bring a shared line in early. A prefetcher may fetch data into cache before the program explicitly uses it. If another core later wants to write that line, the sharer still has to invalidate it.

This matters because:

speculative or prefetched sharing can increase invalidation traffic
hardware can create coherence participants even before the source code "logically" uses the value

The machine remains coherent. The visible performance effects can still look surprising if you imagine that coherence only engages on explicit loads and stores in program order.

Measuring Coherence Problems On Linux

Linux exposes performance counters that make coherence pain visible.

General tools:

perf stat -e cache-references,cache-misses ./bench
perf stat -e cycles,instructions,LLC-loads,LLC-stores ./bench

More specific cache-to-cache analysis:

perf c2c record ./bench
perf c2c report --stdio

perf c2c is especially useful for false sharing and contended cache lines. It can highlight lines that generate many HITM events (loads that hit a line modified in another core), which is a strong clue that ownership bouncing is hurting you.

For thread placement and NUMA:

numastat -p $PID
taskset -pc $PID

These tools do not magically solve concurrency problems, but they turn "the multi-core version is slow" into concrete evidence about which lines are hot and where they are moving.

A Small False-Sharing Benchmark

This stripped-down C++ example is enough to trigger the problem:

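#include <atomic>
#include <cstdint>
#include <thread>

// Two 8-byte counters side by side: they share one cache line.
struct CountersBad {
    std::atomic<uint64_t> a{0};
    std::atomic<uint64_t> b{0};
};

// alignas(64) rounds each counter up to its own 64-byte line.
struct alignas(64) CounterPad {
    std::atomic<uint64_t> value{0};
};

struct CountersGood {
    CounterPad a;
    CounterPad b;
};

// Two threads, each incrementing only its own counter.
void run(std::atomic<uint64_t>& a, std::atomic<uint64_t>& b) {
    constexpr uint64_t kIters = 100000000;
    std::thread t1([&] {
        for (uint64_t i = 0; i < kIters; ++i) a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (uint64_t i = 0; i < kIters; ++i) b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
}

int main() {
    CountersBad bad;
    CountersGood good;
    run(bad.a, bad.b);                // unpadded: one line, two writers
    run(good.a.value, good.b.value);  // padded: one line per writer
}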

The unpadded version often runs much slower because the two counters live in one line. The padded version consumes more memory but lets each thread own a separate line.

The key lesson is that memory_order_relaxed removes extra ordering guarantees, but it does not remove the need to acquire line ownership for each atomic increment. Coherence traffic remains.

Language Memory Models Sit On Top Of Coherence, Not Instead Of It

C++, Rust, Java, and Go all give programmers concurrency primitives with specific visibility and ordering rules. Those language guarantees rely on the hardware coherence protocol but go beyond it.

For example:

C++ atomics map to architecture-specific instructions and fences
Rust uses LLVM's memory model and then architecture-specific lowering
Java's volatile and monitor semantics provide well-defined happens-before edges

The runtime or compiler cannot ignore coherence because all these abstractions ultimately operate on cache lines moving through hardware states. The language model tells you what behaviour is legal. Coherence and consistency hardware make it possible.

This is why "the CPU keeps caches coherent" is only the start of the story. Safe concurrent programming lives at the point where language semantics meet hardware protocol costs.

What Coherence Means For Lock-Free Data Structures

Lock-free algorithms often reduce blocking, but they do not escape coherence traffic. A compare-and-swap on a head pointer still needs exclusive ownership of the line containing that pointer. Under heavy contention, many lock-free structures degrade because all contenders hammer the same line with failed CAS attempts.

Good lock-free designs therefore still try to:

spread hot metadata across lines
reduce global shared words
avoid one central head or tail when possible
batch local work before publishing to a shared queue

Lock-free does not mean coherence-free.
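The retry loop at the centre of most lock-free structures makes the cost visible. The sketch below is a deliberately limited stack: `PushOnlyStack` and `push_from_threads` are hypothetical names, and it offers only `push` and `pop_all` because single-node pop needs ABA protection that is out of scope here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Node {
    int value;
    Node* next;
};

class PushOnlyStack {
    std::atomic<Node*> head_{nullptr};

public:
    // Returns the number of failed CAS attempts. Each failure means another
    // core took ownership of the head line and changed it first.
    std::size_t push(Node* n) {
        assert(n != nullptr);
        std::size_t retries = 0;
        n->next = head_.load(std::memory_order_relaxed);
        while (!head_.compare_exchange_weak(n->next, n, std::memory_order_release,
                                            std::memory_order_relaxed)) {
            ++retries;  // n->next now holds the head value that beat us
        }
        return retries;
    }

    // Detach the whole list with one exchange.
    Node* pop_all() { return head_.exchange(nullptr, std::memory_order_acquire); }
};

// Hypothetical driver: every thread pushes its own preallocated nodes.
inline std::size_t push_from_threads(std::size_t threads, std::size_t per_thread) {
    PushOnlyStack stack;
    std::vector<std::vector<Node>> storage(threads, std::vector<Node>(per_thread));
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            for (Node& n : storage[t]) stack.push(&n);
        });
    }
    for (auto& th : pool) th.join();

    std::size_t count = 0;
    for (Node* n = stack.pop_all(); n != nullptr; n = n->next) ++count;
    return count;
}
```

No thread ever blocks, yet every `push` still has to win the line holding `head_`, and the retry count returned by `push` grows with the number of contenders.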

The Practical Rules Engineers End Up Following

A few practical rules survive repeated contact with real systems.

Keep write-heavy per-thread or per-core data on separate cache lines.

Assume one global atomic counter in a hot path will eventually become a bottleneck.

Think about NUMA placement if a hot line or lock crosses sockets.

Use proper language-level atomics and ordering, not plain loads and stores plus hope.

Treat padding and alignment as performance tools, not superstitions.

Measure with perf c2c or equivalent before guessing.

Remember that coherence is line-granular. Source-level independence is not enough.

These rules sound mundane because the hardest concurrency bugs and performance cliffs often come from mundane facts ignored too long.

The Right Mental Model

Each core has private fast caches. The machine still needs one coherent truth for every cache line. The coherence protocol provides that by letting many cores share read-only clean copies, while ensuring only one core at a time can own a line for writing. Getting that ownership requires invalidating or downgrading other copies. This protocol traffic is the hidden cost behind shared writable memory.

MESI is the simplest good model:

Modified, one dirty owner
Exclusive, one clean owner
Shared, many clean readers
Invalid, no usable copy

False sharing happens because the line is the unit of coherence. Barriers matter because coherence does not define global ordering across lines. Store buffers matter because visibility is not instantaneous. Multi-socket systems hurt more because ownership transfer has farther to travel.

Multi-core programming is genuinely hard for this reason. The programmer writes in terms of variables and algorithms. The hardware arbitrates ownership and visibility in 64-byte chunks moving through a protocol under finite latency. The code is correct only when it respects the ordering model, and it is fast only when it minimises needless line bouncing.

The Difference Between Coherence Traffic And Ordinary Cache Misses

A plain cache miss and a coherence stall can both look like "waiting on memory", but they are not the same event.

A plain miss usually means the core needs data that is not in its local cache hierarchy at all, or not at the current level. The request travels downward, perhaps to the shared last-level cache and then DRAM.

A coherence stall usually means the data exists, but another core currently owns the right to modify it or still holds a stale copy that must be invalidated before this core can proceed. The line may be physically close while logically unavailable.

This is an important performance distinction. If a workload is bottlenecked on ordinary misses, the remedies tend to involve:

improving locality
reducing working-set size
blocking or tiling data access
using better prefetch behaviour

If the workload is bottlenecked on coherence, the remedies are different:

reduce write sharing
pad or align hot fields
shard counters
reduce global locks
keep communicating threads topologically close

Without that distinction people often chase the wrong cache problem. A hot atomic counter can fit comfortably in L1 all day and still behave badly because the line does not stay owned by the same core for long.

Sequential Consistency Is Easy To State And Expensive To Approximate

The easiest memory model to describe is sequential consistency. It says all threads observe all memory operations as if there were one total global order consistent with each thread's program order. It is the mental model many programmers naturally assume before they have studied concurrency.

Real CPUs do not execute that model directly because it would waste too much performance. Store buffers, speculative execution, out-of-order execution, and overlapping cache traffic all exist because strict global ordering would be too expensive.

Language-level sequentially consistent atomics therefore usually compile to stronger instruction sequences or stronger fence behaviour than acquire/release or relaxed operations. They are easier to reason about, but the hardware has to do extra coordination work to approximate that simpler mental model.

This matters for coherence because strong ordering often forces the machine to expose ownership and visibility changes more conservatively. A line may already be coherent in the basic MESI sense, but a stronger fence or locked instruction can still constrain when the core may move on or when other cores are allowed to observe the result.

So when programmers choose weaker orders they are not "turning coherence off". They are deciding how much global ordering they need on top of a still-coherent per-line machine.

Release And Acquire Are Best Understood As Publication Rules

One of the clearest concurrency patterns is publish then observe.

Writer:

payload = new_value;
flag.store(1, std::memory_order_release);

Reader:

while (flag.load(std::memory_order_acquire) == 0) { }
consume(payload);

The release store says, in effect, "before other threads are allowed to observe this flag becoming 1 through a matching acquire, all prior writes in this thread must already be visible in the required way." The acquire load says, "once I see the published flag, later operations in this thread must not move before it."

Coherence is still doing the line-level maintenance. The release and acquire semantics define what that coherent visibility means across more than one line and more than one instruction.

If you try to reason about release and acquire as cache refresh commands, the model stays muddy. If you reason about them as publication and observation rules, the model becomes far cleaner.
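As a minimal runnable sketch of the publish-then-observe pattern, the two fragments can be wrapped in a pair of threads. The function name `publish_and_observe` and the local variable names are assumptions made for illustration. The payload is a plain non-atomic `int`, which is safe here only because the release store and acquire load order it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: publishes `new_value` from a writer thread and returns
// what the reader thread observed after seeing the flag.
int publish_and_observe(int new_value) {
    int payload = 0;            // plain data, ordered only by the flag
    std::atomic<int> flag{0};   // the publication point
    int seen = -1;

    std::thread reader([&] {
        // Acquire: once we see 1, the payload write is guaranteed visible.
        while (flag.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    std::thread writer([&] {
        payload = new_value;                        // write the data first
        flag.store(1, std::memory_order_release);   // then publish it
    });

    writer.join();
    reader.join();
    return seen;
}
```

If both orders were weakened to `memory_order_relaxed`, the flag would still be coherent, but nothing would order the read of `payload` after the write, and the program would contain a data race.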

One Line Can Become A Distributed Queue Of Waiting Writers

A heavily contended line is not just moving magically between cores. In practice, many cores may line up requesting ownership. The coherence fabric, directories, queues, and arbitration logic have to serve those requests in some order while preserving correctness.

This can create visible unfairness or latency spikes:

one core wins ownership repeatedly because it already sits near the home slice
another core keeps missing the timing window and suffers repeated retries
a lock handoff that looked cheap in microbenchmarks becomes noisy once more contenders arrive

This is one reason scalable queue locks and per-core data structures help so much. They reduce the number of writers competing for one coherence object at the same time. The line stops being a globally contested rendezvous point.
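One well-known queue lock is the MCS lock, in which each waiter spins on a flag inside its own node instead of on one shared word. The following is a simplified sketch of that idea, not a production implementation: the type names `McsNode` and `McsLock` and the demo function are illustrative assumptions, and the waiters yield while spinning so the example stays friendly to small machines.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> locked{false};
};

class McsLock {
    std::atomic<McsNode*> tail_{nullptr};  // the only globally shared word
public:
    void lock(McsNode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Join the queue with one atomic exchange on the shared tail.
        McsNode* prev = tail_.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Spin on our OWN node, so waiting does not bounce a shared line.
            while (me.locked.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock(McsNode& me) {
        McsNode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            McsNode* expected = &me;
            // No visible successor: try to mark the queue empty.
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel)) {
                return;
            }
            // A successor is mid-enqueue; wait for it to link itself in.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->locked.store(false, std::memory_order_release);  // hand off
    }
};

// Hypothetical demo: several threads increment a plain counter under the lock.
long mcs_counter_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    McsLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            McsNode node;  // one queue node per thread, reused across rounds
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The design choice worth noticing is that each handoff writes exactly one flag that exactly one waiter is reading, so ownership moves point to point instead of being fought over by every contender.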

Cache-Line Placement Becomes Part Of Data-Structure Design

At small scale, programmers think in fields and pointers. At high-performance multi-core scale, serious engineers often also think in cache lines.

For example, a ring buffer may place:

producer head on one line
consumer tail on a different line
bulk data on following lines

The point is not only locality. It is keeping independent write ownership separate. If producer and consumer indices share one line, every push and pop can trigger unnecessary invalidation traffic even when the algorithm itself has minimal logical contention.
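A single-producer, single-consumer ring with that layout might look like the sketch below. The 64-byte line size, the class name `SpscRing`, and the demo function are assumptions for illustration. The line size in particular should be checked against the target hardware.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

constexpr std::size_t kLine = 64;  // ASSUMED cache-line size

template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    alignas(kLine) std::atomic<std::size_t> head_{0};  // written by producer only
    alignas(kLine) std::atomic<std::size_t> tail_{0};  // written by consumer only
    alignas(kLine) T slots_[N];                        // bulk data on later lines
public:
    bool try_push(const T& v) {
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false;  // full
        slots_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    bool try_pop(T& out) {
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t) return false;  // empty
        out = slots_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);  // free the slot
        return true;
    }
};

// Hypothetical demo: the producer pushes 1..count and the consumer sums them.
long long spsc_sum(int count) {
    SpscRing<int, 1024> ring;
    long long sum = 0;
    std::thread consumer([&] {
        int v = 0;
        for (int received = 0; received < count;) {
            if (ring.try_pop(v)) { sum += v; ++received; }
            else std::this_thread::yield();
        }
    });
    for (int i = 1; i <= count; ++i) {
        while (!ring.try_push(i)) std::this_thread::yield();
    }
    consumer.join();
    return sum;
}
```

Each index has exactly one writer, so its line can stay owned by that writer's core and is only read by the other side when it checks for full or empty.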

The same idea appears in:

per-core run queues in kernels
sharded allocator arenas
lock metadata separated from read-mostly object headers
NIC descriptor rings aligned to reduce producer-consumer bouncing

This is where software layout starts to acknowledge the hardware's actual coherence unit.

Why Reader-Writer Patterns Often Work Better Than Symmetric Writers

Shared-read, rare-write patterns fit coherence protocols well. Many cores can hold a line in Shared state and read it cheaply. The cost only spikes when a writer arrives and needs exclusivity.

This is why designs that separate:

one publisher, many readers
mostly immutable metadata, occasional updates
read-copy-update style publication

often scale better than designs with many peer writers hammering the same structure.

Read-copy-update, or RCU, is a good example from kernel engineering. Instead of forcing readers to fight over mutable shared state, an updater often builds a new version and then publishes a pointer change, while old readers finish against the old view. This still relies on coherence and ordering, but it drastically changes the shape of write ownership traffic. The hot path becomes mostly reads plus occasional publication rather than constant symmetric mutation of one shared line.

The point is not that RCU "beats coherence". It works well partly because it cooperates with the way coherent caches handle many readers and fewer writers.
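The publication half of that pattern can be sketched in user-space C++ with one atomic pointer. This is not the kernel's RCU API and it deliberately skips the hard part, which is deciding when an old version can be freed: here every version simply stays alive until all readers have been joined. The names `Config` and `rcu_style_demo` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>
#include <vector>

// Illustrative immutable snapshot. Invariant: checksum == version * 2.
struct Config {
    int version;
    int checksum;
};

// Returns true if no reader ever saw a half-built snapshot and the final
// published version is the last one written.
bool rcu_style_demo(int updates, int readers) {
    assert(updates >= 0 && readers >= 0);
    // ASSUMPTION: old versions are kept alive for the whole run, which
    // stands in for a real grace-period or reclamation mechanism.
    std::vector<std::unique_ptr<Config>> versions;
    versions.push_back(std::make_unique<Config>(Config{0, 0}));

    std::atomic<const Config*> current{versions.back().get()};
    std::atomic<bool> done{false};
    std::atomic<bool> consistent{true};

    std::vector<std::thread> pool;
    for (int i = 0; i < readers; ++i) {
        pool.emplace_back([&] {
            while (!done.load(std::memory_order_acquire)) {
                // Readers only load the pointer; they never write shared state.
                const Config* snap = current.load(std::memory_order_acquire);
                if (snap->checksum != snap->version * 2) {
                    consistent.store(false, std::memory_order_relaxed);
                }
            }
        });
    }

    for (int v = 1; v <= updates; ++v) {
        // Build the new version privately, then publish it with one store.
        versions.push_back(std::make_unique<Config>(Config{v, v * 2}));
        current.store(versions.back().get(), std::memory_order_release);
    }

    done.store(true, std::memory_order_release);
    for (auto& t : pool) t.join();
    return consistent.load() && current.load()->version == updates;
}
```

The only line that changes ownership on the hot path is the one holding the pointer, and it changes once per update rather than once per read.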

DMA And IO Coherence Add Another Layer

CPUs are not always the only agents touching memory. Devices such as NICs, GPUs, and storage controllers may perform DMA into shared memory. Now the machine needs some story about coherence between CPU caches and device-visible memory too.

On some systems and for some mappings, the platform is IO coherent: the hardware keeps device accesses coherent with CPU caches, so device writes become visible to CPUs without explicit cache maintenance. On other paths the software must use explicit cache maintenance, non-cacheable mappings, or synchronisation calls so the CPU does not keep reading a stale cached line while the device has already updated memory beneath it.

This matters in systems programming because the word "coherence" suddenly has two scopes:

coherence between CPU cores
coherence between CPUs and DMA-capable devices

Kernel APIs such as DMA mapping functions exist partly to make these rules explicit. The CPU-core MESI story is only one part of the broader shared-memory reality of a machine.

Directory Home Slices And Mesh Fabrics Matter In Modern Servers

Large server CPUs no longer look like one simple bus with a few cores hanging off it. They often use mesh fabrics, distributed LLC slices, and home-agent logic deciding where a given line's directory information resides.

That means the cost of touching a line can depend on more than just "same core" or "different core". It can also depend on:

which LLC slice owns the address
where the requesting core sits relative to that slice
whether another socket currently owns the dirty copy
how congested the fabric is

Most programmers do not need to know the exact map for a specific Sapphire Rapids or EPYC generation. It is still useful to know that the coherence path is a real network inside the processor package. That is one reason some shared-memory bottlenecks scale badly as core count rises. The protocol traffic is now contending on an on-die transport, not just on local tag checks inside one core.

Why Benchmarks Change When Threads Migrate

A benchmark that looks stable with pinned threads can become noisy when the scheduler migrates threads across cores. Migration changes:

which private caches still contain relevant lines
which socket owns recent dirty data
where future invalidations have to travel

A lock or shared counter that was bouncing within one core cluster may now bounce across a wider topology. This is one reason production latency regressions sometimes appear after innocuous scheduler or container-placement changes. The algorithm is unchanged. The ownership path for its hot lines is not.

For diagnosis it often helps to compare:

taskset -c 4,5 ./bench
taskset -c 4,28 ./bench

The first might keep communication on nearby cores. The second might cross a NUMA boundary or a more distant part of the mesh. A large difference strongly suggests coherence or locality effects rather than pure instruction throughput.
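The same comparison can be made from inside a program by pinning two threads and bouncing one line between them. The sketch below assumes Linux with glibc for `pthread_setaffinity_np`, and on other platforms the pinning helper simply reports failure. The function names, the 64-byte alignment, and the spin threshold are illustrative choices, and the number it returns is a rough indicator rather than a calibrated measurement.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#if defined(__linux__)
#include <pthread.h>
#include <sched.h>
#endif

// Pins the calling thread to `cpu`. A negative value means "do not pin".
// Returns false if pinning is unsupported here or the call fails.
bool pin_current_thread(int cpu) {
    if (cpu < 0) return true;
#if defined(__linux__)
    if (cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
#else
    return false;
#endif
}

// Spins until `turn` equals `wanted`, yielding occasionally so the demo
// still makes progress when both threads share one core.
void wait_for(const std::atomic<int>& turn, int wanted) {
    int spins = 0;
    while (turn.load(std::memory_order_acquire) != wanted) {
        if (++spins == 4096) { std::this_thread::yield(); spins = 0; }
    }
}

// Average nanoseconds per round trip of one line between two threads.
double ping_pong_ns(int cpu_a, int cpu_b, int round_trips) {
    assert(round_trips > 0);
    alignas(64) std::atomic<int> turn{0};  // the single hot line
    std::chrono::steady_clock::duration elapsed{};

    std::thread responder([&] {
        pin_current_thread(cpu_b);
        for (int i = 0; i < round_trips; ++i) {
            wait_for(turn, 1);
            turn.store(0, std::memory_order_release);
        }
    });
    std::thread initiator([&] {
        pin_current_thread(cpu_a);
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < round_trips; ++i) {
            turn.store(1, std::memory_order_release);
            wait_for(turn, 0);
        }
        elapsed = std::chrono::steady_clock::now() - start;
    });

    initiator.join();
    responder.join();
    using ns = std::chrono::duration<double, std::nano>;
    return std::chrono::duration_cast<ns>(elapsed).count() / round_trips;
}
```

Comparing `ping_pong_ns(4, 5, n)` with `ping_pong_ns(4, 28, n)` mirrors the two `taskset` runs. Whether those CPU numbers are near or far from each other depends entirely on the machine's topology.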

A Litmus Test, Seeing Flag And Data In The Wrong Relationship

Consider the classic message-passing mistake with plain non-atomic variables:

// writer
data = 42;
flag = 1;
 
// reader
while (flag == 0) { }
printf("%d\n", data);

Programmers often imagine coherence makes this safe. One line contains flag, another contains data. The writer updates both. The reader waits for one and then reads the other.

The problem is that coherence only ensures each line eventually converges. It does not force the two lines to become visible in the order the programmer intended across all architectures and all compiler optimisations. In C and C++ the plain accesses are also a data race, so the compiler is free to reorder the two stores or hoist the load of flag out of the loop. The reader might observe flag == 1 while still seeing the old data unless the program uses atomics or barriers with the right semantics.

This is not a defect in coherence. It is a misuse of it. The hardware kept both lines coherent individually. The programmer forgot to express the cross-line ordering edge.

Once you grasp this example clearly, a lot of concurrency advice stops feeling arbitrary. Acquire and release are not ceremony. They are the language used to describe publication on top of a line-coherent machine.
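One way to repair the litmus test with explicit barriers is sketched below, assuming C++11 atomics. The flag becomes a relaxed atomic so the compiler cannot hoist its load, and a release fence paired with an acquire fence supplies the cross-line ordering edge. The function name `message_pass_with_fences` is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical repaired litmus test: returns the value of `data` that the
// reader saw after observing the flag.
int message_pass_with_fences() {
    int data = 0;               // still a plain int
    std::atomic<int> flag{0};   // atomic, but accessed with relaxed order
    int seen = -1;

    std::thread reader([&] {
        while (flag.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();
        }
        // Acquire fence: the read of data below cannot move above it.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = data;
    });

    std::thread writer([&] {
        data = 42;
        // Release fence: the write to data above cannot move below it.
        std::atomic_thread_fence(std::memory_order_release);
        flag.store(1, std::memory_order_relaxed);
    });

    writer.join();
    reader.join();
    return seen;
}
```

The fences do not flush or refresh any cache. They only constrain the order in which this thread's accesses to the two lines may become visible or be performed.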

Throughput Scaling Often Stops Where One Shared Line Starts

If you ever see a multi-threaded benchmark scale nicely from one to two threads, modestly to four, and then flatten or regress at eight and beyond, ask whether one shared line sits in the middle of the hot path.

Common culprits:

one global queue head
one allocator lock
one stats counter
one reference count
one cache line containing several independently updated flags

At low thread counts the protocol cost may be tolerable. At higher counts the line becomes the serialising point for the whole workload.

This is one reason serious scalable system design often looks architecturally repetitive:

sharded maps
per-core freelists
hierarchical aggregation
batched publication
readers that work from local snapshots rather than shared mutable state

All of these patterns are, among other things, attempts to stop one cache line from becoming the machine's busiest negotiation object.

What A Coherence-Friendly Design Usually Looks Like

When a concurrent design scales cleanly, it usually has a recognisable hardware profile:

readers outnumber writers on the hottest shared structures
ownership of any one line changes hands infrequently
publication happens through a small number of explicit synchronisation points
write-heavy state is sharded by thread, core, socket, or queue
layout choices keep unrelated hot fields off the same line

That profile does not remove complexity, but it aligns the software with the way coherent caches actually behave. The machine handles many shared readers well. It handles occasional ownership transfer acceptably. It struggles when many peers insist on rewriting one line as fast as they can.

This is why good multi-core engineering often looks like a sequence of separation moves. Separate counters. Separate queues. Separate ownership domains. Reduce the places where all threads must negotiate over one 64-byte object.

Once that principle is visible, performance advice that once sounded folkloric starts to look consistent. Align this struct. Pin these threads. Shard this map. Batch these updates. The common goal in each case is to leave the coherence protocol less work to do in the hottest parts of the program.
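The "align this struct" advice can be checked mechanically. The sketch below contrasts two layouts of the same pair of flags, assuming a 64-byte line. The struct names, the constant, and the helper `on_same_line` are illustrative, not taken from any library.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAssumedLineSize = 64;  // ASSUMPTION: verify per target

// Both flags fit inside one line, so two writers would share ownership of it.
struct alignas(kAssumedLineSize) PackedFlags {
    std::atomic<int> ready{0};
    std::atomic<int> stop{0};
};

// Each flag starts its own line, so two writers never invalidate each other.
struct PaddedFlags {
    alignas(kAssumedLineSize) std::atomic<int> ready{0};
    alignas(kAssumedLineSize) std::atomic<int> stop{0};
};

static_assert(sizeof(PackedFlags) == kAssumedLineSize,
              "packed layout occupies exactly one assumed line");
static_assert(sizeof(PaddedFlags) == 2 * kAssumedLineSize,
              "padded layout occupies two assumed lines");

// True if both addresses fall inside the same assumed cache line.
bool on_same_line(const void* a, const void* b) {
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kAssumedLineSize;
    };
    return line(a) == line(b);
}
```

The padded version spends 128 bytes on two integers. That trade is only worth making for fields that are written often by different threads.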

Per-CPU Data Structures Work Because They Change The Ownership Pattern

Operating systems and high-performance runtimes often solve coherence pressure by giving each CPU or worker its own local copy of hot state. Linux per-CPU counters are a classic example. Instead of one global integer that every core increments, each core updates a local counter that tends to stay owned by that core. A slower aggregation path sums the local values when a global answer is needed.

This looks like extra complexity in software, but the hardware reason is straightforward. The machine is much happier if:

Core 0 keeps writing one line it already owns
Core 1 keeps writing a different line it already owns
a rare reader later sums the values

than if every core repeatedly contends for one shared line.
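A user-space approximation of that pattern gives each worker thread its own padded slot. This is a sketch, not the Linux per-CPU API: the names `PaddedCounter`, `ShardedCounter`, and `sharded_count_demo` are hypothetical, the 64-byte line size is an assumption, and the code relies on C++17 so that `std::vector` honours the over-aligned element type.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kAssumedLineSize = 64;  // ASSUMPTION: verify per target

// One counter per line, so each writer keeps ownership of its own line.
struct alignas(kAssumedLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

class ShardedCounter {
    std::vector<PaddedCounter> slots_;
public:
    explicit ShardedCounter(std::size_t shards) : slots_(shards) {}

    // Hot path: each thread only touches its own shard.
    void add(std::size_t shard, long n) {
        slots_[shard].value.fetch_add(n, std::memory_order_relaxed);
    }

    // Slow path: a rare reader walks every shard and sums them.
    long sum() const {
        long total = 0;
        for (const auto& s : slots_) {
            total += s.value.load(std::memory_order_relaxed);
        }
        return total;
    }
};

// Hypothetical demo: each thread increments only its own shard.
long sharded_count_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    ShardedCounter counter(static_cast<std::size_t>(threads));
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, t, iters] {
            for (int i = 0; i < iters; ++i) {
                counter.add(static_cast<std::size_t>(t), 1);
            }
        });
    }
    for (auto& th : pool) th.join();  // join orders the writes before sum()
    return counter.sum();
}
```

While writers are still running, `sum()` returns only an approximate snapshot. That is usually an acceptable price for statistics, and it is the same trade the per-CPU counter design makes.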

Queue designs often use the same trick in a different shape. A producer may mostly update one tail index and a consumer may mostly update one head index. If those indices are separated onto different lines, the queue avoids needless ping-pong. The payload slots themselves may move through ownership once per item, which is expected. The metadata no longer creates extra fighting on top.

You can think of these designs as respecting the protocol's preferences:

keep write ownership stable
keep readers mostly read-only
aggregate occasionally rather than negotiating constantly

That is not an aesthetic preference. It is a direct response to the fact that coherence is a finite-latency protocol operating on cache lines rather than a free background service.

It also explains why postmortems for scalability regressions often end with surprisingly humble fixes. A field gets padded. A queue index moves. A counter becomes per-core. The algorithm may remain almost identical while the ownership pattern changes enough that the hardware stops spending so much time arbitrating one shared line.

Coherence Literacy Makes Optimisation Less Superstitious

One useful side effect of understanding coherence is that performance work becomes less mystical. Instead of saying "the cache is bad" or "threads fight somehow", you can ask sharper questions:

which exact line is moving
who owns it most of the time
how often does ownership transfer
are readers sharing clean copies or are writers forcing invalidations constantly

Those questions lead to better fixes because they connect the source code, the data layout, and the hardware protocol directly. Multi-core optimisation is still hard, but it stops feeling like folklore once the cache line becomes the main character rather than an invisible implementation detail.

That shift in viewpoint is valuable even when the answer is "this code is fine". It lets an engineer rule coherence out for defensible reasons instead of by guesswork, which is often half the battle in production performance work on large shared-memory systems and high-core-count servers.

Once you see that, a lot of strange behaviour stops being strange. A compact struct can be slower than a padded one. One atomic counter can limit a whole system. A lock that looks trivial in source can become a fabric-wide ownership storm. A flag can become visible before the data it was meant to publish unless the ordering edge is explicit.

Cache coherence is the reason multi-core shared memory works at all. It is also one of the reasons it hurts when software is careless.