Why Branch Prediction Always Gets The Last Iteration Wrong
Try the interactive lab for this articleTake the quiz (6 questions · ~5 min)Every for loop in every compiled language, on every modern processor, mispredicts its final branch. A loop that iterates exactly 100 times, called thousands of times with the same trip count, will still pay a pipeline flush penalty on the 100th iteration. The predictor has seen the branch taken 99 times consecutively and has every reason to believe it will be taken a 100th time. When it falls through instead, 12 to 20 cycles of speculative work are thrown away.
This is not a rare corner case or a theoretical curiosity. It is a structural property of how branch predictors work. The mechanisms that make prediction accurate for 99 out of 100 iterations are the same mechanisms that guarantee a miss on the last one. Understanding why requires walking through the hardware, from the conditional jump instruction itself up through saturating counters, pattern history tables, and the specialised loop predictors that modern CPUs use to try (and sometimes fail) to fix this exact problem.
What A Branch Is At The Hardware Level
A branch, to the CPU, is a conditional change in the program counter. On x86, the canonical example is a JNE (jump if not equal) or JL (jump if less) instruction following a CMP or TEST. On ARM, it is a B.NE or B.LT. In both cases, the processor evaluates a condition code (set by the previous arithmetic or comparison instruction) and either continues to the next sequential instruction or jumps to a target address.
Consider a trivial loop compiled from C:
for (int i = 0; i < 100; i++) {
sum += data[i];
}A typical x86 compiler output (with optimisations) produces something like:
.loop:
add eax, dword [rdi + rcx*4]
inc rcx
cmp rcx, 100
jl .loopThe jl .loop instruction is the branch. On every iteration except the last, rcx is less than 100, the condition is true, and execution jumps back to .loop. On the last iteration, rcx equals 100, the condition is false, and execution falls through to whatever follows the loop.
The branch has two properties the hardware cares about: its direction (taken or not-taken) and its target (the address to jump to if taken). Both must be predicted before the branch is even decoded, because the pipeline has already been fetching and decoding instructions from a predicted path for many cycles by the time the branch result is known.
This is where the Branch Target Buffer (BTB) enters. The BTB is a cache, indexed by the instruction address of the branch, that stores the predicted target address and sometimes a prediction of direction. When the fetch stage encounters an instruction at an address that hits in the BTB, it immediately knows "this is a branch, and last time it jumped to address X". The fetch unit can redirect its program counter to X without waiting for the branch to be decoded or executed, keeping the pipeline full.
On Intel's Golden Cove microarchitecture (Alder Lake P-cores and later), the BTB holds around 12,000 entries across multiple levels. AMD's Zen 4 has a similar capacity in its L1 and L2 BTBs. The BTB is critical infrastructure; without it, the processor would have to wait until an instruction was decoded to even know it was a branch, adding several cycles of dead time on every taken branch.
Static Prediction: The Original Heuristic
Before dynamic predictors existed, and as a fallback when a branch is seen for the first time (a "cold" branch with no BTB entry), processors used static prediction rules. The two most common:
Forward branches are predicted not-taken. A forward branch jumps to a higher address, which typically means an if body that is skipped in the common case. The assumption is that the "happy path" falls through, and the branch is only taken on an exceptional condition.
Backward branches are predicted taken. A backward branch jumps to a lower address, which almost always means a loop back-edge. The assumption is that loops iterate multiple times, so the branch is taken far more often than not.
Intel's P6 microarchitecture (Pentium Pro, 1995) used exactly this scheme as its static fallback. The rule is surprisingly effective as a cold-start heuristic: most loops do iterate multiple times, and most forward branches in well-written code do fall through. Measurements from the era showed static prediction achieving around 60 to 70 percent accuracy on typical workloads, which is vastly better than random but far worse than a trained dynamic predictor.
Modern processors still use static prediction as a first-iteration fallback, though the details are proprietary. Intel has not published the static policy for cores after Skylake, but empirical testing by Agner Fog and others suggests that an "always not-taken" static default is used for branches not present in the BTB, with the dynamic predictor overriding it as soon as the branch has executed once.
The limitation of static prediction for loops is obvious. On the first encounter with a loop, the backward branch is predicted taken, which is correct for iteration 1. By the time the branch has executed a few times, the dynamic predictor takes over and continues predicting taken. The exit iteration is still wrong.
Saturating Counters: Learning Direction With Inertia
The simplest dynamic predictor is a counter associated with each branch. The counter tracks recent behaviour and predicts based on the majority direction. Two designs dominate the literature.
1-bit predictor. A single bit per branch: 0 means predict not-taken, 1 means predict taken. After each execution, the bit is set to the actual outcome. This predictor learns immediately but also forgets immediately. A loop that runs 100 iterations will be mispredicted twice per invocation: once on entry (if the bit was left at "not-taken" from the previous invocation's exit) and once on exit (when the bit is set back to "not-taken"). Two mispredictions per loop invocation is expensive, and the entry misprediction is avoidable.
2-bit saturating counter. Four states, typically encoded as:
| Value | State | Prediction |
|---|---|---|
| 0 | Strongly Not-Taken (SN) | Not-Taken |
| 1 | Weakly Not-Taken (WN) | Not-Taken |
| 2 | Weakly Taken (WT) | Taken |
| 3 | Strongly Taken (ST) | Taken |
On a taken outcome, the counter increments (saturating at 3). On a not-taken outcome, it decrements (saturating at 0). The prediction is the high bit of the counter: 0 or 1 predicts not-taken, 2 or 3 predicts taken.
The 2-bit counter's key advantage over 1-bit is hysteresis. A single contrary outcome does not flip the prediction. After 99 taken outcomes, the counter is firmly at state 3 (strongly taken). The exit iteration's not-taken outcome drops it to 2 (weakly taken), but the prediction is still "taken" for the next invocation. The re-entry misprediction is eliminated. The exit misprediction remains.
This is the trade-off that James E. Smith described in his 1981 paper "A Study of Branch Prediction Strategies". The 2-bit counter was his recommendation, and it has been the standard building block of branch predictors for over four decades. Increasing to 3-bit or 4-bit counters adds almost no accuracy for the extra silicon cost; 2 bits is the engineering sweet spot.
Walking Through A 2-Bit Counter On A 100-Iteration Loop
To make this concrete, trace the state of a 2-bit counter for a function that contains a loop running exactly 100 iterations, called repeatedly.
First invocation (cold start):
The branch has never been seen. The counter initialises to some default state (often weakly not-taken, state 1, though this varies by implementation). Assume state 1.
| Iteration | Counter Before | Prediction | Actual | Correct? | Counter After |
|---|---|---|---|---|---|
| 1 | 1 (WN) | Not-Taken | Taken | No | 2 (WT) |
| 2 | 2 (WT) | Taken | Taken | Yes | 3 (ST) |
| 3 | 3 (ST) | Taken | Taken | Yes | 3 (ST) |
| ... | 3 (ST) | Taken | Taken | Yes | 3 (ST) |
| 99 | 3 (ST) | Taken | Taken | Yes | 3 (ST) |
| 100 (exit) | 3 (ST) | Taken | Not-Taken | No | 2 (WT) |
Two mispredictions on the first invocation: the cold-start entry and the exit.
Second invocation:
| Iteration | Counter Before | Prediction | Actual | Correct? | Counter After |
|---|---|---|---|---|---|
| 1 | 2 (WT) | Taken | Taken | Yes | 3 (ST) |
| 2 | 3 (ST) | Taken | Taken | Yes | 3 (ST) |
| ... | 3 (ST) | Taken | Taken | Yes | 3 (ST) |
| 99 | 3 (ST) | Taken | Taken | Yes | 3 (ST) |
| 100 (exit) | 3 (ST) | Taken | Not-Taken | No | 2 (WT) |
One misprediction: the exit. The entry was correct because the counter was at state 2 (weakly taken), which still predicts taken. This is exactly the hysteresis advantage of 2-bit over 1-bit.
Every subsequent invocation follows the same pattern: 99 correct predictions, 1 misprediction on exit. The counter oscillates between states 2 and 3, never dropping below 2 because only one not-taken outcome occurs per 100 taken outcomes.
The exit misprediction is structural. No amount of running the loop again will make it go away. The predictor has learned that the branch is taken, and it will keep predicting taken, because that is the statistically correct thing to do. The one time the branch is not taken, the predictor is wrong.
For short loops, the cost is proportionally higher. A loop of 4 iterations mispredicts once every 4 iterations, a 25 percent misprediction rate on the loop branch. A loop of 1000 iterations mispredicts once every 1000, a 0.1 percent rate. The absolute cost per loop invocation is the same (one pipeline flush), but the amortised cost per iteration varies enormously.
The Misprediction Penalty: What Happens In The Pipeline
When the branch unit resolves a mispredicted branch, the processor must undo all the speculative work that proceeded down the wrong path. This involves several expensive operations.
Pipeline flush. All instructions fetched and decoded after the mispredicted branch, which have been flowing through the pipeline on the assumption that the branch would go a certain way, must be discarded. On a deep pipeline like Golden Cove (roughly 20 stages from fetch to retire for integer operations, more for complex paths), this can mean discarding dozens of instructions in various stages of execution.
Reorder Buffer (ROB) rollback. The ROB tracks all in-flight instructions in program order, allowing the processor to execute out of order but retire in order. On a misprediction, the ROB must walk back to the mispredicted branch, mark everything after it as invalid, and free the physical registers that were allocated to the discarded instructions. The register alias table (RAT) is restored to its state at the branch point, typically using a checkpoint that was saved when the branch was dispatched.
Fetch redirect. The front end must be redirected to the correct path. If the branch was predicted taken but should have fallen through, the correct path is the next sequential instruction. If it was predicted not-taken but should have been taken, the correct path is the branch target (which the BTB likely has cached). Either way, the instruction cache must start filling the pipeline with new instructions from the correct address.
Pipeline drain. Even after the correct instructions start entering the pipeline, there is a gap while those instructions flow through the various pipeline stages before they start producing results and retiring. During this drain period, the backend may be partially idle (though it can still be finishing execution of correct instructions that were issued before the branch).
The total penalty, measured from when the branch resolves to when the first correct instruction after the branch retires, varies by microarchitecture:
| Microarchitecture | Vendor | Approx. Mispredict Penalty (cycles) |
|---|---|---|
| Golden Cove (Alder Lake P-core) | Intel | ~16 |
| Redwood Cove (Meteor Lake P-core) | Intel | ~16 |
| Zen 4 (Ryzen 7000) | AMD | ~13 |
| Zen 5 (Ryzen 9000) | AMD | ~13 |
| Firestorm (Apple M1 P-core) | Apple | ~14 |
| Avalanche (Apple M2 P-core) | Apple | ~14 |
| Cortex-X4 (ARM) | ARM | ~11 |
These numbers come from Agner Fog's microarchitecture manual, Chips and Cheese measurements, and vendor documentation. The range of 11 to 16 cycles is remarkably narrow across vendors, which reflects the fact that pipeline depth is constrained by similar engineering pressures everywhere: deeper pipelines give higher clock speeds but worse misprediction penalties, and 15 to 20 stages is roughly where the trade-off settles.
At 5 GHz on Golden Cove, 16 cycles is 3.2 nanoseconds. That does not sound like much, but consider a tight loop that runs at one iteration per cycle when predictions are correct. A 100-iteration loop takes ~100 cycles, plus 16 cycles for the exit misprediction: a 16 percent overhead. A 10-iteration loop takes ~10 cycles plus 16 for the misprediction: the misprediction costs more than the loop body. For very short loops, the exit penalty dominates execution time.
Pattern History Tables And Global History Registers
A 2-bit saturating counter per branch is the baseline, but modern predictors are far more sophisticated. The key insight, developed through the late 1980s and 1990s, is that branches are correlated: the outcome of one branch often depends on the outcomes of recently executed branches. Consider:
if (x > 0) { // branch A
// ...
}
if (x > 5) { // branch B
// ...
}If branch A is not-taken (x <= 0), branch B will certainly be not-taken as well. A predictor that tracks branch B independently cannot exploit this correlation. A predictor that knows the recent history of branch outcomes can.
The Global History Register (GHR) is a shift register that records the taken/not-taken outcome of the last N branches executed, regardless of their addresses. Typical widths range from 12 to over 100 bits on modern processors. After each branch resolves, its outcome is shifted into the GHR.
The Pattern History Table (PHT) is an array of 2-bit counters indexed by some combination of the branch address and the GHR. The simplest scheme XORs the branch PC with the GHR and uses the result to index into a large table of 2-bit counters. Different branches with different histories map to different counters, allowing the predictor to learn patterns like "when the last three branches went taken-not-taken-taken, this branch is usually taken".
This combination of global history and per-pattern counters is called a two-level adaptive predictor, and it was the dominant design from the mid-1990s through the mid-2000s.
Two-Level Adaptive Predictors: The Yeh-Patt Taxonomy
Tse-Yu Yeh and Yale Patt systematically categorised two-level predictors in their seminal 1991 and 1992 papers. The taxonomy uses two dimensions:
- How the history is collected: globally (one history register for all branches, "G") or per-address (a separate history register for each branch, "P").
- How the pattern table is indexed: a single global table ("g") or a per-address table ("p").
This yields four configurations:
GAg (Global history, global pattern table). One GHR, one PHT shared by all branches. The XOR of the GHR and the branch address indexes into the PHT. This is the cheapest in silicon and the most prone to aliasing (unrelated branches colliding in the table). The Alpha 21264 used a variant of this.
GAp (Global history, per-address pattern tables). One GHR, but a separate PHT for each branch address (or a partitioned table indexed by branch address). Reduces aliasing at the cost of more storage.
PAg (Per-address history, global pattern table). A separate history register per branch, but a shared PHT. The per-branch history captures the pattern of that specific branch (e.g. "this branch alternates taken-not-taken"), but the shared table re-introduces aliasing.
PAp (Per-address history, per-address pattern tables). A separate history register and a separate PHT for each branch. Maximum accuracy, maximum silicon cost. Impractical at scale.
For loop exit prediction, the per-address history schemes (PAg, PAp) have a theoretical advantage: the history register for the loop branch would contain a string of N-1 taken bits followed by a not-taken bit, and the PHT could learn to recognise the specific pattern that precedes the exit. In practice, the history length must be at least as long as the loop count for this to work, which means that a PAg predictor with a 16-bit history can only learn the exit pattern of loops shorter than 16 iterations.
The Alpha 21264's predictor used a hybrid approach: a local (per-address) predictor with 10-bit local history tables combined with a global predictor, with a chooser that selected between them on a per-branch basis. This was state of the art in 1998.
TAGE: The Dominant Modern Design
TAGE (TAgged GEometric history length) predictors, developed by Andre Seznec and published in 2006, are the foundation of branch prediction in every high-performance processor shipping today. Intel adopted TAGE variants starting around Haswell (2013). AMD uses TAGE in Zen and later. Apple's cores use a TAGE-like structure.
The TAGE predictor addresses a critical weakness of fixed-history-length predictors: different branches need different amounts of history to predict well. A simple biased branch (almost always taken) needs zero history. A branch that alternates every 4 iterations needs 4 bits. A branch correlated with something 50 branches back needs 50 bits. A fixed-length history is either too short for some branches or wastefully long for others.
TAGE uses multiple tables, each indexed with a different history length. The history lengths increase geometrically (e.g. 5, 8, 13, 21, 34, 55, 91, 148 for an 8-component TAGE). Each table entry contains:
- A tag (partial match of the hashed branch address + history), used to verify that the entry actually belongs to this branch with this history.
- A prediction counter (typically 3 bits), giving the predicted direction.
- A useful counter (typically 2 bits), indicating how useful this entry has been (used for replacement decisions).
On a prediction, all tables are looked up in parallel. The predictor uses the result from the table with the longest matching history (the longest history that has a tag hit). If no table matches, a base predictor (a simple bimodal table of 2-bit counters) provides the default prediction.
On an update, the matching table's counter is updated with the actual outcome. If the prediction was wrong, the predictor may allocate a new entry in a table with a longer history, effectively "promoting" the branch to a longer history context in hopes that the additional history bits will disambiguate the pattern.
For loop exit prediction, TAGE's geometric history lengths help if the loop count is small enough to fit within one of the history components. A loop of 13 iterations, where the branch pattern is "taken 12 times then not-taken", would be captured by a TAGE component with history length 13 or more. The table entry for this specific 13-bit history pattern would learn to predict not-taken. This is a real improvement over fixed-history predictors.
The limitation is that TAGE's longest history component is finite. A loop of 200 iterations, when the longest component tracks 148 bits of history, cannot be fully distinguished by TAGE. The last 148 bits of history before the exit look identical to the last 148 bits of history during a mid-loop iteration (they are all "taken"). TAGE falls back to the shorter components or the base predictor, which all predict "taken". The exit is mispredicted.
Even for loops within the TAGE history range, there is a subtlety: TAGE must have seen the exit pattern enough times to train the relevant entry. On the first few invocations of the loop, the long-history entry may not yet exist or may not be fully trained. TAGE learns, but it learns progressively, and the learning cost is paid in mispredictions.
Loop Predictors: Hardware That Counts Iterations
Processor architects recognised early that loops are special. The exit pattern (N-1 taken, 1 not-taken, repeat) is perfectly regular and perfectly predictable if the hardware simply counts iterations. This led to dedicated loop predictor hardware, first described in academic work in the late 1990s and shipped in commercial processors since the early 2000s.
A loop predictor sits alongside the main predictor (TAGE or otherwise) and contains a small table of entries, each storing:
- The branch address.
- A trip count register (the learned loop iteration count).
- A current iteration counter (counting how many times the branch has been taken in the current loop invocation).
- A confidence or valid bit, indicating whether the trip count is trusted.
The predictor works as follows:
- When a branch that looks like a loop back-edge (backward branch that is repeatedly taken) is detected, an entry is allocated in the loop predictor table.
- On each taken outcome, the current iteration counter increments.
- When the branch is finally not-taken (the loop exits), the trip count register is set to the value of the current iteration counter, and the current counter resets.
- On subsequent invocations, the loop predictor predicts "taken" for iterations 1 through N-1 and "not-taken" on iteration N, where N is the learned trip count.
- When the loop predictor fires and overrides the base predictor, the exit is predicted correctly.
Intel's Haswell, Skylake, and later cores include a loop predictor as part of their front-end prediction pipeline. It is described in Intel's optimisation manuals and has been empirically confirmed through microbenchmarks. AMD's Zen cores have equivalent functionality. Apple's Firestorm cores also demonstrate loop-exit prediction accuracy consistent with a trip-count-based predictor.
When the loop predictor works, the exit misprediction disappears entirely. A 100-iteration loop, called thousands of times with the same trip count, suffers zero mispredictions once the loop predictor has learned the count. This is the hardware solution to the structural problem described in the saturating counter section.
When The Loop Predictor Fails
The loop predictor is not a universal fix. Several conditions prevent it from working correctly.
Cold starts. The first time a loop is encountered, the trip count is unknown. The loop predictor cannot predict the exit until it has observed at least one complete invocation. The first invocation always mispredicts the exit (via the base predictor). The second invocation may still mispredict if the predictor requires two matching trip counts before trusting the learned value (a common implementation choice to avoid locking in on a fluke).
Variable trip counts. If a loop runs 100 iterations one time and 50 the next, the loop predictor learns the wrong count. It predicts exit at 100, but the loop exits at 50, causing a misprediction. Then it relearns 50, but the next invocation runs 100 again. The predictor thrashes between two counts and mispredicts on almost every invocation. This is worse than having no loop predictor, because the loop predictor is overriding the base predictor with a confidently wrong answer.
Most implementations detect this thrashing and fall back to the base TAGE predictor when the trip count is unstable, but the detection takes a few mispredictions. If the trip count alternates between two values in a regular pattern, some advanced loop predictors can learn the pattern (by tracking two or more trip counts), but this is implementation-specific and not guaranteed.
Short loops. A loop of 2 or 3 iterations may not be recognised as a loop by the loop predictor. The predictor typically requires a minimum trip count (often 4 or more) before it allocates an entry, because very short loops are cheap enough to absorb a misprediction and the loop predictor table is small enough that it should be reserved for loops where the savings matter.
Nested loops. Consider:
for (int i = 0; i < 10; i++) {
for (int j = 0; j < 100; j++) {
sum += data[i][j];
}
}The inner loop branch and the outer loop branch are two different branches at two different addresses. A loop predictor with enough entries can track both. The inner loop's trip count is 100, and the outer loop's trip count is 10. If the loop predictor table is large enough (Intel's has roughly 64 entries), both can be tracked simultaneously.
The complication is that the outer loop's branch is taken once every 100 inner iterations, and not-taken once every 1000 instructions. The predictor must maintain the outer loop's state across the entire execution of the inner loop. If the inner loop evicts the outer loop's entry (because the table is associative and the inner loop's entry maps to the same set), the outer loop loses its learned count and reverts to the base predictor.
Eviction. The loop predictor table is small (tens of entries, not thousands). In code with many loops (a matrix library, a signal processing pipeline), the table can fill up and entries can be evicted. An evicted entry loses its trip count and must relearn, paying the cold-start penalty again.
Branches that are not loop back-edges. The loop predictor only triggers on branches that look like loop back-edges. A hand-written loop that uses forward branches, or a loop transformed by the compiler into a structure that does not resemble a simple back-edge, may not be recognised. Similarly, loops implemented via indirect branches (a computed goto dispatch loop, for example) bypass the loop predictor entirely.
Indirect Branches And Virtual Dispatch
Not all branches are simple taken/not-taken decisions. An indirect branch jumps to an address stored in a register or memory location, and the target can be any one of many possible destinations. The canonical example is a virtual method call in C++ or Java:
obj->process(data); // vtable lookup, indirect callThe compiler generates an indirect call through a function pointer loaded from the object's vtable. Different objects (of different derived classes) have different vtable entries, so the same branch instruction can jump to any one of potentially hundreds of targets.
Standard direction predictors (2-bit counters, TAGE) are irrelevant here; indirect branches are always taken. The problem is predicting the target. The BTB stores the most recent target, which works when the same object type is processed repeatedly (monomorphic dispatch). When the type changes (polymorphic dispatch), the BTB has the wrong target and the processor fetches from the wrong address.
Modern processors include an indirect branch predictor that works similarly to TAGE but predicts targets instead of directions. It uses global history to correlate the indirect branch target with the path that led to it. If the calling code follows a predictable pattern (e.g. always calling type A, then type B, then type C), the indirect predictor can learn the sequence.
This is relevant to loops because many performance-critical loops dispatch on type. An interpreter's main loop, a virtual-method iteration over a polymorphic container, a parser's state machine: all of these have indirect branches inside a loop, and the misprediction cost is compounded by the loop iteration count. A 1000-iteration loop over a heterogeneous container can suffer hundreds of indirect mispredictions if the type sequence is unpredictable, dwarfing the single exit misprediction.
Measuring Branch Prediction With Performance Counters
Modern CPUs expose branch prediction statistics through hardware performance counters. On Linux, the perf tool provides access to these counters.
Consider this test program:
// branch_loop.c
#include <stdio.h>
#include <stdlib.h>
#define ITERATIONS 100
#define INVOCATIONS 10000000
volatile int sink;
int main(void) {
int sum = 0;
for (int inv = 0; inv < INVOCATIONS; inv++) {
for (int i = 0; i < ITERATIONS; i++) {
sum += i;
}
}
sink = sum;
return 0;
}Compile and measure:
gcc -O2 -o branch_loop branch_loop.c
perf stat -e branches,branch-misses ./branch_loopTypical output on a Zen 4 system:
Performance counter stats for './branch_loop':
2,000,234,891 branches
10,014,327 branch-misses # 0.50% of all branches
0.487234127 seconds time elapsedThe numbers tell a clear story. There are roughly 2 billion branches: the inner loop branch executes 100 times per invocation across 10 million invocations (1 billion), plus the outer loop branch (10 million), plus function prologue/epilogue branches, giving roughly 1.01 billion loop-related branches. The measured count is higher because of libc startup, the outer loop itself, and other overhead.
The branch misses, roughly 10 million, correspond almost exactly to one misprediction per invocation of the inner loop. The inner loop runs 10 million times, and each invocation mispredicts its exit once. The 0.50% miss rate looks low in aggregate, but the absolute cost is 10 million multiplied by roughly 13 cycles (on Zen 4), which is 130 million cycles, about 40 ms at 3.2 GHz. That is roughly 8 percent of the total execution time.
If we change the inner loop to 10 iterations instead of 100:
Performance counter stats for './branch_loop':
210,089,445 branches
10,011,892 branch-misses # 4.77% of all branches
0.091827341 seconds time elapsedThe total branches dropped by a factor of 10 (fewer inner iterations), but the branch misses stayed at 10 million because there are still 10 million loop exits. The miss rate climbed to 4.77% and the misprediction penalty is now a larger fraction of total execution time.
If we set iterations to 4:
Performance counter stats for './branch_loop':
90,031,224 branches
9,998,741 branch-misses # 11.11% of all branches
0.048512893 seconds time elapsedAt 4 iterations per loop, the loop predictor may or may not have kicked in (depending on the specific hardware). Even if it has, short loops sometimes fall below the loop predictor's minimum trip count threshold. The miss rate is now over 11%, and the misprediction penalty is a substantial fraction of the total execution time.
These measurements confirm the theoretical analysis: the misprediction cost per loop invocation is constant (one flush), but the cost per iteration decreases as the loop gets longer.
The Cost In Real Workloads
The exit misprediction matters most in code that executes many short loops or makes branch-heavy decisions inside loops.
Sorting. Comparison-based sorting algorithms (quicksort, merge sort, timsort) are branch prediction nightmares. Every comparison is a branch, and the outcome depends on the data. A quicksort partition loop compares each element to the pivot and branches on the result. If the data is random, each comparison is approximately 50/50 taken/not-taken, which no predictor can do better than chance on. The misprediction rate on a quicksort of random data is typically 10 to 15 percent, and misprediction penalties can dominate the sorting time for in-cache datasets.
The loop exit misprediction is a secondary concern here (the partition loop exit is one miss among many data-dependent misses), but it compounds the problem. Every partition call is a short loop (the partition size halves recursively), so the exit miss rate per iteration is high.
Parsers. A hand-written parser or lexer often has a tight loop that scans characters and branches on character class:
while (p < end && is_digit(*p)) {
value = value * 10 + (*p - '0');
p++;
}This loop has two branches: the bounds check (p < end) and the character class check (is_digit(*p)). The character class check is data-dependent and mispredicts on the first non-digit character. The bounds check mispredicts on the last iteration if the string is shorter than the buffer. In a parser processing thousands of tokens, these mispredictions accumulate.
Interpreters. The main loop of a bytecode interpreter dispatches on the opcode of each instruction, typically via a switch statement or computed goto. Each dispatch is an indirect branch with a data-dependent target. The misprediction rate depends on how predictable the opcode sequence is, and for general-purpose bytecode it tends to be 5 to 15 percent. This is the single largest performance bottleneck in interpreters, more so than the actual operation execution. Techniques like threaded code and direct threading exist specifically to make the indirect branches more predictable by giving each opcode its own branch instruction (with its own BTB entry and history) rather than sharing a single dispatch branch.
Search and comparison. Binary search is another branch prediction adversary. Each comparison in a binary search is approximately 50/50 for a random query, and the loop is short (log2(N) iterations for N elements). A binary search in a sorted array of 1024 elements takes 10 iterations, each with a roughly unpredictable comparison branch plus the loop exit branch. The total misprediction cost can exceed the total cache miss cost for in-cache datasets.
Branchless Programming: Removing The Problem
If branch mispredictions are expensive, one approach is to eliminate the branches. This is called branchless programming, and it uses conditional moves, arithmetic tricks, or SIMD instructions to compute results without branching.
The simplest example is a conditional increment:
// Branching version
if (x > threshold) {
count++;
}
// Branchless version
count += (x > threshold);The branching version compiles to a CMP followed by a JLE (jump if less or equal) that skips the increment. The branchless version compiles to a CMP followed by a SETG (set byte if greater) and an ADD. The branchless version has no branch to mispredict, and its execution time is constant regardless of the data pattern. The cost is that it always executes the ADD, even when the condition is false, while the branching version can skip it.
The conditional move instruction (CMOV on x86, CSEL on ARM) is another branchless primitive:
// Branching
if (a > b) {
max = a;
} else {
max = b;
}
// Branchless (using CMOV)
max = (a > b) ? a : b; // Compiler often emits CMOV for thisCompilers are generally good at emitting CMOV for simple ternary expressions, but they are conservative about it. A CMOV introduces a data dependency (the result depends on the comparison, which depends on both inputs), while a well-predicted branch allows the processor to speculatively execute only the taken path with no dependency stall. For highly predictable branches (e.g. a null check that is almost never null), CMOV is slower than the branch because the branch is predicted correctly and the speculative execution hides the latency.
The rule of thumb, confirmed by Agner Fog's measurements: branchless code wins when the branch misprediction rate exceeds about 5 to 10 percent. Below that, the predicted branch is faster because speculation hides the data dependency. Above that, the constant-time branchless execution wins because it avoids the flush penalty.
For loop exit specifically, branchless techniques are usually not applicable. The loop control branch is fundamental to the loop structure; you cannot "branchlessly" decide whether to iterate again. Techniques like loop unrolling reduce the number of branch tests (by doing multiple iterations per branch), and Duff's Device (a famous C hack) eliminates the loop tail entirely, but the loop exit branch remains.
What does help for loop exit is:
Loop unrolling. If the compiler (or programmer) unrolls the loop by a factor of 4, the loop branch is tested every 4 iterations instead of every iteration. A 100-iteration loop becomes 25 iterations of 4 operations each. The exit misprediction still happens once, but the amortised cost per original iteration is reduced by 4x. GCC's -funroll-loops flag enables this, and LLVM applies it automatically at higher optimisation levels.
Software pipelining. On architectures with explicit pipeline control (like Itanium's EPIC, or to some extent through manual scheduling on ARM), the compiler can arrange the loop so that the exit branch is resolved earlier relative to the useful work, overlapping the branch resolution with computation and reducing the effective penalty.
Vectorisation. SIMD (Single Instruction, Multiple Data) instructions process multiple elements per instruction, eliminating per-element branches entirely. A loop that sums an array of 100 integers can use AVX-512 to process 16 integers at a time, reducing the loop from 100 iterations to 7 (with a scalar remainder). The exit misprediction happens once in 7 iterations instead of once in 100, but each iteration does 16x the work, so the amortised cost is negligible.
Historical Context: Without Prediction, Every Branch Is A Bubble
To appreciate what branch prediction does, consider what happens without it. On a simple in-order pipeline with no prediction, every branch causes a pipeline stall (a "bubble") while the branch condition is evaluated and the correct path is determined. On a 5-stage classic RISC pipeline (fetch, decode, execute, memory, writeback), a branch resolved in the execute stage means a 2-cycle bubble on every branch (the fetch and decode stages are wasted while waiting for the branch resolution).
Early RISC processors dealt with this using a branch delay slot: the instruction after a branch is always executed, regardless of whether the branch is taken. The compiler fills this slot with a useful instruction that needs to execute on both paths (or a NOP if no such instruction exists). MIPS, SPARC, and PA-RISC all used delay slots. This hides 1 cycle of the branch penalty but is a permanent architectural wart that complicates everything from compiler code generation to binary translation to speculative execution.
On a deep out-of-order pipeline (15 to 20 stages), a branch resolved in the execute stage would cause a bubble of 10 to 15 cycles on every branch, predicted or not. Without prediction, a program where 20 percent of instructions are branches (typical for compiled code) would spend more time in branch bubbles than in useful execution. A 5 GHz processor would effectively run at the throughput of a 1 GHz processor. Branch prediction is not an optimisation; it is a structural requirement for high-frequency, deep-pipeline designs to achieve anything close to their theoretical throughput.
The accuracy numbers bear this out. Modern TAGE predictors achieve 95 to 99 percent accuracy on general workloads (SPEC CPU benchmarks are typically 97 to 98 percent). At 97 percent accuracy and a 16-cycle penalty, the average cost per branch is 0.97 * 0 + 0.03 * 16 = 0.48 cycles. Without prediction, every branch would cost the full pipeline depth. The predictor turns a 15-cycle-per-branch cost into a 0.5-cycle-per-branch cost, a 30x improvement.
The loop exit misprediction is the residual cost that even 99 percent accuracy cannot eliminate. For a 100-iteration loop, the predictor is 99 percent accurate (99 correct, 1 wrong), which is excellent. For a 4-iteration loop, it is 75 percent accurate on the loop branch (3 correct, 1 wrong), which is mediocre. The loop predictor hardware exists specifically to close this gap for the common case of fixed-trip-count loops.
What The Predictor Cannot Know
There is a philosophical point buried in the engineering. A branch predictor is a pattern-matching machine that observes past behaviour and extrapolates. The loop branch's past behaviour is "taken, taken, taken, taken, ...". The correct extrapolation from this pattern is "taken". The predictor is doing the statistically rational thing, and it is wrong.
The predictor cannot know that this iteration is the last one, because that information is not encoded in the branch's history. It is encoded in the comparison result (i < 100), which has not been computed yet at the time the prediction is needed. The branch's history says "always taken". The comparison result, which will be available 10 to 15 cycles from now, says "not taken this time". The predictor has to guess now, and the rational guess, given the available information, is wrong.
The loop predictor solves this by adding a new kind of information: a trip count. It no longer asks "what did this branch do recently?" but rather "how many times has this branch been taken consecutively, and when does the pattern break?". This is a different computational model, one that counts rather than pattern-matches, and it works precisely because loops have counting structure that pattern history cannot capture efficiently.
This is why the loop predictor is a separate piece of hardware rather than an extension of TAGE. TAGE's geometric history lengths can approximate counting for short loops (by matching very long history patterns), but counting is not what TAGE was designed to do. A dedicated counter is simpler, cheaper, and more accurate for the specific case it targets.
The trade-off is generality versus specialisation. TAGE handles an enormous variety of branch patterns with a single unified mechanism. The loop predictor handles one specific pattern (fixed-count loops) with a specialised mechanism. The overall predictor combines both, using the loop predictor when it matches and falling back to TAGE otherwise. The result is a predictor that is excellent on loops, excellent on correlated branches, and mediocre only on branches that are genuinely data-dependent and unpredictable.
Putting It Together: The Anatomy Of A Loop Misprediction
Walk through the complete sequence of events for a loop exit misprediction on a modern out-of-order core (using Golden Cove as the reference).
Cycle 0. The fetch unit reads the loop branch instruction from the instruction cache. The branch address hits in the BTB, which provides a predicted target (the loop top) and a predicted direction. The TAGE predictor, indexed by the branch address and the global history register, provides "taken" (the 3-bit counter in the longest matching TAGE component is saturated toward taken). The loop predictor, if it has learned the trip count and the current iteration counter has not reached the trip count, also provides "taken". The fetch unit redirects to the loop top.
Cycles 1 to 5. Instructions from the next loop iteration are fetched and decoded. They enter the allocation stage, receive physical registers from the rename unit, and are dispatched to the reservation stations. All of this proceeds speculatively, on the assumption that the branch will indeed be taken.
Cycles 6 to 10. The comparison instruction (cmp rcx, 100) executes in the integer ALU. The result is written to the flags register (or its renamed physical equivalent). The branch instruction reads the flags and determines the actual outcome: not-taken. The branch unit signals a misprediction.
Cycle 11. The misprediction recovery begins. The ROB marks all instructions after the branch as invalid. The register alias table is restored from the branch checkpoint. The fetch unit is redirected to the fall-through address (the instruction after the loop).
Cycles 12 to 16. The pipeline drains. Instructions from the correct path begin flowing through the front end, but the first correct instruction will not reach the execution stage for several more cycles.
Cycle ~27. The first correct post-loop instruction retires. The total penalty from the branch's perspective is approximately 16 cycles (from the misprediction detection to useful retirement of correct-path instructions).
During those 16 cycles, the back end may still be finishing execution of correct instructions that were in the pipeline before the branch (from the last correct loop iteration). The processor does not go completely idle; it is doing useful work on those older instructions while the new correct-path instructions fill the front end. The effective penalty, in terms of lost throughput, depends on how much useful work was already in flight.
On a loop predictor-equipped core where the trip count is learned, cycle 0 plays out differently. The loop predictor's iteration counter reaches the trip count, and it overrides the TAGE prediction with "not-taken". The fetch unit redirects to the fall-through address. No misprediction occurs. The loop runs at full throughput, with zero penalty on exit. This is the ideal case, and it is what happens for the majority of fixed-count loops in steady-state execution.
Summary Of When You Pay And When You Do Not
| Scenario | Exit Misprediction? | Why |
|---|---|---|
| Fixed trip count, warm loop predictor | No | Loop predictor counts iterations and predicts exit |
| Fixed trip count, cold loop predictor (first 1-2 invocations) | Yes | Loop predictor has not yet learned the count |
| Variable trip count | Usually yes | Loop predictor cannot lock on a stable count |
| Trip count within TAGE history range, well-trained | Sometimes no | TAGE may learn the exit pattern for short loops |
| Trip count exceeds TAGE history range | Yes | TAGE cannot distinguish the exit iteration from mid-loop |
| Very short loop (2-3 iterations) | Often yes | May fall below loop predictor's minimum |
| Nested loops, outer loop | Depends | Works if loop predictor table has capacity |
| Loop with indirect branch inside | Exit: depends; indirect: separate problem | Indirect mispredictions inside the loop are distinct |
The loop exit misprediction is not a fixed, unavoidable cost in all cases. On a modern core with a loop predictor, a well-behaved fixed-count loop that runs many times will eventually converge to zero exit mispredictions. The cost is real for cold loops, variable-count loops, and loops that fall outside the loop predictor's capabilities. Understanding which category your hot loops fall into, and measuring with perf stat, is the first step toward knowing whether branch prediction is a bottleneck worth addressing.
The branch predictor is, in the end, an engineering marvel doing the best it can with incomplete information. It sees the past and extrapolates the future. For 99 out of 100 iterations, that extrapolation is correct. For the 100th, it is wrong, and 16 cycles of work are thrown away. Whether that matters depends on how many times per second you execute that 100th iteration, and whether you have arranged your code so that the hardware's specialised loop counter can do its job.