Why Your CPU Has So Many Execution Units It Will Never Fully Use

Open the microarchitecture diagram of any modern high-performance CPU core and count the execution units. Intel's Golden Cove (the performance core in 12th-gen Alder Lake, carried into 13th-gen Raptor Lake with minor changes as Raptor Cove) has twelve execution ports feeding a mix of integer ALUs, vector units, load and store pipelines, and branch units. AMD's Zen 4 has a similar count. Apple's Firestorm core (the performance core of the A14 and M1 family) has even more, with its eight-wide decode feeding into a forest of functional units that no single-threaded workload can keep fully occupied.

Now measure what fraction of those units are active on any given cycle during a typical workload. Run a database query, compile a large C++ project, render a web page, decompress a video stream. The utilisation numbers are consistently underwhelming. On most workloads, the average instructions-per-cycle (IPC) is between 1.5 and 3.5 on a machine that can theoretically retire 6 instructions per cycle. Half the execution units sit idle on a typical cycle. Some of them are idle most of the time.

This looks like an extravagance. Why build twelve execution ports if the average program only uses four or five at once? Why spend transistor budget and power on silicon that spends most of its life doing nothing?

The extra units are there because real instruction streams are bursty, because dependencies create unpredictable gaps that can only be filled opportunistically, and because the cost of stalling when you could have executed is far higher than the cost of an idle unit. The width of a superscalar core is not sized for the average case. It is sized for the peaks, and the peaks determine whether the machine can keep its pipeline full through the dependency chains, cache misses, and branch mispredictions that define real program behaviour.

What Execution Units Actually Are

A CPU core is not a single thing that "executes instructions." It is a collection of specialised hardware blocks, each performing a specific class of operations, and a scheduling apparatus that routes instructions to the appropriate block. Each of these blocks is an execution unit, and a modern core has many of them because different instructions need different hardware.

The major categories:

Integer ALUs (Arithmetic Logic Units) handle addition, subtraction, bitwise operations, shifts, rotates, and comparisons on general-purpose registers. These are the workhorses. A modern core typically has three or four integer ALUs because integer arithmetic dominates most code. On Golden Cove, integer ADD, SUB, AND, OR, XOR, and shift operations can execute on ports 0, 1, 5, and 6, giving four-wide integer throughput. On Zen 4, the integer cluster has four ALU pipes.

Integer multiply and divide units are separate from the basic ALUs because multiplication requires a dedicated array multiplier circuit that is larger and slower. Most cores have one or two multiply units. Division is even more expensive: it is typically iterative (not fully pipelined) and takes anywhere from about 10 to 90 cycles depending on operand width and microarchitecture, with recent cores at the low end of that range. Intel Golden Cove has one divider unit shared on port 0.

Address Generation Units (AGUs) compute memory addresses from base registers, index registers, scale factors, and displacements. Every load and store instruction needs an address before it can access the cache. Modern x86 cores have two or three AGUs, often paired with the load and store pipelines. On Zen 4, there are three AGUs, which together can sustain up to three loads or up to two stores per cycle.

Load units read data from the L1 data cache into registers. They take the address from the AGU, perform the cache lookup, handle alignment, and deliver the result. Golden Cove has two load ports (ports 2 and 3), each capable of delivering a 512-bit load per cycle from L1. Zen 4 has three load pipes.

Store units write data from registers to the store buffer and eventually to the L1 cache. Stores are split into two parts in most microarchitectures: the store-address operation (computed by an AGU) and the store-data operation (which moves the register value into the store buffer). Golden Cove has two store-data ports (ports 4 and 9) and two store-address ports (ports 7 and 8).

Floating-point units (FPUs) handle IEEE 754 arithmetic: addition, subtraction, multiplication, division, square root, and fused multiply-add (FMA) on scalar and packed floating-point operands. FP multiply is typically pipelined at 3 to 5 cycles of latency. FP division and square root are multi-cycle and partially pipelined (one new operation can start every few cycles). On Golden Cove, FMA operations execute on ports 0 and 1, giving two FMAs per cycle on 256-bit (AVX2) or 512-bit (AVX-512) vectors.

SIMD/vector units are often the same physical hardware as the FPU, handling both floating-point and integer vector operations. A 256-bit SIMD unit can process eight 32-bit integers or four 64-bit doubles in a single operation. The distinction between "FPU" and "SIMD unit" is blurry on modern cores; what matters is the number of vector execution ports and their width. Golden Cove has two 256-bit FMA units (that can fuse into one 512-bit unit for AVX-512 on supported SKUs). Zen 4 has two 256-bit FMA pipes.

Branch units evaluate branch conditions and resolve branch predictions. When the branch predictor guessed correctly (which it does 95 to 99 percent of the time on well-behaved code), the branch unit confirms the prediction and the pipeline continues at full speed. When the predictor guessed wrong, the branch unit triggers a pipeline flush and redirect, which costs 15 to 20 cycles. Most cores have one or two branch ports. On Golden Cove, branches execute on port 0 and port 6.

Specialised units vary by microarchitecture: AES-NI for encryption, SHA for hashing, CRC, population count, bit manipulation (BMI1/BMI2). Small in area, critical for specific workloads.

The total on Golden Cove is roughly twelve execution ports, each handling some subset of the instruction set. Not every port can do everything. The scheduler's job is to match each decoded instruction to a port that can handle it and that is available on the target cycle.

How Superscalar Dispatch Works

A scalar processor executes one instruction per cycle. A superscalar processor executes multiple instructions per cycle by having multiple execution units and dispatching to them in parallel. The "width" of a superscalar core is the number of instructions it can dispatch (and ideally retire) per cycle.

The pipeline of a modern out-of-order superscalar core has several stages relevant to dispatch:

  1. Fetch: The front end reads instruction bytes from the instruction cache, guided by the branch predictor. Modern x86 cores fetch 16 or 32 bytes per cycle, which translates to roughly 4 to 8 instructions depending on encoding length.

  2. Decode: The fetched bytes are decoded into micro-operations (uops). On x86, a single instruction may decode into one, two, or more uops (a simple register-to-register ADD decodes to one uop; an ADD with a memory source decodes to a load uop fused with an ALU uop; a REP MOVSB decodes to many uops via the microcode sequencer). Golden Cove can decode 6 instructions per cycle. Zen 4 can decode 4 instructions per cycle but also has a uop cache (the Op Cache) that can deliver 9 uops per cycle, bypassing the decoders. Apple's Firestorm decodes 8 instructions per cycle.

  3. Rename/Allocate: Each uop's architectural registers are mapped to physical registers via the register rename table. This step eliminates false dependencies (write-after-write and write-after-read hazards on the same architectural register) by giving each uop a fresh physical register for its destination. The renamed uops are allocated entries in the reorder buffer (ROB) and dispatched to reservation stations.

  4. Schedule/Dispatch: The scheduler (also called the reservation station or issue queue) holds uops that are waiting for their source operands to become available. When all of a uop's inputs are ready and a suitable execution port is free, the scheduler dispatches the uop to the port. Golden Cove can dispatch up to 12 uops per cycle across its 12 ports. Zen 4 can dispatch up to 10 uops per cycle.

  5. Execute: The execution unit performs the operation and writes the result to the physical register file. The result is also broadcast on a bypass network, so any uop waiting on this result in the scheduler can pick it up in the same cycle or the next cycle.

  6. Retire/Commit: Completed uops are retired from the reorder buffer in program order. Retirement makes the architectural state visible and frees physical registers and ROB entries. Golden Cove can retire 8 uops per cycle. Zen 4 retires up to 8 as well.

The dispatch width (uops sent to execution per cycle) is typically wider than the decode and rename width, and it is the decode and rename width that caps sustained throughput. Retirement is made at least as wide so that it never becomes the limiter. This is intentional. The front end delivers instructions in bursts (some cycles it decodes 6 uops, other cycles it stalls on an I-cache fill), and the scheduler accumulates a backlog that can be drained rapidly during a burst of parallelism.

Theoretical IPC vs Real-World IPC

Every microarchitecture has a maximum sustained IPC set by the narrowest stage of its pipeline. For Golden Cove, the theoretical peak is 6 uops per cycle (limited by the 6-wide decode and rename stages, even though retirement can handle 8). For Zen 4, the peak is effectively 6 uops per cycle when the Op Cache is active. For Firestorm, the peak is 8.

Real software never approaches these peaks on a sustained basis. SPEC CPU 2017, the standard benchmark for single-threaded compute performance, typically measures IPCs in the range of 1.5 to 4.5 depending on the benchmark and microarchitecture. Integer benchmarks tend to land at 1.5 to 2.5 IPC. Floating-point benchmarks can reach 3.0 to 4.5 IPC when they are vectorised and the working set fits in cache. But these are benchmarks selected to represent compute-intensive code; real application workloads are often worse.

Here is a concrete measurement you can reproduce. Take a moderately complex C++ program, say, compiling LLVM with clang, and measure its IPC:

perf stat -e instructions,cycles,uops_retired.slots \
  make -j1 -C llvm-project/build

On a Zen 4 system (Ryzen 7950X), a single-threaded clang compilation of a large translation unit typically shows an IPC of about 1.4 to 1.8. On Golden Cove (Core i9-13900K), the same workload lands around 1.6 to 2.0. These are 6-wide machines achieving less than 2 instructions per cycle on average.

Where do the other four slots' worth of capacity go?

The answer is a combination of front-end stalls (instruction cache misses, branch mispredictions, decode bubbles), back-end stalls (execution unit contention, long-latency operations like cache misses or divides), and the most pervasive constraint of all: data dependencies between instructions that force serialisation regardless of how many execution units are available.

Why Real Code Is Full of Dependencies

Consider this tight loop that sums an array of integers:

int sum = 0;
for (int i = 0; i < n; i++) {
    sum += array[i];
}

The compiler turns this into something like:

.loop:
    add     eax, [rdi + rcx*4]   ; sum += array[i]
    inc     rcx                   ; i++
    cmp     rcx, rsi             ; i < n?
    jl      .loop

There are four instructions per iteration. On a 6-wide machine, you might hope to execute multiple iterations in a single cycle. But look at the dependencies:

  • The ADD writes to EAX, and the next iteration's ADD reads EAX. This is a true data dependency (read-after-write). The second ADD cannot execute until the first ADD's result is available, which takes 1 cycle for an integer add.
  • The INC writes to RCX, and the next iteration's INC reads RCX. Same dependency.
  • The CMP reads RCX (from the INC in the same iteration) and the JL reads the flags (from the CMP). Both are true dependencies within the same iteration.

The result is that this loop has a critical path of 1 cycle per iteration, set by the ADD latency on the accumulator. Out-of-order execution can issue the loads, increments, and compares of later iterations early (which is why the L1 load latency stays off the critical path as long as the loads hit), but each ADD into EAX must wait for the previous one. IPC here might reach 3 or 4 (four instructions per one-cycle iteration), but throughput is limited to one array element per cycle, regardless of how many ALUs are available.

A compiler can improve this by unrolling the loop and using multiple accumulators:

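int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
int i = 0;
for (; i + 3 < n; i += 4) {      // main loop: four independent accumulators
    sum0 += array[i];
    sum1 += array[i+1];
    sum2 += array[i+2];
    sum3 += array[i+3];
}
int sum = sum0 + sum1 + sum2 + sum3;
for (; i < n; i++) {             // remainder when n is not a multiple of 4
    sum += array[i];
}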

Now the four ADD instructions in one "super-iteration" are independent of each other (they write to different accumulators), so the machine can dispatch them to different ALUs in the same cycle. The accumulator chain no longer limits the loop; throughput is instead capped by the next-narrowest resource, typically the load ports (each ADD still needs one load, so two or three elements per cycle rather than one). This transformation is the textbook example of extracting instruction-level parallelism, and it illustrates how dependency chains, not execution unit count, determine real performance.

But most code is not a simple reduction loop. Consider a more realistic example: traversing a hash map to look up a key.

uint64_t hash = hash_function(key);
size_t bucket = hash & (table->mask);
Entry *entry = table->buckets[bucket];
while (entry != NULL) {
    if (entry->hash == hash && keys_equal(entry->key, key)) {
        return entry->value;
    }
    entry = entry->next;
}

This code has pointer-chasing dependencies at every step. The address of table->buckets[bucket] depends on the hash computation. The first cache line of entry depends on the bucket load. The comparison depends on loading entry->hash. The next iteration depends on loading entry->next. Each load costs 4 to 5 cycles even when it hits L1, and each is a potential L1 miss (especially if the hash table is large), which means roughly 12 to 15 cycles if it hits L2, 40 or more cycles if it hits L3, and 60 or more nanoseconds if it misses to DRAM. During each of those stalls, the machine has nothing useful to do for this chain of computation. It does not matter if you have twelve execution ports when the only instruction that can make forward progress is waiting for a cache line to arrive.

This is the core problem. Real programs are full of serialising dependencies: pointer chains, hash lookups, tree traversals, conditional branches that depend on loaded values, function calls through virtual dispatch tables. The theoretical parallelism in the instruction stream (the ILP) is much lower than the machine width, and the gap is where the idle execution units live.

Instruction-Level Parallelism: What The CPU Can Extract

Instruction-Level Parallelism (ILP) is a property of the instruction stream, not of the machine. It describes how many instructions from the stream could, in principle, execute simultaneously if there were no machine-width limits and infinite scheduling resources. The machine's job is to extract as much of the available ILP as it can, given its finite resources.

The theoretical ILP of a program depends on its dependency graph. If you draw each instruction as a node and each true data dependency as an edge, the ILP is the average width of the graph when scheduled optimally (the ratio of total instructions to the length of the critical path).

Academic studies have measured the ILP of real programs extensively. The landmark paper is Wall's 1991 study ("Limits of Instruction-Level Parallelism"), which used trace-driven simulation to measure ILP under various idealised assumptions. With a perfect branch predictor, infinite scheduling window, and perfect memory disambiguation, Wall found ILP values ranging from 2 to 7 for integer programs and up to 60 or more for floating-point programs with heavily vectorisable loops. With realistic constraints (finite window, imperfect prediction), integer ILP dropped to 2 to 3 for most programs.

More recent studies, using modern traces and accounting for modern branch predictors and cache hierarchies, have found similar results. Integer workloads like compilers, databases, and web servers show ILP of 2 to 4. Scientific computing and media processing can show ILP of 4 to 8 or higher when the compiler has vectorised the hot loops. But the mix that runs on a typical server or desktop is dominated by the low-ILP integer code.

The CPU extracts ILP through several mechanisms:

Out-of-order execution is the most important. The scheduler looks ahead in the instruction stream (via the reorder buffer) for instructions whose operands are ready and dispatches them regardless of program order. If instruction 10 depends on instruction 9 but instruction 11 is independent, instruction 11 can execute before instruction 10 finishes. The reorder buffer on Golden Cove holds 512 uops, giving the scheduler a window of hundreds of instructions to search for parallelism. Zen 4's ROB holds 320 entries. Firestorm's holds 630. The larger the ROB, the further ahead the machine can look, and the more ILP it can extract from code with long dependency chains separated by independent work.

Register renaming eliminates false dependencies. Consider:

add rax, rbx       ; writes rax
mov rcx, [rdx]     ; independent, reads rdx, writes rcx
add rax, rcx       ; reads rax (from first add) and rcx (from mov)

The MOV writes RCX. Without renaming, a later instruction that also wrote RCX (say, a second load into the same register) would have to wait for the first write and for every reader of the old value: a write-after-write (WAW) and write-after-read (WAR) dependency that serialises otherwise independent work. Renaming maps each write to a unique physical register, so the two writes to RCX go to different physical registers, and the dependency is dissolved. Golden Cove has 280 physical integer registers and 332 physical vector registers, far more than the 16 architectural registers that x86-64 exposes. Zen 4 has 224 integer and 192 vector physical registers.

Memory disambiguation allows loads to execute before earlier stores whose addresses are not yet known. The load-store unit speculatively assumes that the load does not alias with any pending store and executes it immediately. If the store was to the same address (detected later when the store address is computed), the load is replayed. This speculation is correct most of the time and allows loads to bypass the store queue, which is critical because loads are on the critical path of most code.

Branch prediction allows the machine to fetch, decode, and speculatively execute instructions past a branch before the branch condition is resolved. A modern predictor (like Golden Cove's or Zen 4's TAGE-based predictor) achieves 97 to 99 percent accuracy on typical code, which means speculative execution beyond branches is almost always productive.

Together, these mechanisms transform a sequential instruction stream into a partially parallel one. But they cannot create parallelism that does not exist in the program. If every instruction depends on the previous one, no amount of out-of-order machinery will help.

The Reorder Buffer, Reservation Stations, And The Scheduling Dance

Understanding why the ROB size matters requires understanding the scheduling loop.

When a uop is decoded and renamed, it enters the ROB (which tracks it for in-order retirement) and is dispatched to a reservation station (also called a scheduler queue) associated with its execution port or cluster. The reservation station holds the uop along with the identifiers of its source operands. When a source operand is produced by an earlier uop, the producing uop's physical register number is recorded, and the reservation station monitors the bypass network for that result. The moment all sources are available, the uop is eligible for dispatch.

The key constraint is the ROB size, because it limits how far ahead the machine can look. Suppose a load misses all the way to DRAM, which on a 4 GHz core costs roughly 250 to 300 cycles. During that time, the machine will continue to fetch, decode, and rename new instructions. Each one takes a ROB entry. With a 320-entry ROB (Zen 4), the machine can look about 320 instructions ahead. If there are independent instructions within that window that do not depend on the stalled load, the machine can execute them, keeping some execution units busy during the stall. If the independent instructions run out before the 320-entry window is exhausted, or if the ROB fills up and no more instructions can enter, the machine stalls completely.

This is called the "window of extraction." The ROB size sets the maximum window, and the actual useful window depends on how much independent work exists in the instruction stream. For a tight pointer-chasing loop, there may be no independent work at all, and a 320-entry ROB is no better than a 32-entry one. For a loop that interleaves several independent computations (say, processing four separate input streams), a larger ROB captures more of the available parallelism.

Reservation station sizing matters similarly. Golden Cove has a unified scheduler with 97 entries for integer operations and separate schedulers for FP/vector and memory operations. If the scheduler fills up because many uops are waiting on a long-latency result, new uops cannot enter even if ROB space is available. In practice, the scheduler is rarely the bottleneck on modern cores (it is sized to be large enough for typical stall durations), but it can become one under extreme memory latency or when many instructions share a dependency on a single long-latency result.

Apple's Firestorm core is instructive. It has a 630-entry ROB and extremely large scheduler queues, letting it look much further ahead than Intel or AMD cores. This is one reason Apple achieves high IPC despite lower clocks (3.2 to 3.5 GHz versus 5.0 to 5.8 GHz for Intel/AMD). The wider window extracts more ILP, and the lower clock cuts power consumption.

Here is how you can observe the scheduling window's effect in practice. Consider two versions of a memory-bound loop:

// Version A: One dependent load chain (each address comes from the previous load)
for (int i = 0; i < n; i++) {
    idx = data[idx];           // random access, likely cache miss
    result += process(idx);
}
 
// Version B: Four interleaved independent chains
for (int i = 0; i < n; i += 4) {
    i0 = data[i0];
    i1 = data[i1];
    i2 = data[i2];
    i3 = data[i3];
    result += process(i0) + process(i1) + process(i2) + process(i3);
}

Version B runs significantly faster because the four independent load chains allow the memory subsystem to have multiple cache misses in flight simultaneously (memory-level parallelism, or MLP). In version A, the address of each load is the result of the previous one, so the misses are strictly sequential no matter how large the ROB is. In version B, the ROB holds all four chains, and the scheduler issues the four loads back to back rather than waiting for each miss to resolve before starting the next. When the loads miss to DRAM, version B can approach four times the throughput of version A, because four misses overlap where version A exposes one.

Diminishing Returns: Amdahl's Law Applied To ILP

If a 6-wide machine achieves IPC of 2 on typical code, why not build a 12-wide machine and achieve IPC of 4? Because the relationship between machine width and achieved IPC is not linear. It follows a curve of diminishing returns that is structurally similar to Amdahl's law.

Consider the instruction stream as a mix of parallelisable segments (where multiple instructions can execute simultaneously) and serial segments (where instructions are chained by dependencies and must execute one at a time). Doubling the machine width can, at most, double the throughput of the parallel segments but does nothing for the serial segments. If 50 percent of execution time is in serial segments (dependency chains), then doubling width improves overall throughput by at most 33 percent (the parallel half takes half the time, but the serial half is unchanged).

The empirical data bears this out. Researchers have modelled IPC scaling as a function of dispatch width and found diminishing returns setting in sharply beyond 4 to 6 wide dispatch. A 2-wide machine captures most of the ILP in integer code (IPC around 1.5). A 4-wide machine adds another 0.5 to 1.0 IPC. Going from 4-wide to 6-wide adds perhaps 0.3 to 0.5 more. Going from 6-wide to 8-wide adds less still. The curve flattens because the remaining unexploited ILP is harder to extract: it requires looking further ahead (larger ROB), predicting branches more accurately, resolving memory ambiguity faster, and all of these have their own diminishing returns.

This is why, through the Golden Cove and Zen 4 generation, no major x86 vendor went beyond 6-wide decode. The cost of widening the decode, rename, and retirement logic scales roughly quadratically (because the rename stage must check every new uop against every other new uop in the same cycle for dependencies), while the IPC benefit scales sub-linearly. At some point the transistor budget is better spent on other things: larger caches, better branch predictors, wider SIMD units, or more cores.

Apple is the notable exception. Firestorm decodes 8 instructions per cycle, and the M-series cores consistently achieve higher IPC than their x86 competitors. But Apple has advantages that Intel and AMD do not. ARM instructions are fixed-length (4 bytes), which makes decode dramatically simpler than x86 (where instructions vary from 1 to 15 bytes and require complex length-decoding logic). Apple also controls the entire software stack, so they can optimise the compiler and OS to expose more ILP. And Apple targets a power-efficient design point where frequency is lower, which means widening the core is a better trade-off than pushing clock speed. We will return to Apple later.

Measuring Execution Unit Utilisation With Performance Counters

Modern CPUs expose detailed performance monitoring counters (PMCs) that let you observe exactly how the execution back end is being utilised. On Intel, the key counters for understanding port utilisation are the UOPS_DISPATCHED_PORT family:

perf stat -e \
  uops_dispatched_port.port_0,\
  uops_dispatched_port.port_1,\
  uops_dispatched_port.port_2_3,\
  uops_dispatched_port.port_4_9,\
  uops_dispatched_port.port_5,\
  uops_dispatched_port.port_6,\
  uops_dispatched_port.port_7_8 \
  ./my_program

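This shows how many uops were dispatched to each port. The exact event names differ between CPU generations and kernel versions on both vendors, so check perf list for the spelling your system exposes. On AMD, the equivalent counters are less granular but still informative. The de_dis_uops_from_decoder family shows decode throughput, and ex_ret_uops shows retired micro-ops.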

A more useful top-level analysis uses Intel's Top-Down Microarchitecture Analysis (TMA) framework, which classifies every pipeline slot into one of four categories:

  • Retiring: The slot was used to retire a useful uop. This is the productive fraction.
  • Bad Speculation: The slot was used for a uop that was later squashed due to a branch misprediction or other misspeculation.
  • Front-End Bound: The slot was empty because the front end could not deliver a uop (I-cache miss, decode stall, ITLB miss).
  • Back-End Bound: The slot was empty because the back end could not accept or execute a uop (execution port busy, cache miss stall, long-latency operation).

On recent Intel hardware, you can get TMA level 1 directly from perf stat (the -a flag counts system-wide, so run it on an otherwise idle machine):

perf stat --topdown -a -- ./my_program

A typical result for a compiler workload might look like:

retiring         25.3%
bad speculation   8.1%
frontend bound   22.4%
backend bound    44.2%

This tells you that out of all potential pipeline slots, only 25 percent were productively used, 8 percent were wasted on mispredicted paths, 22 percent were lost to front-end stalls, and 44 percent were lost to back-end stalls. The back-end bound fraction can be further decomposed into memory bound (waiting for cache/memory) and core bound (waiting for execution ports), and on most non-numerical workloads, memory bound dominates.

On AMD Zen 4, you can use the amd_core PMU with perf:

perf stat -e ex_ret_uops,ex_ret_brn_misp,ls_dispatch.ld_dispatch,\
ls_dispatch.store_dispatch,cycles \
  ./my_program

This gives you retired uops, mispredicted branches, and load/store dispatch counts, from which you can derive rough utilisation numbers.

The consistent finding across workloads and microarchitectures is that execution port saturation is rare. The bottleneck is almost always either the front end (branch misprediction, I-cache miss) or the memory subsystem (L2/L3 miss, TLB miss), not a shortage of ALUs. Execution units are idle not because there are too many of them, but because the machine cannot feed them fast enough.

Port Pressure: When Execution Units Do Become The Bottleneck

There are workloads where execution port saturation is the primary limiter, and they illustrate why having many ports matters.

Consider a tight SIMD loop that performs a fused multiply-add on dense arrays:

// AVX2 FMA on single-precision floats
for (int i = 0; i < n; i += 8) {
    __m256 a = _mm256_load_ps(&A[i]);
    __m256 b = _mm256_load_ps(&B[i]);
    __m256 c = _mm256_load_ps(&C[i]);
    __m256 result = _mm256_fmadd_ps(a, b, c);
    _mm256_store_ps(&D[i], result);
}

Each iteration produces roughly: 3 load uops, 1 FMA uop, 1 store-address uop, and 1 store-data uop, plus loop control. On Golden Cove, the FMA can only execute on port 0 or port 1 (2 per cycle), and the loads go to ports 2 and 3 (2 per cycle). With three loads per FMA, the load ports saturate first: the loop sustains about two iterations every three cycles, or roughly 5 floats per cycle, well short of the 2 FMAs (16 floats) per cycle that the FMA ports alone could deliver. Either way, the execution ports are now the limiter, not memory latency (assuming the data is in L1 or L2).

You can verify this with Agner Fog's instruction tables (available at agner.org/optimize). Fog meticulously catalogues the latency, throughput, and port assignment of every instruction on every recent microarchitecture. For VFMADD231PS (the AVX2 FMA instruction) on Golden Cove:

  • Latency: 4 cycles
  • Throughput: 0.5 cycles (i.e., 2 per cycle)
  • Ports: 0, 1

For VMOVAPS (256-bit aligned load):

  • Latency: 5 cycles (from L1)
  • Throughput: 0.5 cycles (2 per cycle)
  • Ports: 2, 3

With two loads per cycle and two FMAs per cycle, the steady-state throughput of the loop is limited by whichever resource is exhausted first. If the loop body needs three loads per FMA (as in the A*B+C example above), loads become the bottleneck and one of the two FMA ports sits partially idle. If the loop can be restructured to need fewer loads (for example by reusing a value across multiple FMAs, as in matrix multiplication), the FMA ports become the bottleneck and the loads are under-utilised.

This is exactly the kind of analysis that performance engineers do when optimising numerical kernels. Tools like llvm-mca (the LLVM Machine Code Analyzer) and Intel's now-discontinued IACA (Intel Architecture Code Analyzer) can simulate port pressure for a given assembly snippet. Because llvm-mca expects AT&T syntax by default, the file starts with an .intel_syntax directive:

echo ".intel_syntax noprefix" > kernel.s
echo ".loop:" >> kernel.s
echo "vfmadd231ps ymm0, ymm1, [rdi]" >> kernel.s
echo "vfmadd231ps ymm2, ymm3, [rdi+32]" >> kernel.s
echo "add rdi, 64" >> kernel.s
echo "dec ecx" >> kernel.s
echo "jnz .loop" >> kernel.s
 
llvm-mca -mcpu=alderlake -iterations=100 kernel.s

The output shows per-port pressure and the estimated throughput of the loop body, letting you see exactly which port is the bottleneck and how much headroom exists on the others.

The key observation is that port pressure analysis matters for tight numerical loops, but most code never hits port pressure limits. The surplus ports exist for the rare moments when the instruction mix demands them, and for the common case where multiple ports for the same operation type (like four integer ALUs) let the scheduler avoid contention when several independent integer operations are ready simultaneously.

Bursty Parallelism: Why Average Utilisation Misses The Point

The average utilisation numbers are misleading because they hide the temporal distribution of parallelism. Real instruction streams do not have a steady, uniform level of ILP. They alternate between phases of high parallelism (inside unrolled loops, across independent function calls, during the independent parts of a data structure traversal) and phases of almost zero parallelism (waiting for a cache miss, resolving a long dependency chain, recovering from a branch misprediction).

Think of it like a motorway. The average traffic flow might be 40 percent of capacity, but that does not mean you can narrow it to two lanes. During rush hour, it needs all six lanes. During the night, two lanes would suffice. The extra lanes exist for the peaks, and the peaks are what determine whether traffic jams form.

The same logic applies to execution units. If a burst of 10 independent instructions arrives at the scheduler in a single cycle (because several dependency chains resolved simultaneously, or because the front end caught up after a stall), the machine needs enough ports to dispatch them quickly. If it can only dispatch 4 per cycle, the burst takes 3 cycles to drain, and the instructions at the tail of the burst experience unnecessary latency. That latency propagates: any instruction waiting on the result of a burst-tail instruction is delayed, and the delay cascades.
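The drain-time arithmetic in the paragraph above can be checked in a few lines. This is a minimal sketch that assumes every instruction in the burst is already ready and independent, which is the best case for the scheduler; the function name drain_cycles is illustrative.

```python
import math

def drain_cycles(burst_size, dispatch_width):
    """Cycles needed to dispatch a burst of ready, independent instructions."""
    return math.ceil(burst_size / dispatch_width)

# A burst of 10 ready instructions, as in the example above.
for width in (4, 6, 12):
    print(f"{width}-wide dispatch: {drain_cycles(10, width)} cycle(s) to drain")
```

The instructions at the tail of the burst wait for the full drain time, so each extra cycle here is added latency for whatever depends on them.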

Wider dispatch acts as a shock absorber. It smooths out the bursts so that when parallelism spikes, the machine can capitalise on it immediately rather than queuing instructions in the scheduler. The benefit is easy to overlook in average utilisation figures (because the average is dominated by the low-parallelism phases), but it shows up in tail latency and in the total execution time.

You can observe this effect empirically. Take a workload with mixed parallelism (a web server processing requests, for example) and compare its execution time on a machine with a wider back end versus a narrower one with the same clock speed. The wider machine will be faster, even though both machines leave most of their ports idle on an average cycle, because the wider machine handles the bursty phases better. At equal clock speed a shorter run time is by definition a higher IPC; what barely moves is the average utilisation. This is consistent with comparisons of Apple's M-series cores (8-wide decode, massive scheduler) against Intel's Golden Cove (6-wide decode) at matched clock speeds, where Apple's IPC advantage tends to be largest on workloads with high but bursty parallelism.

A database query executor illustrates this well. During a hash join, the processor alternates between phases of hash computation (highly parallel, multiple independent hash lanes), probe into the hash table (pointer chasing, serialised), and result materialisation (independent writes to output buffers, parallel again). The hash computation phase might briefly keep 6 of the 12 execution ports busy. The probe phase might use 1 or 2. The materialisation phase might use 3 or 4. The average over the entire query might be 2.5 busy ports per cycle, but the peak demands 6, and having only 4 would create visible stalls during the hash phase.

From Scalar MIPS To 6-Wide Decode: A Brief History

The progression from simple to wide cores mirrors the industry's evolving understanding of ILP and the economics of transistor budgets.

Mid 1980s: Scalar, in-order pipelines. The MIPS R2000 (1985) and its contemporaries executed one instruction per cycle through a simple 5-stage pipeline (fetch, decode, execute, memory, writeback). IPC was at most 1.0, and in practice around 0.5 to 0.8 due to pipeline stalls from branches and load-use hazards. These machines had one of each execution unit: one ALU, one multiplier, one load/store unit.

Late 1980s to early 1990s: Superscalar, in-order. The first superscalar designs (Intel i960, IBM POWER1) could issue two or more instructions per cycle by adding execution units and dispatch logic. The key challenge was detecting which instructions were independent enough to execute simultaneously, which required dependency checking hardware that grew with the square of the issue width.

Mid 1990s: Out-of-order execution. The Intel Pentium Pro (1995) introduced out-of-order execution to the x86 world, with a 40-entry ROB and 3-wide dispatch. The MIPS R10000 (1996) did the same for MIPS. Out-of-order execution was a game-changer: it decoupled the program order from the execution order, allowing the hardware to find and exploit ILP that in-order machines left on the table. IPC jumped from 0.8 to 1.5 or so on typical code.

Late 1990s to early 2000s: The ILP wall. Intel's Pentium 4 (2000) bet on deep pipelines (20 to 31 stages) for clock speed. AMD's K7/K8 (1999 to 2003) bet on IPC with 3-wide decode. The academic consensus formed that integer ILP was limited to 2 to 4, and going wider than 4-wide had diminishing returns. This led to the multi-core era starting around 2005, where transistor budgets shifted from wider cores to more cores.

2010s: Incremental widening. Sandy Bridge (2011) added a uop cache to bypass the 4-wide decoder. Haswell (2013) added a fourth ALU port. Skylake (2015) delivered up to 6 uops per cycle via the uop cache. Zen (2017) matched with 4-wide decode plus its own Op Cache. Each generation extracted another fraction of IPC.

2020s: The modern wide cores. Golden Cove (2021) went to 6-wide decode. Apple's Firestorm (2020) went to 8-wide decode. AMD's Zen 4 (2022) kept 4-wide decode but expanded the Op Cache to deliver 9 uops per cycle. The execution back ends grew to 10 to 12 ports. ROBs expanded to 320 to 630 entries. These are the widest general-purpose cores ever built, and they represent the current frontier of what is practical.

The pattern is clear: each doubling of width costs significantly more transistors and power for a smaller IPC increment. The industry has not stopped going wider, but the pace has slowed, and the gains are measured in single-digit percentages per generation rather than the 50 percent leaps of the 1990s.

Apple M-Series: A Case Study In Going Extra Wide

Apple's CPU cores deserve separate discussion because they represent the most aggressive width scaling in production. The Firestorm performance core (M1, 2020) and its successors (Avalanche in M2, unnamed in M3 and M4) decode 8 ARM instructions per cycle, dispatch to a back end with 13 or more execution ports, and retire up to 8 uops per cycle, backed by a 630-entry ROB (on Firestorm; likely larger on later generations).

Why can Apple go wider than Intel and AMD?

Fixed-length instructions. ARM's A64 instruction set uses fixed 4-byte instructions. Decoding 8 instructions per cycle means reading a 32-byte block and splitting it into 8 equal chunks. The decode logic is simple combinational circuits. x86 instructions vary from 1 to 15 bytes, which means the decoder must first determine where each instruction starts (length decoding), which is an inherently serial process. Intel and AMD work around this with a uop cache that stores pre-decoded uops, bypassing the decoders for hot code. But the decoders still limit cold-code throughput, and the uop cache has finite capacity.

Unified memory architecture. Apple's M-series SoCs use LPDDR that is soldered to the package with very wide buses (up to 512 bits on M3 Max). The CPU, GPU, and NPU share this memory pool, and the CPU's memory subsystem is tuned for the specific latency characteristics of the on-package memory. This gives Apple's cores more predictable memory latency than a PC with swappable DIMMs on a socket, which helps the wide back end stay fed.

Compiler co-design. Apple controls clang/LLVM on macOS and iOS, and they can (and do) optimise the compiler's instruction scheduling and register allocation specifically for their wide cores. A compiler that knows it has 8-wide dispatch and a 630-entry ROB can schedule instructions more aggressively, reorder loads to improve MLP, and avoid patterns that create unnecessary port conflicts.

Power budget trade-off. Apple's cores run at 3.2 to 3.5 GHz, compared to 5.0 to 5.8 GHz for Intel and AMD's performance cores. Frequency and voltage scale together: running at 60 percent of the frequency allows running at a significantly lower voltage, and because dynamic power is proportional to capacitance times voltage squared times frequency, the saving is roughly the square of the voltage ratio multiplied by the frequency ratio. The saved power budget is reinvested in width: more execution units, a larger ROB, wider dispatch. The net result is higher IPC at lower frequency, which achieves similar or better single-threaded performance at much lower power consumption. On SPEC CPU 2017, Firestorm achieves a score in the same range as Golden Cove despite the lower clock, at roughly half the wattage.
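To put rough numbers on the voltage and frequency trade-off, the sketch below uses the standard first-order model in which dynamic power is proportional to C * V^2 * f. The 0.8 voltage ratio is a hypothetical value chosen only for illustration (the article gives no voltages), and leakage power is ignored entirely.

```python
def relative_dynamic_power(voltage_ratio, frequency_ratio):
    """Dynamic power relative to a baseline, using P_dyn ~ C * V^2 * f.

    Switched capacitance C is assumed equal in both designs, so it cancels.
    """
    return voltage_ratio ** 2 * frequency_ratio

# Hypothetical: 60 percent of the frequency at 80 percent of the voltage.
saving = relative_dynamic_power(voltage_ratio=0.8, frequency_ratio=0.6)
print(f"Dynamic power relative to baseline: {saving:.0%}")
```

Under these assumed ratios the core has well over half of its dynamic power budget left to spend on additional width.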

The Firestorm core's measured IPC on SPEC CPU 2017 integer workloads is typically 3.0 to 4.0, versus 2.0 to 3.0 for Golden Cove. The extra width is not wasted; it captures ILP that the 6-wide x86 cores leave on the table. But the gains at best track the width increase rather than exceeding it (8/6 = 1.33x decode width for an IPC improvement of roughly 1.2 to 1.4x), even with the larger ROB and scheduler helping, which is consistent with diminishing returns still applying.

What Limits Going Wider

If Apple can go 8-wide, why not 16-wide? Why not 32-wide? Several constraints create a practical ceiling.

Dependency checking scales quadratically. The rename stage must check every new instruction against every other instruction in the same rename group. For an N-wide rename, this requires N*(N-1)/2 comparisons per cycle. Going from 6-wide to 8-wide increases comparisons from 15 to 28. Going to 16-wide increases them to 120. Each comparison is a circuit on the critical path of the rename stage, which must complete in a single clock cycle.
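The quadratic growth is easy to tabulate from the N*(N-1)/2 formula above. This sketch counts instruction pairs only, as the text does; real rename logic compares each source operand against earlier destinations, so the number of physical comparators is higher.

```python
def rename_comparisons(width):
    """Pairs of instructions to cross-check in one rename group: N*(N-1)/2."""
    return width * (width - 1) // 2

for width in (4, 6, 8, 16, 32):
    print(f"{width:>2}-wide rename: {rename_comparisons(width):>3} pairwise comparisons")
```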

Register file ports scale painfully. Each dispatched uop reads source operands and writes a result. A register file with R entries, P read ports, and Q write ports has area scaling roughly as R * (P + Q)^2, because every additional port adds both a wordline and a bitline to each cell. Doubling dispatch width roughly doubles the ports needed, which roughly quadruples the area and increases access latency until the register file limits clock frequency.

Bypass network complexity. Each execution unit must broadcast results to all reservation stations via a bypass network with O(N^2) wires. At 12 ports this is already a significant fraction of wire routing. At 24 it would dominate the physical layout.

Branch prediction bandwidth. A wider fetch block contains more branches, requiring multiple predictions per cycle. The predictor tables are already 64 KiB or more, and doubling lookup bandwidth is expensive.

Diminishing ILP. Even with infinite width, the ILP of typical integer code tops out at 4 to 8. Going wider provides no benefit during the serial phases.

Power. A core that is 50 percent wider consumes perhaps 20 to 30 percent more power (clock gating helps, but leakage from idle transistors is real at 4 nm), and that power comes from somewhere in the thermal budget.

These constraints explain why the industry has converged on 6 to 8-wide for performance cores and 3 to 4-wide for efficiency cores.

The GPU Approach: Thousands Of Simple Units

GPUs take the opposite approach to the same fundamental problem. Instead of building a few wide, complex cores with deep out-of-order machinery, a GPU builds thousands of simple, narrow, in-order execution units and relies on massive thread-level parallelism (TLP) to keep them busy.

An NVIDIA RTX 4090 (Ada Lovelace architecture) has 128 streaming multiprocessors (SMs), each containing 128 CUDA cores, totalling 16,384 simple ALUs. There is no out-of-order execution, no register renaming, and no reorder buffer. Instead, each SM keeps far more threads resident than it has ALUs (more than a thousand) and its schedulers pick ready groups of 32 threads (a "warp") each cycle to hide latency. When one warp stalls on a memory access, the scheduler switches to another warp that has data ready. With enough warps in flight, most of the memory latency is hidden.

This is a different bet about where parallelism comes from:

  • CPU bet: Parallelism is scarce and must be extracted from a single instruction stream using expensive hardware (OoO, rename, large ROB). The execution units are wide and complex to handle the diverse instruction mix of general-purpose code.

  • GPU bet: Parallelism is abundant in the target workload (graphics, matrix multiplication, physics simulation) and can be expressed as thousands of independent threads. The execution units can be simple and narrow because the thread scheduler handles latency hiding, not the execution pipeline.

The GPU approach achieves much higher throughput per watt for data-parallel workloads, but per-thread performance is far worse. A single CPU core at 5 GHz with IPC of 2 retires 10 billion operations per second. A single GPU thread on a CUDA core at 2.5 GHz retires 2.5 billion when running, but shares its core with dozens of other threads and may stall for hundreds of cycles on memory.

This is why CPUs and GPUs coexist. The CPU handles serial, branchy, pointer-heavy code (operating systems, query planners, parsers, garbage collectors) where single-thread latency matters. The GPU handles parallel, regular, throughput-oriented code (neural network inference, image processing, physics) where aggregate throughput matters. The CPU's wide superscalar core with its many execution units is the right design for the first kind of work, even if most of those units are idle most of the time.
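The per-thread versus aggregate trade-off can be put side by side using the figures quoted above. The GPU aggregate here is an idealised upper bound that assumes every CUDA core completes one operation per cycle with no stalls, which real kernels do not achieve; the function name is illustrative.

```python
def ops_per_second(clock_hz, ops_per_cycle):
    """Peak operations per second for one execution context."""
    return clock_hz * ops_per_cycle

cpu_thread = ops_per_second(5_000_000_000, 2)   # one CPU core, 5 GHz, IPC of 2
gpu_thread = ops_per_second(2_500_000_000, 1)   # one GPU thread while it is running
gpu_chip_peak = 128 * 128 * gpu_thread          # 128 SMs x 128 CUDA cores, idealised

print(f"CPU thread:         {cpu_thread:.3e} ops/s")
print(f"GPU thread:         {gpu_thread:.3e} ops/s")
print(f"GPU chip, idealised: {gpu_chip_peak:.3e} ops/s")
print(f"CPU thread is {cpu_thread / gpu_thread:.0f}x faster than one GPU thread")
```

The single CPU thread wins by a small constant factor per thread, while the GPU wins by orders of magnitude in aggregate, but only if the workload supplies enough independent threads.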

Concrete Example: Anatomy Of A Database Query

To make the utilisation story concrete, let us trace through what happens on the CPU during a simple analytical query: scanning a columnar table of 100 million rows, filtering on one column, and summing another.

In pseudo-SQL:

SELECT SUM(revenue) FROM orders WHERE region = 'EU';

A columnar database engine (like ClickHouse or DuckDB) processes this by iterating over the region column in blocks of 1,024 or so values, producing a selection bitmask, then iterating over the revenue column using the bitmask to select values for summation.

The filter phase (comparing each region value to 'EU') is highly parallel. The comparison of each value is independent, and a SIMD-vectorised implementation can compare 32 one-byte values per instruction using AVX2 VPCMPEQB (assuming the region column is stored as single-byte codes, for example through dictionary encoding). During this phase, the vector execution ports (ports 0 and 1 on Golden Cove) are busy, the load ports are busy streaming data from L2 or L3, and the integer ALUs are mostly idle (just loop control). Execution port utilisation might reach 60 to 70 percent of the theoretical maximum.

The aggregation phase (summing revenue values selected by the bitmask) has more dependencies. Each selected value must be loaded (which requires computing the offset from the bitmask using a population count and a lookup), added to the accumulator, and the accumulator carries a dependency chain. With careful implementation using multiple accumulators and SIMD horizontal adds, the parallelism can be kept reasonable, but it is lower than the filter phase. Port utilisation might drop to 40 to 50 percent.

Between blocks, there is bookkeeping: advancing the block pointer, checking for end-of-column, potentially decompressing the next block. This code is scalar, branchy, and has high latency due to metadata cache misses. Port utilisation might drop to 20 percent.
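The three phases described above can be sketched structurally as follows. This is plain Python written to show the shape of the computation (independent comparisons, then a mask-driven sum spread over several accumulators, then per-block bookkeeping); it is not how ClickHouse or DuckDB implement it, it says nothing about real performance, and all function names and the sample data are hypothetical.

```python
BLOCK_SIZE = 1024  # values per block, matching the "1,024 or so" figure above

def filter_block(region_block, wanted):
    """Filter phase: one independent comparison per row, packed into a bitmask."""
    mask = 0
    for i, value in enumerate(region_block):
        if value == wanted:
            mask |= 1 << i
    return mask

def sum_selected(revenue_block, mask, lanes=4):
    """Aggregation phase: several accumulators keep the add chains short."""
    accumulators = [0] * lanes
    i = 0
    while mask:
        if mask & 1:
            accumulators[i % lanes] += revenue_block[i]
        mask >>= 1
        i += 1
    return sum(accumulators)  # final reduction across the lanes

def run_query(region, revenue, wanted):
    """Bookkeeping phase: walk the columns block by block."""
    total = 0
    for start in range(0, len(region), BLOCK_SIZE):
        end = start + BLOCK_SIZE
        mask = filter_block(region[start:end], wanted)
        total += sum_selected(revenue[start:end], mask)
    return total

# Hypothetical sample columns, small enough to run instantly.
regions = ["EU", "US", "EU", "APAC"] * 1000
revenues = list(range(4000))
print(run_query(regions, revenues, "EU"))
```

In a vectorised engine the body of filter_block becomes a handful of SIMD compares and the accumulators become vector registers, which is where the difference in port utilisation between the phases comes from.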

The average across the entire query might be 35 to 45 percent port utilisation. But the peak (during the SIMD filter inner loop) demands 6 or more ports simultaneously, and having fewer than that would directly reduce the scan throughput. The "extra" ports that sit idle during the bookkeeping phase are what allow the scan phase to run at full speed.
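How the per-phase figures combine into the overall average depends on how long the query spends in each phase. The sketch below uses the midpoints of the utilisation ranges given above; the split of cycles between phases (30, 30, and 40 percent) is a hypothetical assumption made only to show the calculation.

```python
def weighted_utilisation(phases):
    """phases: list of (share_of_cycles_percent, port_utilisation_percent)."""
    total_share = sum(share for share, _ in phases)
    return sum(share * util for share, util in phases) / total_share

phases = [
    (30, 65),  # filter: midpoint of 60 to 70 percent
    (30, 45),  # aggregation: midpoint of 40 to 50 percent
    (40, 20),  # bookkeeping between blocks
]

average = weighted_utilisation(phases)
peak = max(util for _, util in phases)
print(f"average utilisation: {average:.0f}%, peak phase: {peak}%")
```

The average lands well below the peak, and sizing the core for the average would throttle exactly the phase that does most of the useful work.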

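You can measure this yourself on a Golden Cove machine using perf with the TMA methodology (recent kernel and perf versions expose comparable top-down metrics on Zen 4, though the exact options may differ):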

perf stat --topdown -- clickhouse-local \
  --query "SELECT sum(revenue) FROM file('orders.parquet') WHERE region = 'EU'"

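By default perf stat prints a single aggregate breakdown for the whole run; adding an interval option such as -I 100 prints it every 100 ms so that you can watch it change between phases. The TMA breakdown should show the Retiring fraction peaking during the scan/filter phases and dropping during metadata handling, with the Backend Bound fraction rising during cache-miss phases and the Bad Speculation fraction showing occasional spikes from branch mispredictions at block boundaries.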

What Real Workloads Look Like In The Pipeline

Different workloads stress different parts of the execution back end. Based on published performance counter studies and analysis from Agner Fog's optimisation guides, the characteristic patterns are:

Web servers (nginx, Node.js): IPC of 1.0 to 1.5. Front-end bound from I-cache misses and branch mispredictions due to large code footprints and dynamic dispatch. Execution units are mostly starved for uops rather than overloaded.

OLTP databases (PostgreSQL, MySQL): IPC of 1.2 to 2.0. Back-end memory bound from B-tree traversal and hash table lookups. Integer ALUs sit idle while the core waits for cache lines.

Video encoding (x264, x265): IPC of 2.5 to 4.0 during hot vectorised kernels, dropping to 1.5 during scalar rate-control decisions. One of the few workloads that regularly saturates the vector ports.

Compilers (GCC, clang): IPC of 1.4 to 2.0. Back-end memory bound from large AST and IR data structures that overflow L3.

Dense linear algebra (BLAS dgemm): IPC of 3.0 to 5.0, limited by FMA throughput. The poster child for port pressure analysis, where kernels are hand-tuned to hit the theoretical ceiling of the FMA ports.

Game engine main threads: IPC swings from 1.0 during scene graph traversal to 3.5 during physics updates. The wide back end matters because stalling during the physics burst directly increases frame time.

Why The Surplus Is Structural, Not Incidental

The recurring theme is that execution unit surplus is a design choice, not a design flaw. The extra units serve several purposes that average utilisation metrics do not capture:

Latency reduction through opportunistic execution. When a burst of independent instructions becomes ready, the machine can execute them all in one or two cycles instead of queuing them. This reduces the latency of downstream dependent instructions, which reduces the total execution time. The benefit is invisible in utilisation metrics but visible in wall-clock time.

Covering for variable instruction mixes. The instruction mix changes from cycle to cycle. In one cycle, the machine might need 3 integer ALU operations, 1 load, and 1 branch. In the next cycle, it might need 0 integer ALU, 2 loads, 1 store, and 1 FMA. Having only 2 integer ALUs would create a stall in the first cycle; having only 1 load port would create a stall in the second. Wide per-category availability (4 ALUs, 2 or 3 load ports, 2 FMA units) ensures that no single category becomes a persistent bottleneck.

Enabling memory-level parallelism. Multiple load ports help the machine get multiple cache miss requests in flight simultaneously. This is memory-level parallelism (MLP), and it is one of the most important performance mechanisms on modern hardware. A single load port can start only one load per cycle, so a group of independent loads reaches the cache one at a time and the later misses are discovered later. Two or three load ports issue them sooner, and together with enough miss-tracking buffers and a large ROB they let independent misses overlap instead of serialising, which can cut the effective memory stall time by up to 2x to 3x on workloads with independent load chains.
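The 2x to 3x figure follows from a simple overlap model. The sketch below assumes a fixed miss latency, fully independent misses, and no bandwidth limit; the latency of 300 cycles and the count of 12 misses are hypothetical values chosen for illustration.

```python
import math

def miss_stall_cycles(misses, latency, in_flight):
    """Stall cycles for `misses` independent cache misses when at most
    `in_flight` of them can overlap. Fixed latency, no bandwidth limit."""
    return math.ceil(misses / in_flight) * latency

MISS_LATENCY = 300  # hypothetical miss latency in cycles
MISSES = 12         # hypothetical number of independent misses

for overlap in (1, 2, 3):
    cycles = miss_stall_cycles(MISSES, MISS_LATENCY, overlap)
    print(f"{overlap} miss(es) in flight: {cycles} stall cycles")
```

Dependent loads (pointer chasing) get none of this benefit, because the address of the next miss is not known until the previous one returns.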

Reducing scheduler pressure. A wider back end drains the scheduler queues faster. If the scheduler fills up (because the back end cannot dispatch fast enough), the front end stalls, and the pipeline drains. This creates a cascade of lost cycles that is much more expensive than the idle power of a few extra execution units.

Supporting diverse instruction set extensions. AVX-512, AES-NI, SHA, CRC32, and other extensions each need dedicated hardware. These units are idle when the workload does not use the extension, but they are critical when it does. An AES-NI unit is idle almost all of the time on a web server, but when encrypting and decrypting TLS traffic it processes 128-bit cipher blocks at a rate that would take dozens of cycles per block with generic integer instructions. The alternative (not having the dedicated unit and doing AES in software) would make bulk encryption several times slower and add that cost to every encrypted connection.

The Power Question

If the extra execution units are mostly idle, do they waste power? Less than you might expect, because of clock gating. When an execution unit has no uop dispatched, its clock signal is suppressed, eliminating dynamic power. The unit still draws static power (leakage), but a single integer ALU is perhaps 2 to 3 percent of core area, so the leakage cost of an idle ALU is on the order of 1 percent of core power. Two extra ALUs that are idle 50 percent of the time cost 1 to 2 percent of core power in exchange for meaningful IPC gains.

AVX-512 is the notable exception. The 512-bit FMA units are large and power-hungry. Intel throttles the core clock by 100 to 300 MHz when AVX-512 is active, and the peak non-AVX-512 frequency is constrained by the thermal budget that must accommodate those units' worst-case power draw. AMD sidesteps this with Zen 4 by implementing AVX-512 as double-pumped 256-bit operations, trading half the AVX-512 throughput for higher sustained clock speed on everything else.

Where The Industry Goes From Here

The trajectory is visible in the recent generational changes:

Intel's upcoming cores (Lion Cove, Panther Cove) are expected to continue at 6-wide or possibly 8-wide decode, with incremental improvements to the ROB size, scheduler depth, and branch predictor. The emphasis is shifting from wider dispatch to smarter dispatch: better prefetching, more accurate memory disambiguation, and reduced penalty for cache misses.

AMD's Zen 5 maintains the 4-wide decode but expands the back end, with wider SIMD units (full 512-bit native execution) and a larger ROB. AMD's bet is that the Op Cache bypass makes decode width less important than back-end execution resources.

Apple continues to push the width frontier, with rumours of 10-wide or wider decode in future generations. Apple's control over the ISA (they can add custom instructions that improve code density and ILP) and the software stack gives them headroom that x86 vendors do not have.

The direction is consistent: smarter, not just wider. The low-hanging width gains have been captured, and future IPC improvements will come from better prediction, better prefetching, and better latency tolerance rather than from adding a thirteenth execution port.

Conclusion

The surplus execution units in a modern CPU core are not over-engineering. They are a deliberate response to the statistical properties of real instruction streams: bursty parallelism, variable instruction mixes, long-latency stalls that can only be tolerated by executing independent work, and the fundamental limit that most programs have an ILP of 2 to 4, which means any given cycle only uses a fraction of the available resources.

Building a core that matches the average parallelism would save silicon and power but would stall on every burst, losing far more performance than the saved resources are worth. Building a core that matches the peak would be wasteful in a different way, hitting diminishing returns as width increases. The sweet spot, 6 to 8-wide dispatch with 10 to 12 execution ports, is where the marginal IPC gain from one more port roughly equals the marginal cost in area, power, and complexity.

The next time you look at a perf stat output showing IPC of 1.8 on your 6-wide machine and wonder why 70 percent of the pipeline is idle, remember that the idle slots are not waste. They are capacity held in reserve for the moments when your code has enough parallelism to use them, and those moments, brief as they are, determine whether your program runs in 4 seconds or in 6.