11-04-2026

Why x86 Processors Break Instructions Apart And Fuse Them Back Together


Open any x86 instruction reference and the first thing you notice is how irregular the encoding is. There are well over 1,500 mnemonics in Intel 64, many with dozens of operand forms, and each instruction can be anywhere from one to fifteen bytes long. Even a plain ADD can be tiny and simple or turn into a long, prefix-heavy read-modify-write instruction that the decoder has to parse byte by byte.

That is a terrible shape for an out-of-order execution engine. Modern x86 cores solve it by translating instructions into smaller, fixed-format internal operations called micro-ops, then doing most of the real scheduling work on those. Some of the split work is later fused back together to save front-end bandwidth.

This article follows that crack-and-fuse pipeline in detail: why x86 needs it, how the decode path produces micro-ops, where fusion helps, and which costs never quite go away.

The x86 Encoding Problem

The x86 instruction set was designed in 1978 for the 8086, a 16-bit microprocessor that ran at 5 MHz and had 29,000 transistors. Intel's engineers optimised for code density, because memory was expensive and bus bandwidth was scarce. The encoding scheme they chose was variable-length, prefix-based, and deeply irregular. A modern x86-64 instruction can consist of up to six distinct fields, each of which can be present or absent depending on the opcode and the operand form:

[Legacy prefixes] [REX/VEX/EVEX prefix] [Opcode] [ModR/M + SIB] [Displacement] [Immediate]

Legacy prefixes include operand-size overrides (0x66), address-size overrides (0x67), segment overrides (0x2E, 0x36, etc.), the LOCK prefix (0xF0), and the REPNE/REP prefixes (0xF2, 0xF3). Any combination of these can appear, in any order, before the opcode. The REX prefix (0x40 through 0x4F) extends register addressing to 16 registers in 64-bit mode. VEX prefixes (two or three bytes) and EVEX prefixes (four bytes) encode AVX and AVX-512 instructions. The opcode itself can be one, two, or three bytes (using the 0x0F escape byte and secondary escape bytes 0x38 and 0x3A). The ModR/M byte encodes the addressing mode and register operands, and if the addressing mode uses a scaled index, a SIB (Scale-Index-Base) byte follows. After all of that, a displacement (1, 2, or 4 bytes) and an immediate (1, 2, or 4 bytes, or 8 bytes for MOV r64, imm64) can appear.

The total length of an instruction is unknown until the decoder has inspected several bytes. Consider these three instructions:

; 1 byte: NOP (0x90)
nop
 
; 3 bytes: ADD EAX, 1 (0x83 0xC0 0x01)
add eax, 1
 
; 13 bytes: LOCK ADD QWORD [RBX + RCX*8 + 0x12345678], 0x11223344
; F0 48 81 84 CB 78 56 34 12 44 33 22 11
lock add qword [rbx + rcx*8 + 0x12345678], 0x11223344

The decoder cannot determine where instruction N+1 starts until it has finished parsing instruction N. This is a serial dependency chain in the encoding format. On a 4-wide decode front end that wants to decode four instructions per cycle, the decoder must determine four instruction boundaries in one cycle, which requires speculative length pre-decoding and careful alignment logic. Most RISC architectures avoid this problem by fixing instructions at 4 bytes (ARM AArch64, MIPS, SPARC, Power, and RISC-V without its compressed extension).
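
The serial dependency is easy to see in a toy length decoder. The sketch below is a simplification that assumes 64-bit mode and knows only the three opcodes used in the examples above (the `OPCODES` table and both function names are made up for illustration, and it ignores details such as 0x66 shrinking an imm32). The point is the loop in `boundaries`: the start of each instruction is only known once the previous length has been computed.

```python
# Toy x86-64 length decoder: prefixes, REX, ModR/M, SIB, displacement, immediate.
LEGACY_PREFIXES = {0x66, 0x67, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0xF0, 0xF2, 0xF3}

# opcode -> (has ModR/M byte, immediate size in bytes); covers only the examples.
OPCODES = {0x90: (False, 0), 0x83: (True, 1), 0x81: (True, 4)}


def insn_length(code, start):
    """Return the length in bytes of the instruction starting at `start`."""
    i = start
    while code[i] in LEGACY_PREFIXES:      # any number of legacy prefixes
        i += 1
    if 0x40 <= code[i] <= 0x4F:            # optional REX prefix
        i += 1
    has_modrm, imm_bytes = OPCODES[code[i]]
    i += 1
    if has_modrm:
        modrm = code[i]
        i += 1
        mod, rm = modrm >> 6, modrm & 7
        if mod != 3 and rm == 4:           # SIB byte follows
            sib = code[i]
            i += 1
            if mod == 0 and (sib & 7) == 5:
                i += 4                     # no base register: disp32
        if mod == 1:
            i += 1                         # disp8
        elif mod == 2:
            i += 4                         # disp32
        elif mod == 0 and rm == 5:
            i += 4                         # RIP-relative disp32
    return i + imm_bytes - start


def boundaries(code):
    """Walk the byte stream; each start depends on the previous length."""
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        pos += insn_length(code, pos)
    return starts


STREAM = bytes.fromhex("90" "83C001" "F0488184CB7856341244332211")
for s in boundaries(STREAM):
    print(s, insn_length(STREAM, s))
```

Hardware cannot afford this loop, which is why real front ends compute candidate lengths at many byte offsets at once and then select the valid ones.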

The irregularity goes deeper. An x86 ADD instruction has over a dozen operand forms: register to register, immediate to register, immediate to memory, register to memory, memory to register, each in 8-bit, 16-bit, 32-bit, and 64-bit widths. Each form has a different opcode byte. Some opcodes reuse the ModR/M reg field as an opcode extension, so the decoder cannot determine which instruction it is looking at until it has parsed the ModR/M byte. This applies to about a third of the instruction set. All of this makes x86 decode logic large, power-hungry, and slow compared to a RISC decoder.

Why RISC Won The Internal Argument

From the perspective of microarchitecture, the RISC versus CISC debate of the 1980s was never really a debate. Every serious CPU architect from David Patterson to John Hennessy to Bob Colwell (lead architect of the Pentium Pro) arrived at the same conclusion: the execution core wants fixed-width, register-to-register operations with a small number of simple formats.

The reasons are structural. An out-of-order execution engine needs to do several things in parallel every cycle: read source operands from a register file, dispatch operations to functional units, write results back, and check for completion. All of these steps are easier when every operation has the same width, the same number of source and destination operands, and a predictable format. A fixed-format operation can be decoded with a shallow, fixed block of logic, because every field sits at a known bit position. A variable-format operation requires a chain of multiplexers and priority encoders to figure out which fields are present, which adds latency and area.

Register renaming needs to know exactly how many source and destination registers each operation references. For a fixed-format RISC instruction, this is trivially determined by the opcode class. For an x86 instruction that might read two registers and a memory location, write a register and a set of flags, and implicitly reference the stack pointer, the rename logic has to handle a variable number of operands per instruction, which complicates the rename table and the dependency tracking.

The reservation station needs to compare source tags against all completing operations every cycle. The size of this comparison network grows with the product of the number of entries, the number of source operands per entry, and the number of result buses. Keeping the number of sources fixed at two or three (as micro-ops do) bounds this cost.

The solution that Intel adopted for the Pentium Pro in 1995 is to translate CISC instructions into an internal RISC-like format at the front of the pipeline, and build the execution core around that internal format. The translation adds decode latency and power, but the execution core can be built with the same techniques that made MIPS R10000 and Alpha 21264 fast. The x86 ISA becomes a compatibility layer wrapped around a RISC engine.

The Pentium Pro Revolution

The Pentium Pro (P6 microarchitecture, November 1995) was the first x86 processor to implement full out-of-order execution with register renaming and micro-op decomposition. Bob Colwell's team at Intel Oregon designed it with a clear philosophy: the front end translates x86 into micro-ops, and the back end is a conventional superscalar out-of-order engine that does not know or care about x86.

The P6 front end had three decoders. One "complex" decoder could handle instructions that decomposed into up to four micro-ops. Two "simple" decoders each handled instructions that mapped to exactly one micro-op. The maximum decode throughput was therefore 4+1+1 = 6 micro-ops per cycle, but in practice the complex decoder was rarely needed for more than one micro-op, so sustained throughput was closer to 3 micro-ops per cycle for typical code.
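
The effect of instruction ordering on a 4-1-1 decoder can be sketched with a small model. This is a simplified illustration, not a description of the real P6 steering logic: it assumes instructions are consumed strictly in order, that the first instruction each cycle always goes to the complex decoder, and that a multi-micro-op instruction arriving at a simple decoder ends the cycle. The function name `cycles_411` is invented for the example.

```python
def cycles_411(uop_counts):
    """Cycles needed to decode a sequence, given micro-ops per instruction."""
    cycles, i = 0, 0
    while i < len(uop_counts):
        cycles += 1
        if uop_counts[i] > 4:
            raise ValueError("more than 4 micro-ops: needs the microcode sequencer")
        i += 1                              # decoder 0 (complex) takes one instruction
        for _ in range(2):                  # decoders 1 and 2 (simple)
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break                       # multi-uop instruction waits for decoder 0
    return cycles


print(cycles_411([1, 1, 1, 1, 1, 1]))   # six simple instructions
print(cycles_411([2, 1, 1, 2, 1, 1]))   # complex instructions lead each group
print(cycles_411([2, 2, 2]))            # every instruction needs decoder 0
```

Under these assumptions the first two sequences decode in two cycles each and the third takes three, which is why P6-era optimisation guides recommended interleaving multi-micro-op instructions with simple ones.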

The micro-ops produced by the decoders were placed into a reorder buffer (ROB) of 40 entries, which tracked them from dispatch through execution to retirement. Each micro-op carried an opcode (from an internal opcode space, not the x86 opcode), two source register tags, one destination register tag, an immediate value, and control flags. The format was fixed-width and regular, with no variable-length encoding, no prefixes, no ModR/M ambiguity. The execution core saw a clean stream of operations that could be scheduled and dispatched using textbook superscalar techniques.

The performance impact was dramatic. The Pentium Pro, running at 200 MHz, matched or exceeded the performance of contemporary RISC processors (MIPS R10000, Alpha 21164) at comparable clocks, despite the overhead of x86 decode. It proved that the CISC tax was manageable if you were willing to spend transistors on decode logic, and it set the template for every subsequent x86 design. The P6 architecture evolved through the Pentium II, Pentium III, and Pentium M, but the basic structure (three decoders feeding micro-ops into an OOO back end) persisted for a decade.

How Decode Works: Simple, Complex, And MSROM

Modern Intel and AMD processors have refined the decode stage into a multi-tier system. The details differ between vendors, but the general structure is consistent. Here is how Intel's decode pipeline works from Skylake through Golden Cove (the core in Alder Lake and Sapphire Rapids).

Instruction Fetch And Pre-Decode

The instruction fetch unit pulls 16 bytes per cycle from the L1 instruction cache (or from the branch prediction unit's target buffer if a taken branch redirected fetch). These 16 bytes are fed into a pre-decode stage that determines instruction boundaries. The pre-decoder scans the byte stream, identifies prefix bytes, locates opcode bytes, and marks the start and end of each instruction. This is the stage that solves the variable-length problem, and it is one of the most complex pieces of combinational logic on the chip.

The pre-decoder on modern Intel cores can identify up to six instruction boundaries per cycle in the 16-byte window. If an instruction spans the 16-byte boundary, it is split across two fetch cycles, which costs a bubble in the decode pipeline. Compilers and assemblers work to avoid this by aligning branch targets and loop headers to 16-byte or 32-byte boundaries.
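
Whether an instruction straddles a fetch window is simple address arithmetic. The helper below is an illustrative sketch (the name `crosses_window` and the example addresses are made up); it assumes windows are naturally aligned, as the 16-byte fetch blocks described above are.

```python
def crosses_window(addr, length, window=16):
    """True if an instruction at `addr` of `length` bytes spans two windows."""
    first_byte_window = addr // window
    last_byte_window = (addr + length - 1) // window
    return first_byte_window != last_byte_window


print(crosses_window(0x100A, 7))   # bytes 0x100A..0x1010 span two windows
print(crosses_window(0x100D, 3))   # bytes 0x100D..0x100F fit in one
```

An assembler's alignment directive in front of a loop header is the usual way to keep the first instruction of a hot loop from landing in the first case.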

The Decode Stage

After pre-decode, the identified instructions are dispatched to one of four or five decoders (depending on the microarchitecture). Intel's current designs (from Skylake onward) have four simple decoders and one complex decoder, for a peak decode throughput of 5 instructions per cycle. AMD's Zen 4 has four decoders that can each handle instructions up to 2 micro-ops, for a peak of 4 instructions (up to 8 micro-ops) per cycle.

A simple decoder handles instructions that translate to exactly one micro-op. This covers the majority of the instruction set: register-to-register ALU operations, simple loads, simple stores, conditional and unconditional branches, and most SSE/AVX operations on registers. Examples:

ADD RAX, RBX          -> 1 uop (ALU)
MOV RCX, [RDI]        -> 1 uop (load)
CMP R8, R9            -> 1 uop (ALU)
JE .target            -> 1 uop (branch)
VADDPS YMM0, YMM1, YMM2 -> 1 uop (FP ALU)

The complex decoder handles instructions that translate to two, three, or four micro-ops. These are instructions that combine a memory access with an ALU operation, or that have implicit side effects. Examples:

ADD [RDI], RAX        -> 2 fused uops (load+ALU, store-address+store-data); 4 unfused
PUSH RBX              -> 1 uop* (micro-fused store + RSP update, handled by stack engine)
CALL .function        -> 2 uops (push return address + jump)
XCHG RAX, [RDI]      -> ~8 uops (MSROM, because of implicit LOCK)
LEAVE                 -> 3 uops (MOV RSP,RBP + POP RBP)

Note that PUSH is listed as 1 micro-op on modern Intel cores despite being logically two operations (decrement RSP, store to [RSP]). This is because the stack engine handles the RSP update separately, outside the OOO engine. We will return to this.

The MSROM

Instructions that decompose into more than four micro-ops cannot be handled by the regular decoders. They are dispatched to the Microcode Sequencer ROM (MSROM), a small state machine that generates a sequence of micro-ops from a microcode program stored in ROM. While the MSROM is active, the regular decoders are stalled, which makes microcoded instructions expensive.
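
The three tiers can be summarised as a routing decision on the micro-op count. This is a deliberately coarse sketch: the function name `decode_path` is invented, and real steering also depends on instruction position, prefixes, and the specific microarchitecture.

```python
def decode_path(uop_count):
    """Pick the decode tier for an instruction by its micro-op count."""
    if uop_count < 1:
        raise ValueError("an instruction produces at least one micro-op")
    if uop_count == 1:
        return "simple"      # any decoder can handle it
    if uop_count <= 4:
        return "complex"     # only the complex decoder
    return "msrom"           # microcode sequencer; regular decoders stall


for name, uops in [("ADD RAX, RBX", 1), ("LEAVE", 3), ("CPUID", 100)]:
    print(name, "->", decode_path(uops))
```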

Examples of microcoded instructions:

REP MOVSB             -> variable length, tens to hundreds of uops depending on RCX
CPUID                 -> ~100+ uops
WRMSR                 -> microcoded (privileged)
DIV R64               -> ~30-40 uops on Intel before Ice Lake; only a few uops on Ice Lake and later and on AMD Zen
XSAVE                 -> microcoded (variable length, depends on state mask)

The MSROM is also used for microcode assists, which handle rare conditions like denormal floating-point operands or page-table updates (setting accessed and dirty bits) that the fast hardware path does not cover. These assists are invisible to the programmer but show up in performance counters (the ASSISTS.ANY counter on Intel).

The practical rule for performance-sensitive code: avoid microcoded instructions on the hot path. Use simple instructions that decode 1:1 wherever possible. This is why hand-tuned inner loops in libraries like glibc's memcpy use sequences of VMOVDQU loads and stores instead of REP MOVSB, even though REP MOVSB is optimised in microcode on modern processors. On Ice Lake and later, REP MOVSB is competitive above roughly 256 bytes, but the startup cost of the microcode sequence is still several cycles.

What A Micro-Op Contains

A micro-op is the atomic unit of work inside the execution engine. Its exact format is proprietary and undocumented, but from patent filings, microarchitectural studies, and inference from performance counters, the general structure is well understood.

Each micro-op contains:

Opcode: an internal opcode from a much simpler namespace than x86. Roughly 200 to 400 distinct operations, covering integer ALU, floating-point ALU, SIMD, load, store-address, store-data, branch, and special operations.
Two source register specifiers: pointing into the physical register file (after renaming). These are tags, not architectural register names.
One destination register specifier: also a physical register tag.
Immediate data: a small constant value, if the instruction carried one.
Flags control: which condition flags (ZF, CF, SF, OF, PF) the micro-op reads and/or writes.
Memory attributes: for load and store micro-ops, the memory address computation parameters (base, index, scale, displacement) or a reference to a prior address-generation micro-op.
Dispatch port mask: which execution port(s) can execute this micro-op.
Micro-op type: a classification used by the scheduler (integer, FP, load, store, branch).

The total width of a micro-op entry in the ROB is estimated at 80 to 120 bits on modern Intel cores, based on die area analysis and patent descriptions. This is much wider than a RISC instruction (32 bits for AArch64), but the micro-op carries all the information needed for scheduling and execution in a single, uniform packet. The out-of-order engine reads fixed fields at fixed offsets. The rename stage knows that sources are always in bits [X:Y] and the destination is always in bits [Z:W]. This regularity is what makes wide issue (6 to 8 micro-ops per cycle on Golden Cove) tractable.
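
As a mental model, the field list above maps onto a fixed record. The dataclass below is purely illustrative: the real format is undocumented, so every field name, type, and example value here is an assumption chosen to mirror the list, not Intel's layout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MicroOp:
    opcode: str                                    # internal opcode, not an x86 opcode
    src_tags: Tuple[Optional[int], Optional[int]]  # physical register tags
    dst_tag: Optional[int]                         # physical register tag
    immediate: Optional[int] = None
    flags_read: frozenset = frozenset()
    flags_written: frozenset = frozenset()
    mem: Optional[tuple] = None                    # (base, index, scale, disp) for loads/stores
    port_mask: int = 0                             # bit p set => port p may execute it
    uop_type: str = "integer"


def allowed_ports(uop, num_ports=12):
    """Expand the port mask into a list of port numbers."""
    return [p for p in range(num_ports) if (uop.port_mask >> p) & 1]


# A hypothetical renamed integer add that may run on ports 0, 1, 5, 6, or 10.
add = MicroOp(
    opcode="ADD",
    src_tags=(41, 57),
    dst_tag=112,
    flags_written=frozenset({"ZF", "CF", "SF", "OF", "PF"}),
    port_mask=sum(1 << p for p in (0, 1, 5, 6, 10)),
)
print(allowed_ports(add))
```

Every consumer of the record reads the same fields in the same place, which is the property the rename and scheduling stages depend on.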

Why Cracking Helps: Scheduling, Renaming, And Dispatch

With the micro-op format established, we can trace why cracking x86 instructions into micro-ops makes the rest of the pipeline work.

Register Renaming

x86-64 has 16 general-purpose architectural registers (RAX through R15) and a FLAGS register. This is a tiny namespace for an out-of-order engine that might have 300 or more instructions in flight. Without renaming, two independent instruction sequences that both use RAX would be serialised, not because of a true data dependency, but because they share the same architectural register name.

Register renaming maps each architectural register reference to a physical register from a much larger physical register file (PRF). Intel's Golden Cove has 280 integer physical registers and 332 FP/vector physical registers. Each micro-op's destination gets a fresh physical register, and the mapping table (RAT, Register Alias Table) is updated to reflect the new mapping. Subsequent micro-ops that read the same architectural register get the physical register tag from the RAT, creating a true data dependency chain. Subsequent micro-ops that write the same architectural register get a new physical register, breaking the false dependency.

This works cleanly because each micro-op has exactly one destination and at most two explicit sources. The RAT needs to perform at most two reads and one write per micro-op per cycle, and it processes multiple micro-ops in parallel (6 per cycle on Golden Cove). If x86 instructions were fed directly into the rename stage, the variable number of implicit reads and writes per instruction would make the RAT design much harder. A single PUSHA instruction (which pushes eight registers onto the stack) would need eight source reads and a stack pointer write, blowing the port budget of any reasonable RAT. By cracking such instructions into micro-ops first, each one fits the fixed rename interface.
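
A minimal renamer makes the mechanism concrete. This sketch assumes a free list that never runs dry and omits retirement (which returns old physical registers to the free list); the class name `Renamer` and the register numbers are invented for the example.

```python
class Renamer:
    def __init__(self, arch_regs, num_phys):
        # Initially each architectural register maps to its own physical register.
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename(self, dst, srcs):
        """Rename one micro-op: at most two source reads, one destination write."""
        phys_srcs = [self.rat[s] for s in srcs]   # read sources before updating dst
        phys_dst = self.free.pop(0)               # fresh physical register
        self.rat[dst] = phys_dst
        return phys_dst, phys_srcs


r = Renamer(["rax", "rbx", "rcx"], num_phys=8)
print(r.rename("rax", ["rbx", "rcx"]))   # rax = rbx + rcx
print(r.rename("rcx", ["rax", "rax"]))   # true dependency on the new rax
print(r.rename("rax", ["rbx", "rbx"]))   # second write to rax gets its own register
```

The third micro-op writes RAX again but receives a different physical register and reads only the original RBX, so it can execute in parallel with the first two.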

The Scheduler (Reservation Station)

After renaming, micro-ops are placed into the scheduler (also called the reservation station on Intel, the scheduling queue on AMD). The scheduler holds micro-ops that are waiting for their source operands to become ready. Each cycle, the scheduler checks which micro-ops have all sources ready, picks a subset (up to 6 on Golden Cove), and dispatches them to execution ports.

The readiness check is a tag-matching network. When an execution unit produces a result, it broadcasts the destination physical register tag on a result bus. Every scheduler entry compares this tag against its source tags, and when a match occurs, the corresponding source is marked ready. This comparison happens every cycle for every scheduler entry against every result bus, so the area grows as (scheduler_entries * source_count * result_buses). Keeping the source count fixed at two per micro-op bounds one dimension of this product.
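
The product is worth evaluating once to see how quickly it grows. In the sketch below the number of result buses (6) is an assumed figure used only for illustration; the 97-entry scheduler size is the Golden Cove figure quoted in this article.

```python
def tag_comparators(entries, sources_per_entry, result_buses):
    """Comparators needed so every source tag can watch every result bus."""
    return entries * sources_per_entry * result_buses


two_src = tag_comparators(97, 2, 6)
three_src = tag_comparators(97, 3, 6)
print(two_src, three_src, three_src / two_src)
```

Allowing a third source per entry raises the comparator count by half, and each comparator switches every cycle, so the cost shows up in power as well as area.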

Intel's Golden Cove has a 97-entry integer scheduler and a 128-entry FP/vector scheduler. AMD's Zen 4 has a somewhat different organisation, with per-cluster schedulers, but the principle is the same: fixed-format micro-ops make the tag-matching hardware tractable.

Dispatch And Execution Ports

Modern x86 cores have between 6 and 12 execution ports, each connected to one or more functional units. A micro-op is dispatched to a specific port based on its type. On Golden Cove:

Port 0: Integer ALU, FP/Vector ALU, FP DIV, Branch
Port 1: Integer ALU, FP/Vector ALU, Integer MUL
Port 2: Load (AGU + data)
Port 3: Load (AGU + data)
Port 4: Store data
Port 5: Integer ALU, FP/Vector ALU, Vector shuffle
Port 6: Integer ALU, Branch
Port 7: Store AGU
Port 8: Store AGU
Port 9: Store data
Port 10: Integer ALU
Port 11: Load (AGU + data)

That is 12 ports, capable of dispatching up to 12 micro-ops per cycle (though the rename/allocate stage limits sustained throughput to 6 per cycle). The key insight is that each micro-op goes to exactly one port and uses exactly one functional unit for exactly one cycle (for simple operations). This one-to-one mapping is possible because the cracking stage has already decomposed complex x86 instructions into simple operations that each fit a single functional unit.

Consider ADD [RDI], RAX. This x86 instruction reads from memory, adds, and writes back to memory. It decomposes into micro-ops that flow through separate ports (the store is itself split into a store-address and a store-data operation):

uop 1: LOAD tmp, [RDI]      -> dispatched to port 2, 3, or 11 (load unit)
uop 2: ADD tmp, tmp, RAX    -> dispatched to port 0, 1, 5, 6, or 10 (integer ALU)
uop 3: STORE-ADDR [RDI]     -> dispatched to port 7 or 8 (store AGU)
uop 4: STORE-DATA tmp       -> dispatched to port 4 or 9 (store data)

Without cracking, the execution engine would need a combined load-ALU-store functional unit, which would be slower (serial execution within the unit) and less flexible (the load port and ALU port are tied together even when one is idle).

Macro-Op Fusion: Undoing The Split Before It Happens

After all this effort to decompose x86 instructions into micro-ops, the processor turns around and fuses some of them back together. The first fusion mechanism is macro-op fusion, which combines two consecutive x86 instructions into a single micro-op before they enter the OOO engine.

The canonical example is a comparison followed by a conditional branch:

cmp rax, rbx
je  .equal

Without fusion, this produces two micro-ops: one for the CMP (which sets flags) and one for the JE (which reads flags and branches). With macro-op fusion, the decoder recognises the pair and produces a single "compare-and-branch" micro-op that does both operations in one shot. This fused micro-op consumes one slot in the ROB, one slot in the scheduler, and one dispatch port. The savings cascade through the entire pipeline: one fewer rename, one fewer scheduler entry, one fewer dispatch, one fewer retirement.

Which Pairs Fuse

Macro-op fusion is not universal. The rules depend on the microarchitecture and are documented in Intel's optimisation manual (Volume 1, Section 3.4.2.2 for Skylake-era). The general pattern on modern Intel cores (Skylake through Raptor Lake):

First instruction (flag-setting): CMP, TEST, ADD, SUB, AND, INC, DEC. The instruction must write flags.

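Second instruction (conditional branch): JA, JAE, JB, JBE, JE, JNE, JG, JGE, JL, JLE, JO, JNO, JS, JNS, JP, JNP. TEST and AND fuse with all of these. CMP, ADD, and SUB do not fuse with the sign, overflow, and parity jumps (JS, JNS, JO, JNO, JP, JNP), and INC and DEC additionally do not fuse with the carry-dependent jumps (JA, JAE, JB, JBE).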

Restrictions on Intel:

The first instruction must not have both an immediate and a displacement (e.g., CMP [mem], imm does not fuse on some microarchitectures).
The first instruction must not use RIP-relative addressing (this restriction was lifted on Skylake and later).
Both instructions must be within the same 16-byte decode window.
The branch must immediately follow the flag-setting instruction with no intervening instructions.

AMD Zen 4 fusion rules are similar but not identical. AMD fuses CMP/TEST with conditional branches, but the handling of ADD/SUB with branches depends on the specific encoding. AMD's documentation in the Software Optimization Guide for Zen 4 lists the supported pairs.

Here is a concrete example showing the difference:

; Fusible pair: produces 1 uop
cmp rax, 42
jne .loop
 
; Non-fusible: AND with a memory destination does not fuse, so the branch stays a separate uop
and [rdi], rax
je .done
 
; Fusible pair: TEST + branch
test ecx, ecx
jz .zero
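
The pairing logic can be sketched as a single pass over adjacent instructions. This is a simplified model under stated assumptions: it checks only adjacency, the first-instruction list, and the memory-destination case from the example, and it ignores decode-window alignment and the other microarchitecture-specific exceptions. The function name and the tuple representation are invented.

```python
FLAG_SETTERS = {"cmp", "test", "add", "sub", "and", "inc", "dec"}
COND_JUMPS = {"ja", "jae", "jb", "jbe", "je", "jne", "jz", "jnz", "jg", "jge",
              "jl", "jle", "jo", "jno", "js", "jns", "jp", "jnp"}


def fuse_pairs(insns):
    """insns: list of (mnemonic, writes_memory). Returns decoded groups."""
    groups, i = [], 0
    while i < len(insns):
        mnemonic, writes_memory = insns[i]
        next_mnemonic = insns[i + 1][0] if i + 1 < len(insns) else None
        if (mnemonic in FLAG_SETTERS and not writes_memory
                and next_mnemonic in COND_JUMPS):
            groups.append(mnemonic + "+" + next_mnemonic)   # one fused micro-op
            i += 2
        else:
            groups.append(mnemonic)
            i += 1
    return groups


code = [("cmp", False), ("jne", False),
        ("and", True), ("je", False),
        ("test", False), ("jz", False)]
print(fuse_pairs(code))
```

Run on the three pairs above, the model fuses the first and third and leaves the AND and its JE as separate groups.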

Why Compilers Care

Compilers arrange code to enable macro-op fusion. GCC and Clang, when targeting modern x86, will place a CMP or TEST immediately before the conditional branch that consumes its flags, even if it means reordering instructions. The compiler will avoid inserting flag-clobbering instructions between the comparison and the branch. This is a codegen optimisation that is invisible in the source code but measurable in the pipeline: a fused CMP+JCC pair consumes one ROB entry instead of two, which effectively extends the instruction window by one slot. On a tight loop that executes billions of iterations, that extra slot translates to measurable throughput improvement.

A fused pair also consumes only one decoder slot instead of two, which means the 5-wide decode stage (on Intel) effectively becomes 6-wide for code with frequent fusible pairs. In branchy code (interpreters, virtual machine dispatch loops, decision trees), macro-op fusion can increase effective decode bandwidth by 10 to 15 percent.

Micro-Op Fusion: Keeping Memory Operations Cheap

Micro-op fusion is a distinct mechanism from macro-op fusion, and it operates at a different stage of the pipeline. Where macro-op fusion combines two x86 instructions into one micro-op, micro-op fusion keeps a single x86 instruction that would normally crack into two micro-ops as a single fused micro-op through rename and dispatch, unfusing it only at execution.

The typical case is a register-memory ALU instruction:

add rax, [rdi + rcx*4 + 8]

This instruction logically performs two operations: a load from memory, and an addition. Without micro-op fusion, the decoder would produce two micro-ops (a load and an add), each consuming a rename slot and a scheduler entry. With micro-op fusion, the decoder produces a single fused micro-op that occupies one rename slot and one ROB entry. The fused micro-op is "unfused" at dispatch time into its two component operations, which go to separate execution ports (a load port and an ALU port). The benefit is that the fused micro-op saves bandwidth in the narrowest stages of the pipeline: decode, rename, and retirement.

Which Operations Fuse

On Intel (Skylake through Raptor Lake), the following combinations support micro-op fusion:

Load + ALU: any register-memory form of ADD, SUB, AND, OR, XOR, CMP, TEST with addressing modes that use at most base+displacement or base+index. Complex addressing modes with base+index+displacement also fuse on Skylake and later (this was restricted on earlier cores like Sandy Bridge, where indexed addressing modes did not fuse).
Store: a store instruction is always represented as two micro-ops internally (store-address and store-data), and these are always kept fused through rename.

Instructions that do NOT micro-fuse on Intel:

Some AVX-512 forms with memory operands. (Three-operand VEX-encoded instructions such as VADDPS with a memory operand do fuse on Skylake and later.)
Instructions with segment override prefixes typically do not fuse.

AMD's Zen architecture handles this differently. Zen uses a "macro-op" internal format that can represent up to two operations (a load or store combined with an ALU operation) natively, so the concept of micro-op fusion is less distinct. AMD's documentation classifies instructions by how many macro-ops they decode into: "fastpath single" (one macro-op), "fastpath double" (two macro-ops), or microcoded. A load-plus-ALU instruction is typically a single macro-op that is split into its component operations later in the pipeline. The effect is similar to Intel's micro-op fusion, but the internal representation and the pipeline stages where fusing/unfusing occurs are different.

The Bandwidth Argument

Why does micro-op fusion matter? The answer is pipeline bandwidth. Consider Golden Cove's pipeline widths:

Decode:    5 instructions/cycle (up to ~6 uops/cycle)
Rename:    6 uops/cycle
Scheduler: 6 dispatches/cycle (to 12 ports)
Retire:    8 uops/cycle

Rename at 6 wide is the bottleneck for sustained throughput. Every micro-op that can be kept fused through rename effectively widens the rename stage for free. If half of the micro-ops in a typical workload are fused load+ALU pairs, the rename stage is handling 6 fused micro-ops per cycle that represent 9 actual operations, a 50% effective bandwidth increase through the narrowest stage.
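
The arithmetic behind that figure is short enough to write out. The function below is a back-of-the-envelope sketch (its name is invented) and assumes every fused micro-op carries exactly two operations and every other micro-op carries one.

```python
def effective_ops_per_cycle(rename_width, fused_fraction):
    """Operations passing through rename per cycle, given the fused share."""
    fused = rename_width * fused_fraction          # slots holding two operations
    unfused = rename_width * (1 - fused_fraction)  # slots holding one operation
    return 2 * fused + unfused


print(effective_ops_per_cycle(6, 0.0))   # no fusion: 6 operations
print(effective_ops_per_cycle(6, 0.5))   # half fused: 3*2 + 3 = 9 operations
```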

In practice, other bottlenecks (cache misses, branch mispredictions) often dominate. But on tight, compute-bound loops with memory operands, micro-op fusion is the difference between saturating the execution engine and leaving ports idle.

Move Elimination: Zero Cost Register Copies

Move elimination is a third optimisation that operates at the rename stage, and it is beautifully simple. When the rename stage encounters a register-to-register MOV instruction:

mov rax, rbx

Instead of allocating a new physical register and dispatching an ALU operation to copy the value, the rename stage points RAX's entry in the RAT to the same physical register that currently holds RBX's value. No data is copied. No execution port is consumed. No latency is added to the dependency chain. The MOV effectively disappears, consuming only a rename slot and a ROB entry.
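
In terms of the alias table, an eliminated move is a single pointer update. The sketch below models the RAT as a dictionary and leaves out what real hardware also has to do, such as tracking how many architectural names share a physical register so it is not freed too early. The function name and the register numbers are illustrative.

```python
def eliminate_mov(rat, dst, src):
    """Handle `mov dst, src` at rename: alias dst to src's physical register."""
    rat[dst] = rat[src]      # no new physical register, no execution micro-op
    return rat[dst]


rat = {"rax": 7, "rbx": 12}
eliminate_mov(rat, "rax", "rbx")
print(rat)   # both names now point at physical register 12
```

Any later micro-op that reads RAX picks up tag 12 directly, so it depends on whatever produced RBX rather than on the MOV.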

Move elimination was introduced in Ivy Bridge (2012) for both integer and vector registers. On Golden Cove, it applies to:

MOV r64, r64 (64-bit register to register)
MOV r32, r32 (32-bit, which zero-extends to 64 bits)
MOVAPS/MOVAPD/MOVDQA xmm, xmm and their YMM/ZMM equivalents

It does not apply to:

MOV r8, r8 or MOV r16, r16 (partial register moves, because these do not overwrite the full register and create merge dependencies)
MOVZX or MOVSX (these require computation)
Any move involving memory

Move elimination is not always successful. It depends on the availability of physical register file entries and the microarchitectural state. Intel's performance counters report eliminated and non-eliminated moves separately (MOVE_ELIMINATION.INT_ELIMINATED and related counters), and in practice elimination rates above 95% are normal for code that uses register moves heavily. The impact is significant for compiler-generated code where register allocation is not perfect, and in calling-convention transitions where arguments are shuffled between registers.

The Micro-Op Cache: Bypassing Decode Entirely

The decode stage is expensive: it consumes significant die area, draws substantial power, and adds several cycles of latency to the pipeline. The micro-op cache exists to bypass it for code that has been decoded recently.

Intel introduced the micro-op cache (called the Decoded Stream Buffer, or DSB) in Sandy Bridge (2011). AMD introduced their version (called the Op Cache) in Zen (2017). Both cache decoded micro-ops so that the next time the same instruction address is fetched, the micro-ops can be read directly, skipping pre-decode and decode entirely.

Intel DSB (Sandy Bridge Through Raptor Lake)

The DSB on Sandy Bridge held 1,536 micro-ops, organised as 32 sets of 8 ways, with each way holding up to 6 micro-ops corresponding to a 32-byte aligned window of instruction bytes. Golden Cove grew this to 4,096 micro-ops and can deliver up to 8 micro-ops per cycle, compared to the 5 instructions per cycle from the legacy decode pipeline.

When the branch prediction unit predicts the next instruction address, it first checks the DSB. If the micro-ops for that address are cached (a "DSB hit"), they are read from the cache and sent directly to the allocation/rename stage, bypassing the L1I cache read, the pre-decode, and the decode stages. If the DSB misses, the instruction fetch and decode pipeline (called the "MITE" path, for Macro Instruction Translation Engine) activates and produces the micro-ops the slow way.

The performance difference between DSB hits and MITE decode is substantial. The DSB can deliver 8 micro-ops per cycle with lower latency (2 to 3 cycles from prediction to rename) compared to MITE's 5 instructions per cycle with higher latency (4 to 5 cycles from prediction to rename). For hot inner loops that fit in the DSB, the decode bottleneck vanishes.

DSB coverage is measurable through performance counters:

IDQ.DSB_UOPS          - micro-ops delivered from the DSB
IDQ.MITE_UOPS         - micro-ops delivered from the legacy decode pipeline
IDQ.MS_UOPS           - micro-ops delivered from the MSROM

On a well-optimised server workload, DSB hit rates above 80% are typical. On workloads with very large code footprints (large C++ applications, JIT-compiled code with many compilation units), DSB hit rates can drop below 50%, and the decode front end becomes the bottleneck.
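
Turning the three counters into a coverage figure is a matter of normalising them. The sketch below assumes the raw counts have already been collected by a profiler; the function name and the sample numbers are made up for illustration.

```python
def frontend_mix(dsb_uops, mite_uops, ms_uops):
    """Fraction of delivered micro-ops that came from each front-end source."""
    total = dsb_uops + mite_uops + ms_uops
    if total == 0:
        return {"dsb": 0.0, "mite": 0.0, "ms": 0.0}
    return {
        "dsb": dsb_uops / total,
        "mite": mite_uops / total,
        "ms": ms_uops / total,
    }


# Hypothetical counts for IDQ.DSB_UOPS, IDQ.MITE_UOPS, IDQ.MS_UOPS.
mix = frontend_mix(800_000, 150_000, 50_000)
print({k: round(v, 3) for k, v in mix.items()})
```

A low DSB share combined with front-end stall cycles is the usual sign that code layout or code footprint, rather than the execution engine, is limiting throughput.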

AMD Op Cache (Zen Through Zen 5)

AMD's Op Cache in Zen 4 holds about 6,900 macro-ops (6.75K; "macro-op" is AMD's term for its internal decoded operations), organised as a larger and more associative structure than Intel's DSB. It can deliver up to 9 macro-ops per cycle. Zen 4's legacy decode path can also emit up to 8 macro-ops per cycle (4 decoders that can each produce up to 2 macro-ops), so the penalty for Op Cache misses is somewhat lower than on Intel.

AMD's Op Cache has grown significantly across Zen generations:

Zen 1:  2,048 entries
Zen 2:  4,096 entries
Zen 3:  4,096 entries
Zen 4:  6,912 entries (6.75K, 9 ops/cycle delivery)
Zen 5:  8,192 entries (reported)

Both vendors are investing heavily in the micro-op cache because it is the most effective way to amortise the cost of x86 decode.

Loop Stream Detector

Some Intel microarchitectures (Sandy Bridge onward) include a Loop Stream Detector (LSD) that can replay micro-ops for small loops (up to 64 micro-ops on Skylake) without fetching from either the DSB or the MITE. The LSD is a tiny loop buffer in front of the allocation stage. Intel disabled the LSD via microcode update on Skylake and its direct derivatives because of an erratum, and on those parts it has not been re-enabled; later cores are reported to ship with it working again. Where the LSD is off, the DSB serves a similar purpose for small loops, though with higher power consumption than the LSD would have had.

The Stack Engine: Hiding PUSH And POP

The x86 calling convention makes heavy use of PUSH and POP for stack management. A typical function prologue and epilogue looks like this:

push rbp
mov  rbp, rsp
sub  rsp, 0x40
; ... function body ...
add  rsp, 0x40
pop  rbp
ret

Each PUSH and POP implicitly modifies the RSP register: PUSH decrements it by 8, POP increments it by 8. If these RSP updates went through the full OOO pipeline (rename, schedule, execute), they would create a long serial dependency chain on RSP, limiting instruction-level parallelism. Every PUSH depends on the previous PUSH's RSP update, and every POP depends on the previous POP's RSP update.

The stack engine solves this by tracking RSP updates outside the OOO engine, using a dedicated hardware counter (the "stack engine delta" or "RSP offset"). When the decoder sees a PUSH, POP, CALL, or RET, it adjusts the stack engine's offset counter without generating a micro-op for the RSP update. The store (for PUSH) or load (for POP) uses the stack engine's predicted RSP value for address generation, which is available immediately without waiting for any prior micro-op to execute.

The stack engine stays out of the OOO engine's way as long as RSP is modified only implicitly, through PUSH, POP, CALL, and RET. When an instruction reads or writes RSP explicitly (MOV RAX, RSP, LEA RBX, [RSP+8], or the SUB/ADD RSP, imm in the prologue above), the stack engine first inserts a synchronisation micro-op that folds the accumulated delta into the RSP value tracked by the OOO engine.

On Agner Fog's instruction tables, PUSH and POP on Skylake are listed as 1 micro-op each (the store or load), with a note that the RSP update is handled by the stack engine. Without the stack engine, each would be 2 micro-ops (the memory operation plus an ALU operation for the RSP update), which would halve the effective PUSH/POP throughput and create dependency chains that serialise stack operations.
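
The bookkeeping described above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class name, the uop strings, and the `uses_rsp_explicitly` flag are hypothetical, CALL and RET are omitted (they would adjust the delta the same way), and real hardware limits the delta to a small counter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StackEngine:
    """Toy model of a decode-time RSP offset tracker (names are hypothetical)."""
    delta: int = 0                                   # bytes not yet folded into RSP
    uops: List[str] = field(default_factory=list)    # uops sent to the OOO engine

    def decode(self, instr: str, uses_rsp_explicitly: bool = False) -> None:
        if instr == "push":
            # RSP update is absorbed into the delta; only the store is emitted.
            self.delta -= 8
            self.uops.append(f"store [rsp{self.delta:+d}]")
        elif instr == "pop":
            # The load uses the current predicted RSP, then the delta moves up.
            self.uops.append(f"load [rsp{self.delta:+d}]")
            self.delta += 8
        else:
            if uses_rsp_explicitly and self.delta != 0:
                # Explicit RSP use: fold the pending delta back with a sync uop.
                self.uops.append(f"sync rsp {self.delta:+d}")
                self.delta = 0
            self.uops.append(instr)


engine = StackEngine()
for instr in ("push", "push", "push"):
    engine.decode(instr)
engine.decode("mov rax, rsp", uses_rsp_explicitly=True)
print(engine.uops)
```

Three pushes cost three store micro-ops and no ALU micro-ops; the single synchronisation micro-op appears only when RSP is read explicitly.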

The stack engine exists solely because of the x86 calling convention's reliance on an implicit stack pointer. AArch64's STP/LDP instructions with pre/post-index addressing do not require a separate engine because ARM's load/store architecture keeps the pointer update explicit and schedulable.

The Decode Bottleneck: Why Width Is Expensive

If cracking instructions into micro-ops is so effective, why not just make the decode stage wider? Why did Intel stay at 5-wide legacy decode through Golden Cove, and AMD at 4-wide through Zen 4?

The answer is a combination of timing, power, and complexity constraints.

Timing

The hardest part of x86 decode is instruction length determination. Because instructions are variable-length and the length depends on the content of the instruction (specifically, the opcode and the presence of ModR/M and SIB bytes), the decoder must serially scan the byte stream to find instruction boundaries. A 4-wide decoder needs to find four boundaries per cycle, which means four serial length computations within a single clock period.

In practice, the pre-decoder uses speculative length computation: it starts scanning from multiple possible offsets in parallel and resolves which starting points are correct after the first instruction's length is determined. This speculation hardware grows superlinearly with decode width. Going from 4-wide to 6-wide decode would roughly double the pre-decoder's area and add significant latency.
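
A minimal sketch of that two-phase idea follows. It deliberately uses a made-up encoding (the low four bits of an instruction's first byte give its length) instead of real x86 length rules, and the function names are hypothetical; the point is only the split between a parallel speculation phase and a serial selection phase.

```python
from typing import List


def toy_length(window: bytes, offset: int) -> int:
    """Hypothetical encoding: low 4 bits of the first byte give the length."""
    return max(1, window[offset] & 0x0F)


def find_boundaries(window: bytes, start: int = 0) -> List[int]:
    # Phase 1 (parallel in hardware): speculate a length at EVERY byte offset,
    # even though most offsets are not real instruction starts.
    spec_len = [toy_length(window, i) for i in range(len(window))]

    # Phase 2 (serial select): walk from the known start and keep only the
    # offsets that turn out to be real boundaries.
    boundaries = []
    pos = start
    while pos < len(window) and pos + spec_len[pos] <= len(window):
        boundaries.append(pos)
        pos += spec_len[pos]
    return boundaries


window = bytes([0x03, 0xFF, 0xFF, 0x01, 0x02, 0x00, 0x04, 0x00, 0x00, 0x00])
print(find_boundaries(window))
```

Phase 1 does work at every byte, most of which is thrown away, and phase 2 is still a serial chain. That is one way to see why the hardware cost grows faster than the decode width.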

Power

The decode stage on a modern Intel core consumes an estimated 10 to 15 percent of the core's total power budget. The pre-decode, decode, and instruction queue together include hundreds of thousands of gates doing combinational logic every cycle. Widening the decode stage by 50% would increase this power by a similar factor, eating into the power budget available for execution.

This is precisely why the micro-op cache is so valuable. When the DSB is servicing the front end, the entire legacy decode pipeline (MITE) can be power-gated, saving the 10-15% of core power that decode would have consumed. For workloads with high DSB hit rates, the x86 decode tax is paid once (the first time the code is encountered) and amortised to near zero for subsequent iterations.

The Effective Width Argument

The combination of the micro-op cache, macro-op fusion, and micro-op fusion means that the effective front-end width is much larger than the raw decode width would suggest.

Consider a hot loop on Golden Cove:

The DSB delivers 8 micro-ops per cycle (wider than the 5-wide MITE decode).
Macro-op fusion converts some instruction pairs into single micro-ops, increasing the effective instruction throughput.
Micro-op fusion keeps load+ALU pairs as single entries through rename, increasing the effective operation throughput.

The net result is that the OOO engine can often sustain 6 or more operations per cycle even though the raw decode width is only 5 instructions per cycle. The decode stage is only the bottleneck when the DSB misses, which is why Intel (and AMD) have been growing the micro-op cache aggressively with each generation rather than widening the decode stage.
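
As a rough back-of-envelope model (an assumption of this sketch, not a measured figure), peak front-end delivery can be blended from the two paths. If a fraction of micro-ops comes from the DSB at 8 per cycle and the rest from the MITE at up to 6 per cycle, the average cycles per micro-op add up, so the blend is harmonic. Switch penalties between the paths are ignored, and the function name is hypothetical.

```python
def blended_uops_per_cycle(dsb_hit: float,
                           dsb_width: float = 8.0,
                           mite_width: float = 6.0) -> float:
    """Peak uop delivery when a fraction dsb_hit of uops comes from the DSB."""
    if not 0.0 <= dsb_hit <= 1.0:
        raise ValueError("dsb_hit must be between 0 and 1")
    # Each path contributes its share of uops at its own cycles-per-uop cost.
    cycles_per_uop = dsb_hit / dsb_width + (1.0 - dsb_hit) / mite_width
    return 1.0 / cycles_per_uop


for hit in (0.5, 0.8, 0.95):
    print(f"DSB hit {hit:.0%}: {blended_uops_per_cycle(hit):.2f} uops/cycle peak")
```

These are upper bounds. With the article's Golden Cove figures the 6-wide rename stage caps what the back end can accept, and real MITE throughput is usually well below its peak.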

AMD vs Intel: Different Paths To The Same Goal

Intel and AMD both crack x86 instructions into internal operations and both apply fusion, but the implementation details differ in ways that affect real-world performance.

Decode Organisation

Intel (Golden Cove): 5 decoders (1 complex + 4 simple), producing up to 6 micro-ops per cycle. The complex decoder can handle instructions that produce up to 4 micro-ops.

AMD (Zen 4): 4 decoders, each capable of handling instructions that produce up to 2 macro-ops, for a peak of 8 macro-ops per cycle. AMD's approach trades decoder count for per-decoder capability.

AMD (Zen 5): 8 decoders reported, with 2 complex and 6 simple, a significant widening. This puts Zen 5 at 8 instructions per cycle decode throughput, exceeding Intel's current designs.

Internal Operation Format

Intel uses "micro-ops" (uops), which are strictly simple operations (one ALU operation, one load, or one store). A complex x86 instruction is cracked into multiple uops.

AMD uses "macro-ops" (mops), which can represent slightly more complex operations. Some AMD macro-ops encode a load+ALU pair as a single macro-op natively, without a separate fusion step. This means AMD's internal format is somewhat wider per entry but requires fewer entries for the same code.

Fusion Rules

Macro-op fusion (CMP/TEST + branch): both Intel and AMD support this. AMD added support for more flag-setting instructions (ADD, SUB, INC, DEC, AND, OR, XOR fused with conditional branches) starting with Zen 2, while Intel has fused CMP/TEST with branches since Core 2 and widened the set of fusible instructions in later generations.

Micro-op fusion: Intel-specific in the strict sense. AMD's wider macro-op format achieves a similar effect natively.

Move elimination: both support it for 32-bit and 64-bit integer moves and for vector register moves. Intel's implementation has been more consistent across generations; AMD's Zen 1 had limited move elimination that was expanded in Zen 2 and later.

Micro-Op Cache

                    Intel Golden Cove    AMD Zen 4
Op cache entries:   ~4,096 uops          6,144 mops
Delivery rate:      8 uops/cycle         9 mops/cycle
Organisation:       32B-aligned windows  64B-aligned windows (2x 32B)

AMD's larger op cache and higher delivery rate give it an advantage on workloads with large instruction footprints. Intel's DSB is smaller but has been sufficient for many workloads due to efficient encoding of micro-ops in the cache.

MSROM Differences

Both vendors use microcode for complex instructions, but the implementations can differ significantly. For example:

                    Intel (Skylake)      AMD (Zen 4)
DIV r64:            ~36 uops             ~15 mops
IDIV r64:           ~36 uops             ~15 mops
REP MOVSB (small):  ~30+ uops startup    ~20+ mops startup
CPUID:              ~100+ uops           ~40+ mops

AMD's more efficient microcode for division and string operations is a real advantage in code that uses these instructions frequently. The difference is partly because AMD's macro-op format can encode more work per entry, and partly because AMD has optimised the microcode more aggressively for common cases.

Comparison With ARM: The CISC Tax In Perspective

ARM's AArch64 instruction set was designed in 2011 with the benefit of three decades of hindsight about what makes decode easy. Every AArch64 instruction is exactly 4 bytes. The opcode is always in the same bit positions. Register specifiers are always in the same bit positions. There are no prefixes, no variable lengths, no ModR/M ambiguities.

This means an AArch64 decoder can determine instruction boundaries trivially (every 4 bytes), identify the instruction type within a few gate delays, and extract register specifiers in parallel with the decode. Apple's Firestorm core (in the M1, 2020) decodes 8 instructions per cycle. ARM's Cortex-X4 decodes 10 instructions per cycle. These widths would be prohibitively expensive for x86 decode, but they are tractable for fixed-length instructions.

Does this mean AArch64 avoids micro-ops entirely? No. Some AArch64 instructions still crack into multiple internal operations:

LDP X0, X1, [SP], #16 (load pair with post-index writeback): this loads two registers and updates the stack pointer, which typically cracks into 2 or 3 micro-ops internally.
STP X29, X30, [SP, #-16]! (store pair with pre-index writeback): similar, 2 to 3 micro-ops.
LDAXR / STLXR (load-acquire-exclusive / store-release-exclusive): these memory ordering instructions can produce multiple micro-ops depending on the implementation.

But the vast majority of AArch64 instructions (over 90%, by dynamic frequency in typical code) decode 1:1 into internal operations. This is in contrast to x86, where Agner Fog's measurements show that roughly 20 to 30 percent of micro-ops in typical code come from multi-uop instructions (before fusion).

The measurable cost of x86 decode is visible in several ways:

Die area: Intel's decode block is estimated at 3 to 5% of the core die area, versus less than 1% for ARM's decode. That silicon could have been another scheduler, a larger cache, or additional execution units.
Power: x86 decode consumes an estimated 10 to 15% of core dynamic power. ARM's decode consumes roughly 3 to 5%. Over a billion mobile devices, this is a significant energy difference.
Front-end latency: the x86 MITE path adds 4 to 5 cycles of pipeline depth compared to ARM's 2 to 3 cycles for decode. Longer pipelines mean higher branch misprediction penalties.
Decode width: the widest shipping x86 decoder (AMD Zen 5 at 8-wide) matches ARM cores from 2020. ARM's latest cores are at 10-wide and planning for wider.

These costs are real but not dominant. On sustained compute workloads where the micro-op cache is effective, x86 cores match ARM cores at similar process nodes and similar power levels. The CISC tax is paid mostly when code is first decoded; for long-running computation that keeps hitting the micro-op cache, it is amortised to near zero. For workloads that are front-end bound (interpreters, branch-heavy code, large code footprints that exceed the micro-op cache), the tax is measurable and sometimes significant.

The CISC Tax: Real Numbers

We can quantify the x86 decode overhead using hardware performance counters. Here are representative measurements from a Skylake-era Xeon running SPEC CPU2017 integer benchmarks, collected with perf:

Benchmark         DSB hit%   MITE%   MS%    IPC    Front-end bound%
500.perlbench     62%        35%     3%     1.8    28%
502.gcc           55%        41%     4%     1.5    35%
505.mcf           78%        21%     1%     0.6    8%
520.omnetpp       48%        49%     3%     0.9    42%
523.xalancbmk     51%        46%     3%     1.1    38%
531.deepsjeng     92%        7%      1%     2.1    12%
541.leela         89%        10%     1%     2.3    14%
557.xz            85%        14%     1%     1.9    18%

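The pattern is clear. Benchmarks with compact, loopy code (deepsjeng, leela, xz) have high DSB hit rates, high IPC, and low front-end boundedness. Benchmarks with large, branchy code (gcc, omnetpp, xalancbmk) have low DSB hit rates, low IPC, and high front-end boundedness. The one outlier is mcf: it is only 8% front-end bound, so its low IPC of 0.6 comes from elsewhere in the pipeline (mcf is widely known as a memory-bound, pointer-chasing benchmark). The decode stage is the bottleneck when the micro-op cache cannot cover the working set.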
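
To put a number on that pattern, the sketch below computes the Pearson correlation between the DSB hit rate and the front-end bound percentage, using only the eight rows of the table above. The helper name `pearson` is this sketch's own, and eight data points are far too few for anything beyond a sanity check.

```python
import math

# (DSB hit %, front-end bound %) copied from the table above.
rows = {
    "500.perlbench": (62, 28),
    "502.gcc":       (55, 35),
    "505.mcf":       (78, 8),
    "520.omnetpp":   (48, 42),
    "523.xalancbmk": (51, 38),
    "531.deepsjeng": (92, 12),
    "541.leela":     (89, 14),
    "557.xz":        (85, 18),
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


dsb_hit = [v[0] for v in rows.values()]
fe_bound = [v[1] for v in rows.values()]
r = pearson(dsb_hit, fe_bound)
print(f"correlation(DSB hit %, front-end bound %) = {r:.2f}")
```

On these rows the coefficient is strongly negative (about -0.93): the lower the DSB hit rate, the more front-end bound the benchmark.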

The "front-end bound" percentage comes from Intel's Top-Down Microarchitecture Analysis (TMA) methodology, which decomposes pipeline slots into four categories: front-end bound, back-end bound (memory or core), retiring, and bad speculation. A front-end bound percentage above 20% indicates that the core is spending significant time waiting for the front end (instruction fetch, the micro-op cache, or the legacy decoders) to deliver micro-ops.

For comparison, ARM cores running equivalent workloads typically show front-end bound percentages of 5 to 15%, because the simpler decode path has higher throughput and lower latency. The delta, roughly 10 to 20 percentage points of front-end boundedness, is a reasonable estimate of the CISC tax for decode-sensitive workloads.

Historical Progression: From P6 To Lion Cove

The cracking and fusing pipeline has evolved substantially across x86 generations. Here is a summary of the key changes at each major microarchitecture:

P6 (Pentium Pro, 1995)

First x86 OOO design with micro-op decomposition
3 decoders (1 complex, 2 simple)
40-entry ROB
No micro-op cache, no fusion
Proved that cracking CISC into micro-ops was viable

Pentium M / Banias (2003)

Refined P6 with better branch prediction (indirect branch predictor)
Micro-op fusion introduced: load+ALU kept as single uop through rename
Still 3 decoders, but the pipeline was shortened to reduce power
First mobile-optimised x86 core, predecessor to Core

Core 2 / Merom (2006)

4 decoders (1 complex, 3 simple) for the first time
Macro-op fusion introduced: CMP/TEST + branch fused to single uop
96-entry ROB
128-bit SSE execution (one 128-bit operation per cycle)
Macro-op fusion initially limited to CMP/TEST with select branches; fusion with ADD/SUB and other flag-setting instructions arrived later, with Sandy Bridge

Sandy Bridge (2011)

Micro-op cache (DSB) introduced: 1,536 entries, 4 uops/cycle delivery
Move elimination for integer and vector registers followed one generation later, in Ivy Bridge (2012)
Physical register file replaced ROB-based register storage
168-entry ROB
Macro-op fusion expanded to more instruction pairs
AVX (256-bit) introduced with native 256-bit floating-point execution, so 256-bit operations did not need cracking into two 128-bit micro-ops

Haswell (2013)

192-entry ROB
AVX2 with full 256-bit integer execution
FMA (fused multiply-add) as a single micro-op
Improved DSB delivery bandwidth

Skylake (2015)

DSB stayed at roughly 1,536 entries (reports vary), with delivery raised to 6 uops/cycle
Legacy decode path at 4 simple + 1 complex = 5 decoders
Micro-op fusion for indexed addressing modes restored (Sandy Bridge had dropped it)
224-entry ROB
This is the baseline for many "modern Intel" performance discussions

Golden Cove (Alder Lake / Sapphire Rapids, 2021)

512-entry ROB (a dramatic jump from Skylake's 224)
DSB grew to ~4,096 entries with 8 uops/cycle delivery
6-wide rename/allocate (up from 5 on Skylake)
12 execution ports (up from 8 on Skylake)
Improved macro-op fusion with wider coverage
First Intel hybrid design (paired with Gracemont E-cores, which have their own decode pipeline built from two 3-wide decode clusters)

Lion Cove (Lunar Lake, 2024)

Further ROB growth (estimated 576 entries)
DSB improvements for better coverage
Improved branch prediction to reduce front-end restarts
Continued trend of growing the OOO window to hide memory latency, enabled by the fixed-format micro-op design

AMD Zen Evolution

AMD's parallel evolution is worth tracking:

Zen 1 (2017):  4 decoders, 2,048 op cache, 192-entry ROB
Zen 2 (2019):  4 decoders, 4,096 op cache, 224-entry ROB, improved fusion
Zen 3 (2020):  4 decoders, 4,096 op cache, 256-entry ROB, unified L3 per CCX
Zen 4 (2022):  4 decoders, 6,144 op cache, 320-entry ROB, AVX-512
Zen 5 (2024):  8 decoders (2 complex, 6 simple), 8,192 op cache, 448-entry ROB

Zen 5's jump to 8-wide decode is the most aggressive widening of x86 decode in the architecture's history, reflecting AMD's assessment that silicon spent on wider decode pays for itself in workloads with large instruction footprints.

Instruction Examples: From x86 Bytes To Micro-Ops

To make the decode process concrete, here are detailed examples of how specific x86 instructions decompose on a Skylake-class core, based on Agner Fog's instruction tables and Intel's optimisation manual.

Simple Register ALU

add rax, rbx
; Encoding: 48 01 D8 (3 bytes: REX.W + opcode + ModR/M)
; Decodes to: 1 uop
;   uop: ADD.q  pdst=RAX, psrc1=RAX, psrc2=RBX, flags=OSZAPC (write all)
; Ports: 0, 1, 5, 6 (any integer ALU)
; Latency: 1 cycle
; Throughput: 4 per cycle

Load From Memory

mov rcx, [rdi + 8]
; Encoding: 48 8B 4F 08 (4 bytes: REX.W + opcode + ModR/M + disp8)
; Decodes to: 1 uop
;   uop: LOAD.q  pdst=RCX, base=RDI, disp=8
; Ports: 2, 3 (load unit)
; Latency: 4-5 cycles (L1 hit; 4 with simple base+displacement addressing)
; Throughput: 2 per cycle

Register-Memory ALU (Micro-Op Fusion)

add rax, [rdi + rcx*4 + 16]
; Encoding: 48 03 44 8F 10 (5 bytes: REX.W + opcode + ModR/M + SIB + disp8)
; Decodes to: 1 fused uop (load + add)
;   In rename/ROB: 1 entry (fused)
;   At dispatch: unfuses into 2 uops:
;     uop 1: LOAD.q  tmp, base=RDI, index=RCX, scale=4, disp=16  -> port 2 or 3
;     uop 2: ADD.q   pdst=RAX, psrc1=RAX, psrc2=tmp              -> port 0, 1, 5, or 6
; Latency: 6 cycles (5 load + 1 add)
; Throughput: 2 per cycle (limited by the two load ports)
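
The ModR/M and SIB bytes in the three encodings above can be checked by hand. The sketch below splits each byte into its standard bit fields; the function names and the `REG64` table are this sketch's own, and the register names apply only when the REX.R, REX.X, and REX.B extension bits are clear, as they are with the 0x48 prefix used here.

```python
# Register numbers 0-7 in 64-bit mode when no REX extension bit is set.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_modrm(byte: int):
    """Split a ModR/M byte into (mod, reg, rm): bits 7-6, 5-3, 2-0."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def decode_sib(byte: int):
    """Split a SIB byte into (scale factor, index, base)."""
    return 1 << ((byte >> 6) & 0b11), (byte >> 3) & 0b111, byte & 0b111


# add rax, rbx -> 48 01 D8: mod=3 (register direct), reg=rbx, rm=rax
mod, reg, rm = decode_modrm(0xD8)
print(mod, REG64[reg], REG64[rm])

# add rax, [rdi + rcx*4 + 16] -> 48 03 44 8F 10
mod, reg, rm = decode_modrm(0x44)      # mod=1 (disp8), rm=4 means "SIB follows"
scale, index, base = decode_sib(0x8F)
print(mod, REG64[reg], f"[{REG64[base]} + {REG64[index]}*{scale} + disp8]")
```

Note that rm=4 in the second example does not name RSP; with mod below 3 it signals that a SIB byte follows. This is one of the content-dependent rules that makes length determination serial.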

CMP + Branch (Macro-Op Fusion)

cmp rax, rbx
je  .target
; Encoding: 48 39 D8 (CMP, 3 bytes) + 0F 84 xx xx xx xx (JE near, 6 bytes)
; Without fusion: 2 uops (CMP + JE)
; With macro-op fusion: 1 uop
;   uop: CMP_JE.q  psrc1=RAX, psrc2=RBX, target=.target
; Ports: 0 or 6 (branch unit)
; Latency: 1 cycle (predicted branch)
; Throughput: 2 per cycle (one on port 0, one on port 6)
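
The pairing rule in this example can be sketched as a fused-domain micro-op counter. This is a deliberate simplification: it assumes every instruction is otherwise one micro-op, fuses only CMP or TEST immediately followed by one of the listed conditional jumps, and ignores the extra restrictions real cores apply (operand forms, cache-line boundaries, and per-generation condition-code rules). All names are hypothetical.

```python
FUSIBLE_FIRST = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jz", "jnz", "jl", "jle", "jg", "jge",
              "jb", "jbe", "ja", "jae"}


def fused_uop_count(mnemonics):
    """Count fused-domain uops, fusing CMP/TEST with an adjacent Jcc."""
    count = 0
    i = 0
    while i < len(mnemonics):
        is_pair = (mnemonics[i] in FUSIBLE_FIRST
                   and i + 1 < len(mnemonics)
                   and mnemonics[i + 1] in COND_JUMPS)
        i += 2 if is_pair else 1   # a fused pair consumes both instructions
        count += 1                 # ...but occupies a single uop slot
    return count


loop_body = ["mov", "add", "cmp", "jne"]
print(fused_uop_count(loop_body), "uops for", len(loop_body), "instructions")
```

The fusion only applies when the two instructions are adjacent, which is why compilers try to keep the compare directly in front of its branch.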

PUSH (Stack Engine)

push rbx
; Encoding: 53 (1 byte)
; Decodes to: 1 uop
;   uop: STORE.q [RSP-8], RBX  (RSP update handled by stack engine, not a uop)
;   Stack engine adjusts its internal offset by -8
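; Ports: 2, 3, or 7 (store address) + 4 (store data)
; Latency: N/A (store, no data result)
; Throughput: 1 per cycle on Skylake (single store-data port); Ice Lake and later add a second store pipe (ports 8 and 9) and can sustain 2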

REP MOVSB (Microcoded)

rep movsb
; Encoding: F3 A4 (2 bytes)
; Decodes to: MSROM sequence, variable length
;   For RCX = 64:
;     ~30+ uops of setup (alignment checks, choosing copy strategy)
;     Then a series of wide load/store uops using 16B or 32B moves internally
;     Total: ~40-60 uops depending on alignment
;   For RCX = 4096:
;     Similar setup, then bulk copy using enhanced REP MOVSB optimisation
;     Total: ~100-200 uops, mostly load/store pairs
; During MSROM execution: legacy decoders are stalled
; Throughput: depends on copy size and alignment; roughly 32-64 bytes/cycle
;   for large aligned copies on Ice Lake and later

DIV (Microcoded, Variable Latency)

div rcx
; Encoding: 48 F7 F1 (3 bytes: REX.W + opcode + ModR/M)
; Decodes to: MSROM sequence
;   Intel Skylake: ~36 uops, latency 35-90 cycles depending on operand values
;   AMD Zen 4:    ~15 mops, latency 8-41 cycles
; Stalls the legacy decode pipeline during MSROM execution
; Uses the integer divider unit (port 0 on Intel, dedicated unit on AMD)

Where The Costs Still Bite

Despite three decades of optimisation, the x86 cracking and fusing pipeline has costs that are irreducible without changing the ISA.

The first execution of any code path is always decoded through the slow path. The micro-op cache only helps on the second and subsequent visits to the same instruction address. JIT compilers, interpreters, and self-modifying code see the full decode penalty on every new compilation or modification.

Large code footprints exceed the micro-op cache. A database query engine with hundreds of thousands of functions, or a web browser rendering engine with millions of lines of compiled C++, will have a code footprint that vastly exceeds the 4,096 or 6,144 entry micro-op cache. On these workloads, the MITE decode path is active for a large fraction of execution time, and the decode bottleneck is real.
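
Both effects (the cold first pass and the capacity cliff) show up even in a very small model. The sketch below treats the micro-op cache as a fully associative LRU cache of code windows and replays a loop through it. That is an assumption made for brevity: real micro-op caches are set-associative with their own replacement policies, and the capacity of 64 windows is arbitrary.

```python
from collections import OrderedDict


def hit_rate(loop_windows: int, capacity: int, iterations: int) -> float:
    """Hit rate of a toy LRU cache for a loop touching loop_windows windows."""
    cache = OrderedDict()
    hits = total = 0
    for _ in range(iterations):
        for window in range(loop_windows):
            total += 1
            if window in cache:
                hits += 1
                cache.move_to_end(window)      # mark as most recently used
            else:
                cache[window] = True           # miss: decode via the slow path
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / total


print(hit_rate(loop_windows=10, capacity=64, iterations=100))   # fits
print(hit_rate(loop_windows=65, capacity=64, iterations=100))   # one too many
```

A loop that fits misses only on its first pass. A loop that is one window larger than the cache thrashes under LRU and never hits, which is the extreme version of what large code footprints do to a real micro-op cache.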

Branch mispredictions flush the micro-op cache pipeline. After a misprediction, the front end must restart fetch from the correct path, and if the correct path is in the DSB, the restart penalty is smaller (2 to 3 cycles) than if the correct path must be decoded from scratch (5 to 7 cycles). But mispredictions are common (1 to 5% of branches in typical code), so the restart path matters.

Microcode assists are invisible performance killers. When the processor encounters an unusual condition (denormal float, page fault during a page walk, or a cache line split on an atomic operation), the MSROM generates a sequence of assist micro-ops. These are not visible in the instruction stream and are not cached in the DSB. They stall the front end for tens to hundreds of cycles and are difficult to debug because they do not correspond to any visible instruction.

The power cost of the decode stage is always present. Even when the micro-op cache is delivering micro-ops, the DSB itself consumes power (it is a large SRAM structure that is accessed every cycle). The MITE path can be power-gated during DSB hits, which helps, but the total front-end power (branch prediction, instruction cache, DSB, allocation queue) remains a significant fraction of core power.

These costs are the price of compatibility. The x86 ISA runs trillions of euros worth of existing software, and the decode overhead is the cost of running it without recompilation. Every time Intel or AMD ships a new microarchitecture, they can improve the cracking and fusing machinery without changing a single line of existing code.

Conclusion: Complexity As A Strategy

The x86 crack-and-fuse pipeline is one of the most complex pieces of digital logic ever built in mass production. It exists because of a specific historical accident: the 1978 8086 instruction encoding was designed for a world where memory was expensive and transistors were scarce, and the nearly five decades of backward compatibility since then have required every new generation of silicon to interpret those encodings faster and more efficiently.

The cracking step (decomposing CISC instructions into fixed-width micro-ops) made out-of-order execution possible on x86. The micro-op cache made decode overhead amortisable. Macro-op fusion reclaimed front-end bandwidth lost to the two-instruction compare-and-branch pattern. Micro-op fusion reclaimed rename bandwidth lost to memory-ALU instruction pairs. Move elimination reclaimed execution bandwidth lost to register shuffling. The stack engine reclaimed execution bandwidth lost to implicit stack pointer updates.

Each of these mechanisms was an engineering response to a specific bottleneck created by the previous mechanism. The cracking step created a decode bottleneck; the micro-op cache solved it. The cracking step inflated the micro-op count; fusion reduced it. The result is a pipeline that spends enormous complexity breaking things apart and then putting some of them back together, and the whole process is invisible to the programmer.

From the hardware engineer's view, it is an extraordinary achievement: a machine that runs a 1978 instruction set at 6 GHz with 6-wide issue and 500+ instructions in flight. From the ISA designer's view, it is a cautionary tale about the long-term cost of encoding decisions made under constraints that no longer apply. The crack-and-fuse pipeline is not going away. As long as x86 software exists, the silicon that interprets it will keep getting more sophisticated. The 8086's variable-length encoding is now, and will remain, the most expensive compatibility commitment in the history of computing.