How Video Compression Actually Works: H.264 Frame by Frame

Open any video file on your computer and check its bitrate. A typical 1080p video on YouTube runs at 4 to 8 megabits per second. Now compute what the raw, uncompressed data rate would be for that same video. A 1080p frame has 1920 x 1080 pixels. Each pixel has three colour channels (red, green, blue), each stored as 8 bits. At 30 frames per second:

1920 × 1080 × 3 bytes × 30 fps = 186,624,000 bytes/sec ≈ 186 MB/s ≈ 1.49 Gbit/s

That is 670 GB per hour. A two-hour film would be 1.34 TB uncompressed. Your internet connection cannot deliver that. Your hard drive cannot store much of it. And the broadcast spectrum that carries television across Europe, carefully allocated by ETSI and national regulators, would be instantly overwhelmed.

The difference between 1.49 Gbit/s and 5 Mbit/s is a compression ratio of roughly 300:1. Some streaming services push it past 1000:1 for lower-quality tiers. The codec responsible for most of this on planet Earth, by deployment count, is H.264/AVC. It was ratified in 2003 by the ITU-T and ISO/IEC, and it still encodes the majority of video on the internet, in surveillance cameras, in videoconferencing, in television broadcast, and in Blu-ray discs. Newer codecs like HEVC, AV1, and VVC improve on it, but H.264 remains the baseline that everyone supports.
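The arithmetic above is easy to reproduce. The following Python sketch recomputes the raw data rate and the compression ratio. The function name is illustrative, and the 5 Mbit/s target is simply the mid-range streaming figure quoted above.

```python
def raw_video_rate(width, height, fps, bytes_per_pixel=3):
    """Return the uncompressed data rate in bytes per second."""
    return width * height * bytes_per_pixel * fps

raw_bytes_per_sec = raw_video_rate(1920, 1080, 30)    # 8-bit RGB at 30 fps
raw_bits_per_sec = raw_bytes_per_sec * 8
gb_per_hour = raw_bytes_per_sec * 3600 / 1e9
compression_ratio = raw_bits_per_sec / 5_000_000       # versus a 5 Mbit/s stream

print(f"{raw_bytes_per_sec:,} bytes/s")
print(f"{raw_bits_per_sec / 1e9:.2f} Gbit/s")
print(f"{gb_per_hour:.0f} GB per hour")
print(f"about {compression_ratio:.0f}:1")
```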

This post explains how H.264 compresses video, from the high-level frame structure down to the integer DCT transform, quantization tables, and entropy coding. If you have ever wondered what "CRF 23" means in FFmpeg, or why seeking to a random point in a video is not instant, or why video looks blocky on a bad connection, this is the post for you.

1. Frame Types: I, P, and B

Video is a sequence of frames, but H.264 does not treat them equally. Each frame is classified as one of three types, and the type determines how the encoder compresses it.

I-frames (Intra-coded frames) are self-contained. Every macroblock in the frame is encoded using only information from within the frame itself. No reference to any other frame is needed. An I-frame is a compressed still image, similar to a JPEG. I-frames are the largest frames in a compressed stream, typically 5 to 10 times bigger than P-frames, because they cannot exploit temporal redundancy (the fact that successive frames look nearly identical).

P-frames (Predicted frames) are encoded as differences from one or more previous reference frames. The encoder looks at the previously decoded frame (or multiple reference frames in H.264's multi-reference prediction) and says, "this 16x16 block is almost the same as the block at position (x+3, y-1) in the previous frame, except for these small residual differences." P-frames are much smaller than I-frames because most of the image data is already present in the reference.

B-frames (Bidirectional predicted frames) can reference both past and future frames. A B-frame might encode a block by saying, "take 60% of the block at this position in the previous frame and 40% of the block at this position in the next frame, then add this residual." B-frames achieve the best compression because they have two sources of prediction, but they introduce complexity: the encoder must process frames out of order (the future reference frame must be encoded before the B-frame that references it), and the decoder must buffer frames and reorder them for display.

GOP Structure

Frames are organized into Groups of Pictures (GOPs). A typical GOP starts with an I-frame and contains a repeating pattern of P and B frames. Common GOP patterns include:

I B B P B B P B B P B B P B B I B B P ...

This is sometimes written as "GOP length 15" with "B-frame count 2." The GOP length is the distance between I-frames. Shorter GOPs mean more I-frames, which means larger files but faster seeking and better error resilience. Longer GOPs mean better compression but slower random access.

Here is why seeking matters. When a video player jumps to a specific timestamp, it must find the nearest I-frame at or before that timestamp, decode it fully, then decode every P and B frame between that I-frame and the target frame. If the GOP length is 250 frames (about 8 seconds at 30 fps), seeking to a random point might require decoding up to 249 frames before the player can display the one you want. This is why some encoders insert I-frames at scene changes or at regular intervals, and why live streaming often uses shorter GOPs (1 to 2 seconds) at the expense of bitrate.

The display order and decode order of frames differ when B-frames are used. Consider this display order:

Display order:  I0  B1  B2  P3  B4  B5  P6
Decode order:   I0  P3  B1  B2  P6  B4  B5

The encoder must output P3 before B1 and B2 because those B-frames reference P3 as their forward reference. The decoder reorders frames after decoding to restore the correct display order. This reordering adds latency, which is why low-latency applications like video calls typically disable B-frames entirely.
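The reordering can be sketched in a few lines of Python. This is a simplified model, assuming every B-frame references only the nearest I or P frame on each side; real encoders allow more elaborate reference structures, and the frame labels are just strings.

```python
def decode_order(display_frames):
    """Reorder frames from display order to decode order.

    Each run of B-frames has to wait until the following I/P frame
    (its forward reference) has been emitted.
    """
    ordered = []
    pending_b = []
    for frame in display_frames:
        if frame.startswith("B"):
            pending_b.append(frame)      # hold until the forward reference arrives
        else:
            ordered.append(frame)        # I or P frame: emit immediately
            ordered.extend(pending_b)    # then the B-frames that were waiting on it
            pending_b.clear()
    ordered.extend(pending_b)            # trailing B-frames with no forward reference
    return ordered

display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(decode_order(display))
```

For the display order shown above this returns I0, P3, B1, B2, P6, B4, B5, matching the decode order in the text.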

2. Macroblocks and Slices

H.264 divides each frame into 16x16 pixel macroblocks. These are the fundamental units of encoding. Each macroblock is independently coded with its own prediction mode, motion vectors (if applicable), transform coefficients, and quantization parameters.

A macroblock contains:

  • A 16x16 luma (brightness) block
  • Two 8x8 chroma (colour) blocks (in the common 4:2:0 chroma subsampling format)

The chroma subsampling is important. Human vision is far more sensitive to brightness than to colour, so H.264 stores colour at half the horizontal and half the vertical resolution of brightness. A 1920x1080 frame in 4:2:0 has full 1920x1080 luma data but only 960x540 Cb and 960x540 Cr chroma data. This alone reduces the data by a factor of 2 compared to full 4:4:4 colour.

For a 1080p frame, there are 120 x 68 = 8,160 macroblocks (1920/16 x 1088/16; the height is rounded up to the nearest multiple of 16, so 1088 rather than 1080).
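Both figures, the macroblock count and the factor-of-2 saving from 4:2:0, can be checked with a short Python sketch. The function names are illustrative, and the sample count assumes even frame dimensions.

```python
def macroblock_grid(width, height, mb_size=16):
    """Return (columns, rows) of macroblocks, rounding each dimension up."""
    cols = (width + mb_size - 1) // mb_size
    rows = (height + mb_size - 1) // mb_size
    return cols, rows

def samples_per_frame(width, height, chroma="4:2:0"):
    """Count luma plus chroma samples for 4:4:4 or 4:2:0 sampling."""
    luma = width * height
    if chroma == "4:4:4":
        return luma * 3
    if chroma == "4:2:0":
        # Cb and Cr are each stored at half width and half height.
        return luma + 2 * (width // 2) * (height // 2)
    raise ValueError(f"unsupported chroma format: {chroma}")

cols, rows = macroblock_grid(1920, 1080)
ratio = samples_per_frame(1920, 1080, "4:4:4") / samples_per_frame(1920, 1080, "4:2:0")
print(cols, rows, cols * rows)
print(ratio)
```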

Macroblocks are grouped into slices, and slices provide error resilience boundaries. Each slice can be independently decoded. If part of a bitstream is corrupted or lost (as happens in broadcast or UDP streaming), only the affected slice is damaged; the rest of the frame can still be decoded. In many encoder configurations, a single frame is a single slice, but network-oriented profiles sometimes use multiple slices per frame.

3. Motion Estimation and Compensation

This is the core of inter-frame compression. The insight is simple: most of the time, consecutive video frames are nearly identical. A security camera pointed at a parking lot in Thessaloniki records 30 frames per second, but nothing moves for minutes at a time. Even in a fast-paced football match, the background (the pitch, the stands) is static, and the players occupy only a small portion of each frame.

Motion estimation is the encoder's process of finding, for each block in the current frame, the best matching block in the reference frame. The result is a motion vector: a displacement (dx, dy) that describes where the block "came from."

Block Matching

The simplest approach is full search (also called exhaustive search). For each macroblock in the current frame, the encoder compares it against every possible position within a search window in the reference frame. If the search window is ±32 pixels in each direction, that is 65 x 65 = 4,225 candidate positions per macroblock. For 8,160 macroblocks in a 1080p frame, that is over 34 million block comparisons per frame, each involving a 16x16 pixel comparison (256 pixel differences to compute). Full search guarantees the optimal motion vector but is computationally brutal.

The comparison metric is usually the Sum of Absolute Differences (SAD):

SAD(dx, dy) = Σ |current[x, y] - reference[x + dx, y + dy]|

where the sum runs over all pixels in the block.
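A minimal full-search sketch in pure Python makes the cost of this approach concrete. It assumes frames are plain lists of rows of integer luma values; the toy frame contents and function names are illustrative, and a real encoder would use heavily optimised SIMD code instead.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """SAD between the block at (bx, by) in `cur` and the block
    displaced by (dx, dy) in `ref`."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total

def full_search(cur, ref, bx, by, size, search_range):
    """Test every displacement within +/- search_range and return
    (best_sad, (dx, dy)). Candidates outside the reference frame are skipped."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best

# Toy 8x8 frames: the current frame is the reference shifted right by 2 pixels.
ref = [[(3 * x + 7 * y) % 31 for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 2, 0)] for x in range(8)] for y in range(8)]
result = full_search(cur, ref, bx=2, by=2, size=4, search_range=2)
print(result)
```

Because the content moved two pixels to the right, the best match for the block lies two pixels to the left in the reference, so the search returns a SAD of 0 with the vector (-2, 0).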

H.264 encoders use faster search algorithms in practice:

Diamond search starts at the predicted motion vector (often (0,0) or the motion vector of the neighbouring block) and evaluates a diamond-shaped pattern of candidates. The search moves to the best candidate and repeats with a smaller diamond until convergence. It typically converges in 10 to 20 evaluations instead of 4,225.

Hexagonal search is similar but uses a hexagonal pattern, which better covers the search space with fewer points. The UMHexagonS algorithm used in the x264 encoder combines multiple search patterns: a small diamond, a hexagonal pattern, an uneven multi-hexagon, and extended search patterns for refinement.

Predictive search uses motion vectors from neighbouring blocks (left, above, above-right) as starting points, since motion tends to be locally coherent (if the camera is panning right, all blocks in the same region have similar motion vectors).

Variable Block Sizes

H.264 does not restrict motion estimation to 16x16 macroblocks. It supports partitioning each macroblock into smaller blocks for more precise motion compensation:

Partition   Size                 Use case
16x16       Full macroblock      Large uniform motion (panning)
16x8        Horizontal split     Horizontal edges
8x16        Vertical split       Vertical edges
8x8         Quarter macroblock   Complex motion
8x4         Sub-partition        Fine detail
4x8         Sub-partition        Fine detail
4x4         Minimum              Very complex motion

Smaller partitions produce better predictions (smaller residuals) but cost more bits to signal the partition type and the additional motion vectors. The encoder runs a rate-distortion optimisation to choose the partition size that minimizes the total cost: bits for motion vectors + bits for the residual.

Sub-Pixel Motion Estimation

Real motion rarely aligns to integer pixel boundaries. A person walking across the frame might move 2.75 pixels to the right between frames. H.264 supports quarter-pixel (qpel) motion estimation.

The process works in two stages:

  1. Integer-pixel search: Find the best integer-pixel motion vector using one of the algorithms above.
  2. Sub-pixel refinement: Interpolate half-pixel positions using a 6-tap FIR filter, then interpolate quarter-pixel positions by averaging adjacent integer and half-pixel positions. Search the sub-pixel positions around the best integer-pixel candidate.

The 6-tap filter for half-pixel interpolation uses coefficients [1, -5, 20, 20, -5, 1] / 32, which is close to a sinc interpolation (the theoretically optimal interpolation filter for band-limited signals).
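In one dimension the filter can be sketched as follows. The rounding offset (+16 before the shift by 5), the clip to the 8-bit range, and the rounded average for quarter-pixel samples are assumptions based on how integer implementations usually realise the division by 32; the function names are illustrative.

```python
def half_pel(samples):
    """Interpolate the half-pixel value midway between samples[2] and
    samples[3], given six consecutive integer-position luma samples."""
    if len(samples) != 6:
        raise ValueError("the 6-tap filter needs exactly six samples")
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * s for t, s in zip(taps, samples))
    value = (acc + 16) >> 5          # divide by 32 with rounding
    return max(0, min(255, value))   # clip to the 8-bit sample range

def quarter_pel(a, b):
    """Quarter-pixel sample: rounded average of two neighbouring samples."""
    return (a + b + 1) >> 1

row = [10, 20, 30, 40, 50, 60]
h = half_pel(row)             # midway between 30 and 40
q = quarter_pel(row[2], h)    # between the integer sample 30 and the half-pel value
print(h, q)
```

On this linear ramp the half-pixel value comes out as 35, exactly midway between its two neighbours, which is what a good interpolation filter should produce for smooth content.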

Quarter-pixel precision noticeably improves prediction accuracy, and therefore compression efficiency, compared to integer-pixel motion estimation. The size of the gain depends on the content, and it comes at the cost of the interpolation computation.

What Gets Transmitted

After motion estimation, the encoder transmits:

  1. The motion vector (dx, dy) for each block partition, differentially coded relative to the predicted motion vector from neighbouring blocks.
  2. The residual: the difference between the actual block and the motion-compensated prediction. This residual is small (mostly near-zero values) if the motion estimation was accurate.

The residual then goes through the transform, quantization, and entropy coding pipeline described in the following sections.

4. The Transform: Discrete Cosine Transform

Raw pixel residuals are not efficient to compress directly. A block of residual pixels might look like this:

 4  3  2  1
 3  2  1  0
 2  1  0 -1
 1  0 -1 -2

There is clear structure here (a smooth gradient), but each value looks independent. The Discrete Cosine Transform (DCT) converts spatial pixel data into frequency coefficients, concentrating the signal energy into a few low-frequency components.

Why Frequency Domain?

Consider a smooth gradient across a block. In the spatial domain, you need to store every pixel value, and they are all nonzero. In the frequency domain, a smooth gradient is represented almost entirely by one or two low-frequency coefficients. The high-frequency coefficients are nearly zero. Since quantization and entropy coding are very efficient at representing "mostly zeros with a few significant values," transforming to the frequency domain dramatically improves compression.

H.264's Integer DCT

Traditional JPEG uses an 8x8 floating-point DCT. H.264 uses a 4x4 integer approximation. This is not an arbitrary simplification. Integer transforms have two critical advantages:

  1. Deterministic decoding. Floating-point arithmetic varies across implementations (different rounding, different precision, different instruction sets). The H.264 standard requires bit-exact decoding, so the transform must produce identical results everywhere. Integer arithmetic guarantees this.
  2. Lower complexity. The 4x4 integer DCT can be computed using only additions and shifts (no multiplications), which is faster and simpler in hardware.

The forward 4x4 core transform matrix, with all-integer entries, is:

       [ 1  1  1  1 ]
Cf =   [ 2  1 -1 -2 ]
       [ 1 -1 -1  1 ]
       [ 1 -2  2 -1 ]

The forward transform of a 4x4 block X is computed as:

Y = Cf · X · Cf^T

followed by element-wise multiplication with a scaling factor matrix that absorbs the normalization (this scaling is folded into the quantization step to avoid extra multiplications).

The rows of the transform matrix represent basis patterns of increasing frequency:

  • Row 0: [1, 1, 1, 1] is DC (constant), proportional to the average value.
  • Row 1: [2, 1, -1, -2] is low frequency, half a cycle across the block (one sign change).
  • Row 2: [1, -1, -1, 1] is medium frequency, one full cycle (two sign changes).
  • Row 3: [1, -2, 2, -1] is high frequency, one and a half cycles (three sign changes).

Example

Take this 4x4 residual block (the difference between the actual pixels and the predicted pixels):

 8  6  4  2
 6  4  2  0
 4  2  0 -2
 2  0 -2 -4

After the forward transform, most of the energy concentrates into the top-left (low-frequency) coefficients. The transformed block might look like:

 24   16   0   0
 16    8   0   0
  0    0   0   0
  0    0   0   0

(These numbers are simplified for illustration.) The original block had 13 nonzero values. The transformed block has only 4 nonzero values. The information is the same, but it is now in a form that compresses much better.
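To see what the core transform produces before any scaling, the matrix product Y = Cf · X · Cf^T can be evaluated directly. The sketch below uses plain Python lists and omits the scaling step that H.264 folds into quantization, so its raw output differs from the simplified figures above, but the energy compaction is the same idea.

```python
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_core_transform(block):
    """Y = Cf . X . Cf^T, without the post-scaling."""
    return matmul(matmul(CF, block), transpose(CF))

residual = [
    [8, 6,  4,  2],
    [6, 4,  2,  0],
    [4, 2,  0, -2],
    [2, 0, -2, -4],
]
coeffs = forward_core_transform(residual)
for row in coeffs:
    print(row)
nonzero = sum(1 for row in coeffs for c in row if c != 0)
print("nonzero coefficients:", nonzero)
```

For this gradient block the unscaled output has five nonzero coefficients, all in the first row and first column, and the entire lower-right region is zero.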

H.264 also defines an 8x8 integer DCT for the High profile, which captures longer-range correlations in smooth areas at the cost of reduced spatial adaptivity.

Hadamard Transform for DC Coefficients

H.264 applies an additional Hadamard transform to the DC coefficients (the top-left coefficient of each 4x4 block) in two cases: a 4x4 Hadamard transform for the 16 luma DC coefficients of a macroblock coded in 16x16 intra prediction mode, and a 2x2 Hadamard transform for the four DC coefficients of each chroma component (in 4:2:0). This exploits the correlation between the DC values of neighbouring blocks, which tends to be high in smooth regions, further improving compression.

5. Quantization: Where Quality Is Lost

Everything described so far is either lossless or nearly lossless. Motion estimation is lossless (the exact residual is preserved). The integer DCT is lossless (it is an invertible transform). Quantization is where information is permanently discarded. Inside the codec it is the only lossy step (the 4:2:0 chroma subsampling described earlier also discards data, but that happens before encoding), and it is where the encoder trades quality for file size.

How Quantization Works

Quantization divides each DCT coefficient by a quantization step size and rounds to the nearest integer. Small coefficients (typically high-frequency ones, which represent fine detail) get rounded to zero. Large coefficients (typically low-frequency ones, which represent the overall structure) survive.

The formula is:

level = round(coefficient / Qstep)

and the inverse (dequantization) during decoding is:

coefficient' = level × Qstep

The difference between coefficient and coefficient' is the quantization error, and it is permanent. It can never be recovered. This is the fundamental source of quality loss in video compression.

The Quantization Parameter (QP)

H.264 uses a Quantization Parameter (QP) that ranges from 0 to 51. The relationship between QP and the actual quantization step size (Qstep) is exponential:

QP    Qstep    Effect
0     0.625    Minimal quantization, nearly lossless
10    2.0      Very high quality
18    5.0      High quality (typical for archival)
22    8.0      Good quality (near the x264 default of CRF 23)
26    13       Medium quality (streaming)
30    20       Noticeable artifacts on close inspection
36    40       Visible degradation
40    64       Low quality
51    224      Extreme compression, severe artifacts

Every increase of 6 in QP doubles the Qstep, which roughly halves the bitrate. This exponential relationship means that QP 28 produces approximately half the bitrate of QP 22, and QP 34 produces approximately a quarter.

Quantization Example

Take the transformed block from the previous section:

 24   16   0   0
 16    8   0   0
  0    0   0   0
  0    0   0   0

With a Qstep of 8 (QP 22):

level[0,0] = round(24 / 8) = 3
level[0,1] = round(16 / 8) = 2
level[1,0] = round(16 / 8) = 2
level[1,1] = round(8 / 8) = 1
All others: 0

The quantized block is:

 3  2  0  0
 2  1  0  0
 0  0  0  0
 0  0  0  0

After dequantization (multiplying by 8):

 24  16  0  0
 16   8  0  0
  0   0  0  0
  0   0  0  0

In this example, the reconstruction is perfect because the original values happened to be exact multiples of 8. In practice, there is always some rounding error.

Now consider the same block with a higher QP. With a Qstep of 25.4 (close to QP 32, whose Qstep is 26):

level[0,0] = round(24 / 25.4) = 1
level[0,1] = round(16 / 25.4) = 1
level[1,0] = round(16 / 25.4) = 1
level[1,1] = round(8 / 25.4) = 0

The quantized block is:

 1  1  0  0
 1  0  0  0
 0  0  0  0
 0  0  0  0

After dequantization:

 25.4  25.4  0  0
 25.4   0    0  0
  0     0    0  0
  0     0    0  0

The reconstruction is now noticeably different from the original. The fine gradient information is lost. This is exactly the kind of degradation that becomes visible as "blockiness" or "mosquito noise" in heavily compressed video.
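The two examples can be replayed in code. This is a simplified floating-point sketch: real H.264 quantizers use integer multiplier tables and a rounding offset chosen by the encoder, and the base Qstep values for QP 0 to 5 are the ones commonly tabulated for the standard, which should be treated as an assumption here.

```python
import math

# Qstep for QP 0..5 (assumed from the commonly published table).
# Every further +6 in QP doubles the step size.
BASE_QSTEP = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)

def qstep(qp):
    """Quantization step size for a QP in the range 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be between 0 and 51")
    return BASE_QSTEP[qp % 6] * (1 << (qp // 6))

def quantize(coeffs, step):
    """Divide each coefficient by the step and round half away from zero."""
    def level(c):
        magnitude = math.floor(abs(c) / step + 0.5)
        return magnitude if c >= 0 else -magnitude
    return [[level(c) for c in row] for row in coeffs]

def dequantize(levels, step):
    """Decoder side: scale the levels back up. The rounding loss stays."""
    return [[lvl * step for lvl in row] for row in levels]

coeffs = [
    [24, 16, 0, 0],
    [16,  8, 0, 0],
    [ 0,  0, 0, 0],
    [ 0,  0, 0, 0],
]

for step in (8, 25.4):
    levels = quantize(coeffs, step)
    recon = dequantize(levels, step)
    worst = max(abs(c - r) for crow, rrow in zip(coeffs, recon)
                for c, r in zip(crow, rrow))
    print(step, levels[0][:2], levels[1][:2], "max error:", round(worst, 1))

print(qstep(22), qstep(28), qstep(34))   # each +6 in QP doubles the step
```

With the finer step the maximum reconstruction error is zero; with the coarser step the coefficient 16 is reconstructed as 25.4, an error of 9.4.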

Frequency-Dependent Quantization

H.264's quantization is not uniform across all frequencies. The standard defines scaling matrices that allow different quantization strengths for different frequency positions. The default flat matrix treats all frequencies equally, but the High profile allows custom quantization matrices (similar to JPEG's quantization tables).

The perceptual rationale: human vision is much less sensitive to high-frequency detail than to low-frequency structure. You notice a wrong colour in a large area (low-frequency error) far more than you notice a missing fine texture (high-frequency error). Custom matrices exploit this by quantizing high-frequency coefficients more aggressively.

Rate-Distortion Optimization

The encoder does not simply pick a single QP and apply it uniformly. Modern encoders like x264 perform rate-distortion optimisation (RDO) at the macroblock level. For each macroblock, the encoder tries multiple coding options (different partition sizes, prediction modes, QP values) and selects the one that minimizes a cost function:

J = D + λ × R

where:

  • D is the distortion (typically measured as Sum of Squared Differences between the original and reconstructed block)
  • R is the number of bits required to encode the block
  • λ (lambda) is the Lagrangian multiplier that controls the quality/bitrate tradeoff

A higher λ favours smaller files (more aggressive quantization). A lower λ favours higher quality. The encoder adjusts λ frame by frame to meet the target bitrate or quality level.

This RDO process is why encoding is so much slower than decoding. The decoder only needs to follow one path. The encoder must explore many paths and pick the best one.
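The mode decision itself is a small piece of logic once D and R are known. In the sketch below the candidate list and its distortion and bit figures are made up for illustration; in a real encoder they come from actually coding the macroblock in each mode.

```python
def rd_cost(distortion, bits, lam):
    """Lagrangian cost J = D + lambda * R."""
    return distortion + lam * bits

def choose_mode(candidates, lam):
    """candidates: iterable of (name, distortion, bits) tuples."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))

# Hypothetical measurements for one macroblock: (mode, SSD distortion, bits).
candidates = [
    ("16x16", 900, 40),
    ("8x8",   400, 95),
    ("4x4",   250, 180),
]

for lam in (1, 5, 20):
    print(lam, choose_mode(candidates, lam)[0])
```

As λ grows, the choice moves from the expensive, accurate 4x4 partitioning through 8x8 to the cheap 16x16 mode, which is the quality/bitrate tradeoff described above.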

6. Entropy Coding: Squeezing Out Redundancy

After quantization, the transformed residual is a sparse matrix: mostly zeros with a few nonzero coefficients concentrated in the low-frequency corner. Entropy coding compresses this sparse data into the fewest possible bits.

Zig-Zag Scan

Before entropy coding, the 4x4 (or 8x8) block of quantized coefficients is serialized into a one-dimensional sequence using a zig-zag scan pattern:

 0  1  5  6
 2  4  7 12
 3  8 11 13
 9 10 14 15

The numbers indicate the order in which coefficients are read. This pattern starts at the top-left (DC coefficient, lowest frequency) and progresses toward the bottom-right (highest frequency). Since high-frequency coefficients are most likely to be zero after quantization, the zig-zag scan tends to produce long runs of trailing zeros.

For the quantized block:

 3  2  0  0
 2  1  0  0
 0  0  0  0
 0  0  0  0

The zig-zag scan produces: 3, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

This can be represented as: (3), (2), (2), (skip 1, 1), (end of block). Much more compact.
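A short Python sketch of the scan and the run/value representation, using the scan-order matrix shown above. The function names are illustrative, and the pair format is only a conceptual view, not the actual bitstream syntax.

```python
# Scan position of each coefficient, copied from the zig-zag matrix in the text.
SCAN_INDEX = [
    [0,  1,  5,  6],
    [2,  4,  7, 12],
    [3,  8, 11, 13],
    [9, 10, 14, 15],
]

def zigzag_scan(block):
    """Serialize a 4x4 block into a 16-element list in zig-zag order."""
    out = [0] * 16
    for r in range(4):
        for c in range(4):
            out[SCAN_INDEX[r][c]] = block[r][c]
    return out

def run_level_pairs(scanned):
    """(zeros skipped, value) for each nonzero coefficient.
    Trailing zeros are implied by the end of the block."""
    pairs, run = [], 0
    for v in scanned:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs

quantized = [
    [3, 2, 0, 0],
    [2, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
scanned = zigzag_scan(quantized)
print(scanned)
print(run_level_pairs(scanned))
```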

CAVLC: Context-Adaptive Variable-Length Coding

CAVLC is the entropy coding method available in all H.264 profiles. It encodes the quantized coefficients using variable-length codes (similar to Huffman coding) that adapt based on the context of neighbouring blocks.

The encoding process for each block:

  1. Encode the total number of nonzero coefficients and trailing ones (up to three consecutive coefficients with absolute value 1 at the high-frequency end of the zig-zag scan). The code table used depends on the number of nonzero coefficients in neighbouring blocks (left and above), which is the "context-adaptive" part.

  2. Encode the sign of trailing ones. Each sign is one bit.

  3. Encode the levels (sign and magnitude) of the remaining nonzero coefficients, starting from the highest frequency and working backward. The coding table adapts based on recently coded levels: after coding a large level, the encoder switches to a table designed for larger values.

  4. Encode the total number of zeros before the last nonzero coefficient.

  5. Encode the run of zeros before each nonzero coefficient (run-before codes).
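The quantities that these five steps encode can be derived from a scanned block with a few lines of Python. This sketch only computes the symbols; it does not reproduce the variable-length code tables, and the dictionary keys are illustrative names rather than bitstream syntax.

```python
def cavlc_symbols(scanned):
    """Derive the values CAVLC codes for one block from its
    zig-zag-scanned coefficients (lowest frequency first)."""
    positions = [i for i, v in enumerate(scanned) if v != 0]
    total_coeff = len(positions)

    # Trailing ones: up to three consecutive +/-1 values at the high-frequency end.
    trailing_ones = 0
    for i in reversed(positions):
        if abs(scanned[i]) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    # Zeros located before the last nonzero coefficient.
    total_zeros = positions[-1] + 1 - total_coeff if positions else 0

    # Zeros immediately preceding each nonzero coefficient, highest frequency
    # first. A real coder stops as soon as all zeros are accounted for.
    run_before, prev = [], -1
    for i in positions:
        run_before.append(i - prev - 1)
        prev = i
    run_before.reverse()

    levels = [scanned[i] for i in reversed(positions)]
    return {
        "total_coeff": total_coeff,
        "trailing_ones": trailing_ones,
        "levels_high_to_low": levels,
        "total_zeros": total_zeros,
        "run_before": run_before,
    }

symbols = cavlc_symbols([3, 2, 2, 0, 1] + [0] * 11)
print(symbols)
```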

CABAC: Context-Adaptive Binary Arithmetic Coding

CABAC is available in the Main and High profiles and is significantly more powerful than CAVLC. It achieves 5 to 15% better compression at the cost of higher computational complexity.

Arithmetic coding represents an entire message as a single number in the interval [0, 1). Unlike Huffman coding, which must assign an integer number of bits to each symbol (meaning a symbol with probability 0.7 still costs at least 1 bit, even though its information content is only 0.51 bits), arithmetic coding can approach the theoretical entropy limit.
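The 0.51-bit figure comes straight from the definition of information content. The sketch below also computes the entropy of a two-symbol source, which is the average cost an ideal arithmetic coder approaches, against the 1 bit per symbol a Huffman code must spend on a binary alphabet.

```python
import math

def information_bits(p):
    """Information content, in bits, of a symbol with probability p."""
    return -math.log2(p)

def binary_entropy(p):
    """Average bits per symbol for a source emitting one symbol with
    probability p and the other with probability 1 - p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p = 0.7
info = information_bits(p)
entropy = binary_entropy(p)
print(round(info, 2), "bits for the likely symbol")
print(round(entropy, 2), "bits per symbol on average, versus 1 bit with Huffman")
```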

CABAC works in three stages:

  1. Binarization: Each syntax element (coefficient level, motion vector component, block type, etc.) is converted into a binary string using predefined binarization schemes (unary, truncated unary, Exp-Golomb, or fixed-length).

  2. Context modeling: Each binary decision (bin) is assigned a probability model based on the context: what type of syntax element is being coded, the position within the binarization, and the values of previously coded bins and neighbouring elements. H.264 defines about 400 context models.

  3. Arithmetic coding engine: Each bin is coded using the current probability estimate, and the probability model is updated after each bin. The model adapts rapidly to local statistics: if a particular type of bin has been 1 recently, its probability of being 1 increases.

The computational cost of CABAC is substantial because each bin must be coded sequentially (the arithmetic coding state depends on the previous bin). This makes CABAC strictly sequential and difficult to parallelize, which is why some hardware decoders and low-power devices only support the Baseline profile with CAVLC.

Exp-Golomb Coding

H.264 uses Exponential-Golomb codes for various header and syntax elements outside the residual data: macroblock types, reference indices, motion vector differences, slice headers.

An Exp-Golomb code for a nonnegative integer n is constructed as:

code(n) = [M zeros] [1] [INFO]

where:
M = floor(log2(n + 1))
INFO = (n + 1) - 2^M, as an M-bit binary number

Examples:

Value   Code      Bits
0       1         1
1       010       3
2       011       3
3       00100     5
4       00101     5
5       00110     5
6       00111     5
7       0001000   7

Small values get short codes, large values get long codes. This is optimal when the distribution favours small values, which is true for most H.264 syntax elements (small motion vector differences are far more common than large ones).
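The construction translates almost directly into code. The sketch below works on strings of '0' and '1' characters for readability (a real bitstream writer would pack bits), and it covers only the unsigned variant described above.

```python
def exp_golomb_encode(n):
    """Unsigned Exp-Golomb code for a nonnegative integer, as a bit string."""
    if n < 0:
        raise ValueError("n must be nonnegative")
    value = n + 1
    m = value.bit_length() - 1            # M = floor(log2(n + 1))
    # The binary form of n + 1 is the '1' marker followed by the M INFO bits.
    return "0" * m + format(value, "b")

def exp_golomb_decode(bits, pos=0):
    """Decode one code word starting at bits[pos]; return (value, next_pos)."""
    m = 0
    while bits[pos + m] == "0":           # count the leading zeros
        m += 1
    end = pos + 2 * m + 1
    return int(bits[pos + m:end], 2) - 1, end

codes = [exp_golomb_encode(n) for n in range(8)]
print(codes)

# Code words are self-delimiting, so they can be concatenated and read back.
stream = "".join(exp_golomb_encode(n) for n in (3, 0, 7))
decoded, pos = [], 0
while pos < len(stream):
    value, pos = exp_golomb_decode(stream, pos)
    decoded.append(value)
print(decoded)
```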

7. Intra Prediction

I-frames have no reference to other frames, but that does not mean every macroblock is coded independently. H.264 exploits spatial redundancy within a frame by predicting each block from its already-decoded neighbours (the blocks above and to the left, which have already been decoded in raster scan order).

4x4 Luma Intra Prediction Modes

H.264 defines 9 intra prediction modes for 4x4 luma blocks:

Mode   Name                  Description
0      Vertical              Copy the pixels from the row above
1      Horizontal            Copy the pixels from the column to the left
2      DC                    Fill with the average of above and left border pixels
3      Diagonal Down-Left    Interpolate diagonally from top-right to bottom-left
4      Diagonal Down-Right   Interpolate diagonally from top-left to bottom-right
5      Vertical-Right        Interpolate between vertical and diagonal down-right
6      Horizontal-Down       Interpolate between horizontal and diagonal down-right
7      Vertical-Left         Interpolate between vertical and diagonal down-left
8      Horizontal-Up         Interpolate between horizontal and diagonal down-left

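Visually, each directional mode propagates the neighbouring pixels into the block along a fixed angle, which lets it follow edges and gradients in that direction (DC is the only non-directional mode):

Mode                      Propagation direction
0 Vertical                Straight down
1 Horizontal              Straight to the right
3 Diagonal Down-Left      Down and to the left at 45°
4 Diagonal Down-Right     Down and to the right at 45°
5 Vertical-Right          Down, leaning about 26.6° to the right
6 Horizontal-Down         To the right, leaning about 26.6° downward
7 Vertical-Left           Down, leaning about 26.6° to the left
8 Horizontal-Up           To the right, leaning about 26.6° upward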

For each 4x4 block, the encoder tries all 9 modes, computes the residual (actual pixels minus prediction) for each, and selects the mode that produces the smallest residual. The chosen mode index is transmitted in the bitstream. Since neighbouring blocks tend to have similar prediction modes (smooth regions tend to stay smooth), the mode index is differentially coded: the encoder transmits whether the mode matches the predicted most-probable mode, and if not, which of the 8 remaining modes it is.
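The mode search can be sketched for the three simplest modes. This is an illustration under simplifying assumptions: only vertical, horizontal, and DC are implemented, all neighbours are assumed available, and the cost is plain SAD, whereas a real encoder also accounts for the bits needed to signal the mode.

```python
def predict_4x4(mode, above, left):
    """Build a 4x4 prediction from the four pixels above and the four
    pixels to the left. Modes: 0 vertical, 1 horizontal, 2 DC."""
    if mode == 0:
        return [list(above) for _ in range(4)]
    if mode == 1:
        return [[left[y]] * 4 for y in range(4)]
    if mode == 2:
        dc = (sum(above) + sum(left) + 4) >> 3   # rounded average of 8 pixels
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("only modes 0, 1 and 2 are sketched here")

def block_sad(a, b):
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def best_mode(block, above, left, modes=(0, 1, 2)):
    """Return (mode, sad) for the prediction with the smallest residual."""
    costs = {m: block_sad(block, predict_4x4(m, above, left)) for m in modes}
    mode = min(costs, key=costs.get)
    return mode, costs[mode]

# A block with vertical stripes: each column continues the pixel above it.
above = [50, 60, 70, 80]
left = [52, 52, 52, 52]
block = [
    [50, 61, 70, 79],
    [51, 60, 71, 80],
    [50, 60, 69, 80],
    [49, 60, 70, 81],
]
choice = best_mode(block, above, left)
print(choice)
```

Vertical prediction wins here with a SAD of 7, leaving a residual of mostly zeros and ones for the transform stage.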

16x16 Luma Intra Prediction

For macroblocks in smooth areas (sky, walls, gradual gradients), predicting each 4x4 block separately wastes bits on mode signalling. H.264 also supports 16x16 intra prediction with 4 modes:

  • Mode 0: Vertical
  • Mode 1: Horizontal
  • Mode 2: DC
  • Mode 3: Plane (a planar surface fitted to the border pixels)

The plane mode is particularly effective for smooth gradients. It fits a tilted plane to the top and left border pixels:

pred[x, y] = Clip((a + b × (x - 7) + c × (y - 7) + 16) >> 5)

where a, b, and c are computed from the boundary pixels. This single prediction mode can accurately represent a linear gradient across the entire macroblock, leaving a tiny residual.
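For completeness, here is a sketch of how a, b, and c can be derived. The gradient formulas (the weighted sums H and V, and the 5/64 scaling) follow my reading of the standard's 16x16 plane mode and should be checked against the specification before being relied on; the argument layout is an illustrative choice.

```python
def plane_predict_16x16(top, left, corner):
    """Plane prediction for a 16x16 luma block.

    top:    the 16 pixels in the row above the block
    left:   the 16 pixels in the column to the left
    corner: the pixel above and to the left of the block
    Returns pred[y][x].
    """
    def p_top(x):                 # row above, where x = -1 is the corner pixel
        return corner if x < 0 else top[x]

    def p_left(y):                # left column, where y = -1 is the corner pixel
        return corner if y < 0 else left[y]

    # Weighted horizontal and vertical gradients of the border pixels.
    h = sum((i + 1) * (p_top(8 + i) - p_top(6 - i)) for i in range(8))
    v = sum((i + 1) * (p_left(8 + i) - p_left(6 - i)) for i in range(8))

    a = 16 * (left[15] + top[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6

    def clip(s):
        return max(0, min(255, s))

    return [[clip((a + b * (x - 7) + c * (y - 7) + 16) >> 5) for x in range(16)]
            for y in range(16)]

# Border pixels taken from a horizontal ramp that rises by 2 per pixel.
top = [50 + 2 * x for x in range(16)]
left = [48] * 16
pred = plane_predict_16x16(top, left, corner=48)
print(pred[0])
```

With these borders the prediction reproduces the ramp 50, 52, ..., 80 in every row, so a macroblock that really contains that gradient would leave a zero residual.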

Chroma Intra Prediction

Chroma blocks (Cb and Cr) use 4 prediction modes similar to the 16x16 luma modes but applied to 8x8 chroma blocks (in 4:2:0). The mode numbering is different: DC is mode 0, horizontal is mode 1, vertical is mode 2, and plane is mode 3.

8. The Deblocking Filter

Block-based coding produces a characteristic artifact: visible edges at block boundaries. Adjacent blocks may be quantized differently, predicted from different references, or use different coding modes. The result is discontinuities at the 4x4 and macroblock boundaries that the human eye perceives as a grid pattern, especially at low bitrates.

H.264 addresses this with an adaptive in-loop deblocking filter. The term "in-loop" is critical. The filter operates on the reconstructed frame inside the encoding/decoding loop, meaning the filtered output becomes the reference for future P and B frames. This is different from a post-processing filter applied after decoding: an in-loop filter improves not just the current frame's appearance but also the prediction accuracy of subsequent frames, creating a compounding benefit.

How the Filter Works

The deblocking filter operates on every 4x4 block boundary (both horizontal and vertical). For each boundary, it examines the pixels on either side and decides:

  1. Whether to filter at all. If the pixel values change sharply across the boundary because of actual image content (a real edge), filtering would blur the image. The filter only activates when the discontinuity is likely an artifact rather than a real edge.

  2. How strongly to filter. The filter strength depends on:

    • The QP of the adjacent blocks (higher QP means coarser quantization, meaning larger artifacts, meaning stronger filtering is appropriate)
    • The pixel gradient across the boundary
    • A boundary strength parameter (Bs) that depends on the coding modes of the adjacent blocks

The boundary strength Bs ranges from 0 to 4:

Bs   Filter strength   Condition
4    Strongest         One block is intra-coded and the boundary is a macroblock edge
3    Strong            One block is intra-coded (boundary inside a macroblock)
2    Medium            One block has nonzero residual coefficients
1    Light             Different reference frames, or motion vectors that differ by at least one luma sample
0    No filtering      Both blocks are inter-coded with same reference and similar motion vectors, and neither has nonzero residual

When Bs is 1, 2, or 3, the filter adjusts up to 2 luma pixels on each side of the boundary using a weighted average. When Bs is 4 (strongest), the filter can modify up to 3 luma pixels on each side.

The filter thresholds (alpha and beta) are derived from the QP. At QP 0, the thresholds are near zero (no filtering). At QP 51, the thresholds are maximum (aggressive filtering). The encoder can also adjust the filter strength per-slice using the slice_alpha_c0_offset_div2 and slice_beta_offset_div2 syntax elements, which shift the threshold curves.
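The two decisions, how strong and whether to filter at all, can be sketched as follows. The Bs ordering follows the standard as I understand it (intra first, then coded residual, then motion differences); the function signatures are illustrative, interlaced and other special cases are ignored, and alpha and beta are passed in directly because the QP-indexed threshold tables are not reproduced here.

```python
def boundary_strength(p_intra, q_intra, mb_edge,
                      p_has_coeffs, q_has_coeffs,
                      same_refs, mv_diff_qpel):
    """Bs for the edge between neighbouring blocks p and q.
    mv_diff_qpel is the largest motion vector component difference,
    in quarter-pixel units."""
    if p_intra or q_intra:
        return 4 if mb_edge else 3
    if p_has_coeffs or q_has_coeffs:
        return 2
    if not same_refs or mv_diff_qpel >= 4:    # 4 quarter-pels = 1 luma sample
        return 1
    return 0

def should_filter(p1, p0, q0, q1, bs, alpha, beta):
    """Filter only if the step across the edge is small enough to be an
    artifact and both sides are smooth. p0 and q0 touch the edge."""
    return (bs > 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)

bs = boundary_strength(False, False, True, True, False, True, 0)
print(bs)

# Hypothetical thresholds for a mid-range QP.
alpha, beta = 20, 6
print(should_filter(100, 101, 108, 109, bs, alpha, beta))   # small step: artifact
print(should_filter(100, 101, 180, 181, bs, alpha, beta))   # large step: real edge
```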

Perceptual Impact

The deblocking filter is one of the most important quality features in H.264. Without it, H.264 video at moderate bitrates would look noticeably worse. Informal testing consistently shows that disabling the deblocking filter at the same bitrate produces visible blocking artifacts that are more objectionable than the slight softening the filter introduces. This is why the filter is enabled by default in every sensible encoder configuration.

The in-loop nature of the filter also improves compression efficiency by 5 to 10%. Filtered reference frames produce better motion-compensated predictions, resulting in smaller residuals for subsequent frames. This is a virtuous cycle: better references produce smaller residuals, which quantize to fewer bits, which allows higher quality at the same bitrate.

9. Profiles and Levels

H.264 defines a hierarchy of profiles and levels that specify which coding tools the decoder must support and what resolution/bitrate combinations are allowed. This is a practical necessity. A mobile phone from 2008 cannot decode the same bitstream as a 2024 desktop workstation, and the standard must accommodate both.

Profiles

Profile                Key features                                                      Typical use
Constrained Baseline   No B-frames, no CABAC, no weighted prediction, no 8x8 transform   Video calls, mobile, low-power
Main                   B-frames, CABAC, weighted prediction                              Broadcast TV (DVB), Blu-ray
High                   8x8 transform, custom quantization matrices, monochrome support   Streaming, Blu-ray, broadcast
High 10                10-bit colour depth                                               HDR content, professional
High 4:2:2             4:2:2 chroma subsampling                                          Professional video, broadcast
High 4:4:4             4:4:4 chroma, lossless coding                                     Studio production

The Constrained Baseline profile's restrictions are severe but deliberate. No B-frames means no reordering latency, which matters for real-time communication. No CABAC means simpler, faster, more power-efficient decoding. WebRTC's video codec requirements (RFC 7742) make H.264 Constrained Baseline mandatory to implement alongside VP8, and these properties are a large part of why it suits real-time calls.

The High profile adds 8x8 transforms, which are more efficient for smooth content (larger blocks capture longer-range correlations). It also adds custom quantization matrices, allowing the encoder to tune quantization to the content type. Most modern streaming uses High profile.

Levels

Levels constrain the decoder's workload: maximum resolution, frame rate, macroblocks per second, bitrate, and DPB (Decoded Picture Buffer) size.

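| Level | Max Resolution | Max Frame Rate | Max Bitrate (High) | Typical Use |
| --- | --- | --- | --- | --- |
| 3.0 | 720x576 | 25 fps | 12.5 Mbit/s | SD broadcast (PAL) |
| 3.1 | 1280x720 | 30 fps | 17.5 Mbit/s | 720p streaming |
| 4.0 | 1920x1080 | 30 fps | 25 Mbit/s | 1080p streaming, Blu-ray |
| 4.1 | 1920x1080 | 30 fps | 62.5 Mbit/s | 1080p high bitrate |
| 4.2 | 1920x1080 | 60 fps | 62.5 Mbit/s | 1080p60 gaming content |
| 5.0 | 2560x1920 | 30 fps | 168.75 Mbit/s | Above-1080p formats such as 1440p (a 3840x2160 frame exceeds the frame-size limit) |
| 5.1 | 3840x2160 | 30 fps | 300 Mbit/s | 4K UHD (rarely used with H.264); 4096x2160 fits only up to about 28 fps |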

Netflix uses High profile Level 4.0 for most of its 1080p H.264 content, which allows up to 25 Mbit/s (far more than the typical 5 to 8 Mbit/s they actually use). For 4K content, Netflix uses HEVC or AV1, not H.264, because H.264 at 4K requires impractically high bitrates.

The level also constrains the DPB size, which limits how many reference frames can be stored. Level 4.0 allows up to 4 reference frames at 1080p. More reference frames improve compression (the encoder has more candidates for motion compensation) but require more decoder memory.
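The level constraints can be checked mechanically. The following is a minimal sketch, assuming the per-level limits (maximum frame size in macroblocks, macroblocks per second, and DPB size in macroblocks) as transcribed here from Table A-1 of the H.264 specification. The names `LEVEL_LIMITS`, `fits_level`, and `max_ref_frames` are made up for this illustration, and bitrate limits are deliberately left out.

```python
import math

# Per-level limits, transcribed from Table A-1 of the H.264 spec
# (an assumption of this sketch; verify against the standard before relying on them).
# level: (max macroblocks per frame, max macroblocks per second, max DPB macroblocks)
LEVEL_LIMITS = {
    "3.0": (1620, 40500, 8100),
    "3.1": (3600, 108000, 18000),
    "4.0": (8192, 245760, 32768),
    "4.1": (8192, 245760, 32768),
    "4.2": (8704, 522240, 34816),
    "5.0": (22080, 589824, 110400),
    "5.1": (36864, 983040, 184320),
}

def macroblocks(width, height):
    """Number of 16x16 macroblocks needed to cover one frame."""
    return math.ceil(width / 16) * math.ceil(height / 16)

def fits_level(width, height, fps, level):
    """True if frame size and macroblock rate are within the level's limits."""
    max_frame_mbs, max_mbs_per_sec, _ = LEVEL_LIMITS[level]
    mbs = macroblocks(width, height)
    return mbs <= max_frame_mbs and mbs * fps <= max_mbs_per_sec

def max_ref_frames(width, height, level):
    """Whole frames that fit in the DPB (the spec caps this at 16)."""
    _, _, max_dpb_mbs = LEVEL_LIMITS[level]
    return min(max_dpb_mbs // macroblocks(width, height), 16)

print(fits_level(1920, 1080, 30, "4.0"))  # True
print(max_ref_frames(1920, 1080, "4.0"))  # 4
print(fits_level(1920, 1080, 60, "4.0"))  # False: macroblock rate too high
print(fits_level(1920, 1080, 60, "4.2"))  # True
```

Note that a 1080-line frame is coded as 68 macroblock rows (1088 lines), which is why 1080p occupies 8,160 macroblocks and exactly four such frames fit in the Level 4.0 DPB.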

10. Beyond H.264: HEVC, AV1, and VVC

H.264 was a revolutionary codec, but it has been surpassed by several successors, each offering better compression at the cost of higher encoding complexity.

H.265/HEVC (2013)

The High Efficiency Video Coding standard roughly halves the bitrate required for the same quality compared to H.264. Key improvements:

Larger coding units. HEVC replaces the fixed 16x16 macroblock with a flexible Coding Tree Unit (CTU) of up to 64x64 pixels. The CTU can be recursively subdivided into smaller Coding Units (CUs) using a quadtree structure, allowing the encoder to use large blocks in smooth areas and small blocks around edges. This single change accounts for a significant portion of HEVC's efficiency gain.

Better intra prediction. HEVC supports 35 intra prediction modes (compared to H.264's 9 for 4x4 blocks), providing finer angular resolution.

Better motion compensation. HEVC adds merge mode, advanced motion vector prediction, and improved interpolation filters.

Sample Adaptive Offset (SAO) filter. An additional in-loop filter beyond deblocking that reduces ringing artifacts.

Improved entropy coding. HEVC uses only CABAC (no CAVLC option), with improved context modelling.

The downside: HEVC encoding is 3 to 10 times slower than H.264 encoding at equivalent quality settings. And the licensing situation is notoriously complex. Multiple patent pools (MPEG-LA, HEVC Advance, Velos Media) claim essential patents, and the total royalty burden is substantially higher than H.264. This licensing mess is the primary reason that HEVC adoption has been slower than expected and directly motivated the creation of AV1.
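The recursive CTU subdivision described above can be illustrated with a toy quadtree. This is a sketch under loud assumptions: a real HEVC encoder decides splits by rate-distortion cost, whereas this example uses a plain pixel-variance threshold as a stand-in, and `split_ctu`, `min_size`, and `threshold` are hypothetical names and parameters chosen for illustration.

```python
def variance(frame, x, y, size):
    """Pixel variance of the size x size block whose top-left corner is (x, y)."""
    pixels = [frame[j][i] for j in range(y, y + size) for i in range(x, x + size)]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def split_ctu(frame, x, y, size, min_size=8, threshold=25.0):
    """Return the leaf blocks (x, y, size) of a quadtree over one CTU.

    A block is kept whole when it is smooth (low variance) or already at the
    minimum size; otherwise it is split into four equal quadrants.
    The variance test is a stand-in for a real rate-distortion decision.
    """
    if size <= min_size or variance(frame, x, y, size) <= threshold:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += split_ctu(frame, x + dx, y + dy, half, min_size, threshold)
    return leaves

# A flat 64x64 CTU with one bright 8x8 patch in the top-left corner.
ctu = [[100] * 64 for _ in range(64)]
for row in range(8):
    for col in range(8):
        ctu[row][col] = 200

leaves = split_ctu(ctu, 0, 0, 64)
print(len(leaves))  # 10
```

The smooth three quarters of the CTU stay as large 32x32 and 16x16 blocks, and only the corner containing the patch is refined down to 8x8, which is the behaviour the quadtree is meant to provide.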

AV1 (2018)

Developed by the Alliance for Open Media (Google, Mozilla, Microsoft, Amazon, Netflix, Apple, and many others), AV1 is a royalty-free codec designed as a direct competitor to HEVC.

AV1's technical features include:

  • Superblocks up to 128x128 with flexible recursive partitioning (a 10-way partition tree that adds rectangular, T-shaped, and 4-way strip splits to the basic quadtree)
  • Constrained directional enhancement filter (CDEF) for edge preservation
  • Loop restoration filter with switchable Wiener and self-guided filtering
  • Film grain synthesis for preserving film grain appearance without wasting bits encoding it
  • Screen content coding tools for desktop sharing and gaming

AV1 achieves compression efficiency comparable to or slightly better than HEVC. The encoding speed was initially a serious problem (10 to 100 times slower than H.264 in the reference encoder), but production encoders like SVT-AV1 and rav1e have narrowed the gap significantly. Hardware decoder support has become widespread since 2021, with support in most modern smartphones, GPUs, and smart TVs.

VVC/H.266 (2020)

Versatile Video Coding is the latest standard from ITU-T/ISO, targeting 30 to 50% bitrate reduction over HEVC. Key additions:

  • Affine motion compensation for rotation and zoom
  • Bi-directional optical flow (BDOF) for refining bi-prediction at the decoder
  • Geometric partitioning for non-rectangular block shapes
  • Adaptive loop filter (ALF) with Wiener-based design
  • Block sizes up to 128x128 with multi-type tree partitioning

VVC is still in early deployment. Hardware decoder support is limited, encoding is extremely compute-intensive, and the licensing situation is not yet fully resolved.

Codec Comparison

Here is a rough comparison of bitrate required for equivalent visual quality at 1080p30, relative to H.264:

| Codec | Year | Relative Bitrate (lower is better) | Encoding Speed (relative) | Licensing |
| --- | --- | --- | --- | --- |
| H.264/AVC | 2003 | 1.0x (baseline) | 1.0x | MPEG-LA pool (~€0.10/unit, capped) |
| H.265/HEVC | 2013 | 0.5x | 0.3x | Multiple pools, ~€0.40-0.80/unit |
| VP9 | 2013 | 0.55x | 0.4x | Royalty-free (Google) |
| AV1 | 2018 | 0.45x | 0.15x | Royalty-free (AOM) |
| VVC/H.266 | 2020 | 0.35x | 0.05x | Patent pools forming |

These numbers are approximate and depend heavily on content type, encoder implementation, and encoding speed. The important pattern: each ITU-T/ISO generation cuts the bitrate by roughly 30 to 50%, but encoding complexity increases dramatically. Moore's Law provides the compute to handle it, albeit with a delay of several years between a codec's standardization and its practical deployment.

11. Practical Encoding: FFmpeg and x264

Theory is useful, but most people interact with H.264 through FFmpeg and the x264 encoder library. Here is how to use them effectively.

Basic Encoding

Encode a video file to H.264, spelling out x264's default quality and speed settings explicitly:

ffmpeg -i input.mp4 -c:v libx264 -crf 23 -preset medium -c:a aac -b:a 128k output.mp4
  • -c:v libx264: Use the x264 encoder
  • -crf 23: Constant Rate Factor, the primary quality control
  • -preset medium: Speed/compression tradeoff
  • -c:a aac -b:a 128k: Encode audio as AAC at 128 kbit/s

CRF: Constant Rate Factor

CRF is the recommended rate control mode for single-pass encoding when file size is not a hard constraint. It targets a constant perceptual quality by adjusting the QP frame by frame.

# High quality, larger file
ffmpeg -i input.mp4 -c:v libx264 -crf 18 -preset slow output_hq.mp4
 
# Medium quality, balanced
ffmpeg -i input.mp4 -c:v libx264 -crf 23 -preset medium output_med.mp4
 
# Lower quality, smaller file
ffmpeg -i input.mp4 -c:v libx264 -crf 28 -preset fast output_low.mp4

CRF values typically used in production:

| CRF | Quality | Typical use |
| --- | --- | --- |
| 0 | Lossless | Archival, intermediate editing |
| 15-18 | Visually lossless | Film production, high-end streaming |
| 19-23 | High quality | Premium streaming (Netflix, etc.) |
| 24-28 | Good quality | General streaming, social media |
| 29-35 | Acceptable | Low-bandwidth streaming, mobile |
| 36-51 | Low quality | Thumbnails, previews |

CRF 23 is the x264 default and is a reasonable starting point for most content. CRF 18 is often cited as "visually lossless" for most content, meaning the compression artifacts are imperceptible to most viewers on most displays.

Presets

The preset controls how much CPU time the encoder spends optimising compression. Slower presets produce smaller files at the same quality but take longer to encode.

# Available presets from fastest to slowest:
# ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow, placebo
 
# Demonstration of the speed/size tradeoff:
ffmpeg -i input.mp4 -c:v libx264 -crf 23 -preset ultrafast output_fast.mp4
ffmpeg -i input.mp4 -c:v libx264 -crf 23 -preset veryslow output_small.mp4

Typical compression differences relative to medium at the same CRF:

| Preset | File size (relative to medium) | Encode speed (relative to medium, higher is faster) |
| --- | --- | --- |
| ultrafast | +80% to +100% | 10x |
| superfast | +40% to +60% | 6x |
| veryfast | +20% to +30% | 4x |
| faster | +10% to +15% | 2.5x |
| fast | +5% to +8% | 1.7x |
| medium | baseline | 1.0x |
| slow | -3% to -5% | 0.5x |
| slower | -5% to -8% | 0.25x |
| veryslow | -8% to -12% | 0.1x |
| placebo | -9% to -13% | 0.02x |

The placebo preset is almost never worth using. It achieves marginal compression improvement over veryslow at 5 times the encoding time. The practical sweet spot for offline encoding is slow or slower. For live encoding or real-time transcoding, veryfast or faster is typical.

Profile and Level Selection

# Baseline profile (maximum compatibility, no B-frames or CABAC)
ffmpeg -i input.mp4 -c:v libx264 -profile:v baseline -level 3.0 output_baseline.mp4
 
# Main profile (B-frames and CABAC enabled)
ffmpeg -i input.mp4 -c:v libx264 -profile:v main -level 4.0 output_main.mp4
 
# High profile (8x8 transform, custom quant matrices)
ffmpeg -i input.mp4 -c:v libx264 -profile:v high -level 4.0 output_high.mp4

Two-Pass Encoding for Target Bitrate

When you need to hit a specific file size (for example, fitting a 90-minute lecture recording into 700 MB for distribution on a university platform), two-pass encoding is the way to go.

Target bitrate calculation:

Target file size: 700 MB = 5,600 Mbit
Duration: 90 minutes = 5,400 seconds
Audio bitrate: 128 kbit/s = 0.128 Mbit/s
Audio total: 0.128 × 5400 = 691 Mbit
Video budget: 5,600 - 691 = 4,909 Mbit
Video bitrate: 4,909 / 5,400 ≈ 909 kbit/s
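The same arithmetic can be wrapped in a small helper so it is easy to redo for other sizes and durations. This is a minimal sketch that follows the calculation above (1 MB counted as 8 Mbit); the function name and the `overhead_fraction` safety margin for container overhead are assumptions of this example, not part of FFmpeg.

```python
def video_bitrate_kbps(target_mb, duration_s, audio_kbps, overhead_fraction=0.0):
    """Video bitrate in kbit/s that fits a target file size.

    target_mb         -- target file size in MB (1 MB = 8,000 kbit, as above)
    duration_s        -- duration in seconds
    audio_kbps        -- audio bitrate in kbit/s
    overhead_fraction -- hypothetical share of the file reserved for
                         container overhead (0.0 reproduces the sums above)
    """
    total_kbit = target_mb * 8 * 1000 * (1.0 - overhead_fraction)
    audio_kbit = audio_kbps * duration_s
    return (total_kbit - audio_kbit) / duration_s

# The 90-minute lecture squeezed into 700 MB with 128 kbit/s audio.
print(round(video_bitrate_kbps(700, 90 * 60, 128)))        # 909
print(round(video_bitrate_kbps(700, 90 * 60, 128, 0.01)))  # 899
```

Reserving a small margin, as in the second call, is one reason to round the computed figure down slightly when choosing the value passed to the encoder.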

The encoding commands:

# Pass 1: analyze the video, produce a stats file
ffmpeg -i lecture.mp4 -c:v libx264 -b:v 900k -pass 1 -preset slow \
  -an -f null /dev/null
 
# Pass 2: encode using the analysis from pass 1
ffmpeg -i lecture.mp4 -c:v libx264 -b:v 900k -pass 2 -preset slow \
  -c:a aac -b:a 128k lecture_compressed.mp4

In the first pass, FFmpeg analyses the video to determine the complexity of each scene and produces a log file (ffmpeg2pass-0.log). In the second pass, it allocates bits optimally: more bits to complex scenes (fast motion, detailed textures) and fewer bits to simple scenes (static slides, talking head). This produces much better quality than single-pass CBR at the same average bitrate.

Rate Control Modes

| Mode | Flag | Description | Use case |
| --- | --- | --- | --- |
| CRF | -crf N | Constant quality, variable bitrate | Offline encoding, quality priority |
| ABR | -b:v Nk | Average bitrate, single pass | Quick encodes with size constraint |
| 2-pass | -b:v Nk -pass 1/2 | Target bitrate with scene analysis | Precise file size targeting |
| CBR | -b:v Nk -minrate Nk -maxrate Nk -bufsize Nk | Constant bitrate | Broadcast, streaming segments |
| CQP | -qp N | Constant QP, no rate adaptation | Testing, research |

CRF is the right choice for the large majority of encoding tasks. If you are encoding for streaming and need bitrate constraints, use CRF with -maxrate and -bufsize:

# CRF with VBV (Video Buffering Verifier) constraints
# Target quality CRF 23, but never exceed 5 Mbit/s
ffmpeg -i input.mp4 -c:v libx264 -crf 23 -maxrate 5000k -bufsize 10000k \
  -preset slow -c:a aac -b:a 128k output_constrained.mp4
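To see what -maxrate and -bufsize constrain, it helps to model the decoder's input buffer as a leaky bucket: it fills at the maximum rate and each frame drains its own size when it is decoded. The sketch below is a deliberately simplified model, not x264's actual VBV implementation. It assumes the buffer starts full, and the function name is hypothetical.

```python
def first_vbv_underflow(frame_bits, fps, maxrate_bps, bufsize_bits):
    """Return the index of the first frame that underflows the buffer, or None.

    Simplified leaky-bucket model (an assumption of this sketch):
    the buffer starts full, each frame removes its size in bits, and
    maxrate_bps / fps bits arrive between consecutive frames.
    """
    fill_per_frame = maxrate_bps / fps
    fullness = float(bufsize_bits)
    for index, size in enumerate(frame_bits):
        if size > fullness:
            return index  # the frame's bits have not all arrived in time
        fullness = min(bufsize_bits, fullness - size + fill_per_frame)
    return None

MAXRATE = 5_000_000   # 5 Mbit/s, as in -maxrate 5000k
BUFSIZE = 10_000_000  # 10 Mbit, as in -bufsize 10000k

steady = [200_000] * 100  # exactly 5 Mbit/s at 25 fps
burst = [400_000] * 100   # a sustained 10 Mbit/s burst at 25 fps

print(first_vbv_underflow(steady, 25, MAXRATE, BUFSIZE))  # None
print(first_vbv_underflow(burst, 25, MAXRATE, BUFSIZE))   # 49
```

In this model the 10 Mbit buffer absorbs a burst at twice the maximum rate for about two seconds before running dry. A larger -bufsize lets the encoder spend more bits on short complex passages, at the cost of more buffering on the player side.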

Inspecting an Encoded File

After encoding, you can inspect the result:

# Show codec details for the video stream
ffprobe -v error -select_streams v:0 \
  -show_entries stream=codec_name,profile,level,width,height,r_frame_rate,bit_rate \
  -of default=noprint_wrappers=1 output.mp4
 
# Show frame types and sizes (video frames only)
ffprobe -v error -select_streams v:0 -show_entries frame=pict_type,pkt_size \
  -of csv=p=0 output.mp4 | head -60

A typical output might show:

I,185234
P,23456
B,8234
B,7891
P,21345
B,9012
B,8456

This confirms the expected pattern: large I-frames, medium P-frames, small B-frames.
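For longer files it is easier to summarise the listing than to read it. The following sketch averages the frame size per picture type from CSV text like the sample above. The function name is made up for this example, and it accepts the two fields in either order, since ffprobe may not print columns in the order they were requested.

```python
from collections import defaultdict

def summarize_frames(csv_text):
    """Average frame size in bytes per picture type (I, P, B).

    Expects one frame per line with two comma-separated fields: the
    picture type and the packet size, in either order. Other lines are skipped.
    """
    sizes = defaultdict(list)
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",") if f.strip()]
        if len(fields) != 2:
            continue
        pict_type, size = fields
        if pict_type.isdigit():  # fields arrived as size,type
            pict_type, size = size, pict_type
        if not size.isdigit():
            continue
        sizes[pict_type].append(int(size))
    return {t: sum(v) / len(v) for t, v in sizes.items()}

sample = """I,185234
P,23456
B,8234
B,7891
P,21345
B,9012
B,8456"""

averages = summarize_frames(sample)
for frame_type in "IPB":
    print(frame_type, averages[frame_type])
```

On the sample listing the I-frame is roughly eight times the size of an average P-frame, which in turn is between two and three times the size of an average B-frame.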

12. Putting It All Together: The Encoding Pipeline

Here is the complete H.264 encoding pipeline for a single macroblock in a P-frame, from raw pixels to compressed bits:

        Raw pixels
             │
             ▼
┌─────────────────────────┐
│  Motion Estimation      │  Search reference frame for best match.
│  (Full/Diamond/Hex)     │  Produce motion vector (dx, dy).
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│  Motion Compensation    │  Reconstruct prediction from reference
│                         │  frame using motion vector.
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│  Residual Computation   │  residual = original - prediction
│                         │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│  Forward DCT (4x4)      │  Convert spatial residual to frequency
│                         │  coefficients.
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│  Quantization           │  Divide by Qstep, round to integer.
│  (QP-controlled)        │  This is the lossy step.
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│  Zig-Zag Scan           │  Serialize 2D coefficients to 1D.
│                         │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│  Entropy Coding         │  CAVLC or CABAC compression.
│  (CAVLC or CABAC)       │  Output bitstream.
└────────────┬────────────┘
             │
             ▼
         Bitstream
 
   ┌─── Reconstruction loop (for reference frames) ─────┐
   │                                                    │
   │  Dequantize → Inverse DCT → Add prediction         │
   │  → Deblocking filter → Store in DPB                │
   │                                                    │
   └────────────────────────────────────────────────────┘

The reconstruction loop is essential. The encoder must maintain a copy of exactly what the decoder will produce (including quantization error and deblocking), because future frames are predicted from the decoder's output, not from the original pixels. If the encoder used the original pixels as references, the prediction error would accumulate over time (a phenomenon called "drift"), and quality would degrade catastrophically over the length of a GOP.

This is also why the deblocking filter is "in-loop": it modifies the reference before it is used for prediction, so both encoder and decoder apply the same filter and stay in sync.
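Drift is easy to reproduce in a toy model. The sketch below reduces each "frame" to a single number and predicts it from the previous one, once using the reconstructed value (closed loop, as H.264 does) and once using the original value (open loop). It is an illustrative scalar model with made-up function names, not H.264 code: there is no transform, motion, or deblocking, and the first sample is assumed to be transmitted exactly.

```python
import math

def quantize(value, step):
    """Round to the nearest multiple of step (halves round up)."""
    return math.floor(value / step + 0.5) * step

def decode(first, residuals):
    """Decoder: add each received residual to the previous reconstruction."""
    recon = [first]
    for r in residuals:
        recon.append(recon[-1] + r)
    return recon

def encode_closed_loop(signal, step):
    """Predict from the reconstructed previous sample (what the decoder has)."""
    residuals, recon = [], signal[0]
    for x in signal[1:]:
        r = quantize(x - recon, step)
        residuals.append(r)
        recon += r  # track exactly what the decoder will reconstruct
    return residuals

def encode_open_loop(signal, step):
    """Predict from the original previous sample (which the decoder lacks)."""
    return [quantize(x - prev, step) for prev, x in zip(signal, signal[1:])]

signal = [100 + 3 * k for k in range(20)]  # brightness rising by 3 per frame
step = 8                                   # quantizer step size

closed = decode(signal[0], encode_closed_loop(signal, step))
opened = decode(signal[0], encode_open_loop(signal, step))

closed_err = max(abs(a - b) for a, b in zip(signal, closed))
open_err = max(abs(a - b) for a, b in zip(signal, opened))
print(closed_err, open_err)  # 4 57
```

With the closed loop, the error never exceeds half the quantizer step, because each new residual also corrects whatever error the previous frame left behind. With the open loop, every residual of 3 quantizes to zero, the decoder never moves, and the error grows with every frame.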

The Compression Stack in Perspective

Video compression is a stack of techniques, each contributing a piece of the overall compression ratio:

| Technique | Compression contribution | Type |
| --- | --- | --- |
| Chroma subsampling (4:2:0) | 2:1 | Lossy (perceptually near-lossless) |
| Motion compensation (inter prediction) | 5:1 to 50:1 | Lossless (residual is exact) |
| Intra prediction | 2:1 to 5:1 | Lossless (residual is exact) |
| DCT transform | 1:1 (enables quantization) | Lossless (invertible) |
| Quantization | 2:1 to 20:1 | Lossy |
| Entropy coding | 2:1 to 4:1 | Lossless |
| Deblocking filter | 1.05:1 to 1.10:1 (indirect) | Lossy (but improves future prediction) |

The multiplicative effect of these stages is what gets you from 186 MB/s raw to 600 KB/s compressed. Not all of them apply to every frame or every macroblock, and the exact ratios depend on the content. A static surveillance feed compresses much better than a fast-paced action scene. A cartoon with flat colours compresses better than a nature documentary with fine textures. The encoder adapts continuously, adjusting motion search, partition sizes, QP, and prediction modes to match the local characteristics of the video.

H.264 was designed in an era when a "fast" computer had a 2 GHz Pentium 4. The fact that it achieves 100:1 to 1000:1 compression ratios while being decodable in real-time on hardware that cost €200 in 2005 is a genuine engineering achievement. The integer DCT, the context-adaptive entropy coding, the in-loop deblocking, the variable block size motion compensation: these are not incremental improvements. They are the result of decades of signal processing research, implemented with an obsessive focus on practical deployability.

Every time you watch a video on your phone, the decoding half of this pipeline runs thirty or more times per second: parsing the bitstream, applying thousands of motion vectors, running inverse transforms on thousands of blocks, applying deblocking filters at tens of thousands of boundaries, and assembling the result into a frame that has to be ready within 33 milliseconds. You never think about it. That is the point.