How DRAM Actually Works, From A Capacitor To Rowhammer
There is a running joke in the hardware industry that DRAM should not work. A single cell is one tiny transistor wired to an even tinier capacitor, and that capacitor holds one bit of information as a handful of electrons (roughly 40,000 electrons on modern processes) that leak away in milliseconds. Left alone, the cell forgets what it stored almost immediately. It cannot be read without destroying the value it held. And a modern 32 GiB DIMM contains about 275 billion of these cells, each of which has to be refreshed at least once every 64 milliseconds, each of which has to be accessible with nanosecond latency, and all of which have to coexist on a handful of silicon dies in a device that costs less than a nice dinner.
That any of this works at all is the product of decades of process engineering, careful circuit design, and a memory controller dance that hides a stunning amount of complexity behind a deceptively simple interface. To the CPU, main memory looks like a flat array of bytes that you can read and write at random. Underneath, that array is split across channels, banks, rows, and columns, with a timing specification so precise that a read-modify-write of a single byte can involve thirty or forty discrete events on three different voltage rails.
This article walks through how DRAM actually works, from a single cell up to a multi-channel DIMM. The goal is to build a concrete mental model that makes sense of the jargon (RAS, CAS, banks, ranks, refresh, rowhammer, ECC) and of the observable performance behaviour of real memory systems. By the end it should be clear why random access to DRAM is not actually random, why refresh is both a correctness requirement and a performance tax, why Rowhammer is possible at all, and why the modern memory system is a remarkably honest trade-off between density, speed, and reliability.
One Cell: A Capacitor On A Transistor
Start with the smallest unit. A DRAM cell is a single access transistor (an NMOS device in most modern processes) with its drain connected to a storage capacitor. The capacitor stores a bit as a voltage: roughly Vcc (the supply voltage, typically 1.2 V for DDR4 or 1.1 V for DDR5) for a logical 1, and roughly 0 V for a logical 0. The access transistor's gate is wired to a word line that runs horizontally across the array. Its source is wired to a bit line that runs vertically. When the word line is asserted, the transistor turns on, connecting the capacitor to the bit line. When the word line is de-asserted, the transistor isolates the capacitor and the stored charge is left to slowly leak away.
That is the whole cell. One transistor, one capacitor, two wires. The layout is so dense that modern DRAM processes (around 12 nm half-pitch in 2024) pack hundreds of millions of cells per square millimetre of silicon. The capacitor is typically built as a deep trench into the substrate, or as a stacked cylinder above the transistor, to get enough capacitance in a small footprint. The capacitance is tiny, around 10 to 20 femtofarads, and the stored charge is in the region of 40,000 to 100,000 electrons. A single alpha particle from background radiation can dump enough charge into the capacitor to flip a bit, which is why ECC exists.
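The electron counts quoted above follow directly from Q = C × V. A minimal sketch of that arithmetic, assuming a 1.1 V supply; the function name is illustrative:

```python
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs per electron (exact SI value)

def stored_electrons(capacitance_farads: float, voltage_volts: float) -> float:
    """Electrons held on a capacitor charged to the given voltage (Q = C * V)."""
    return capacitance_farads * voltage_volts / ELEMENTARY_CHARGE

if __name__ == "__main__":
    supply = 1.1  # volts, assumed for illustration
    for c_ff in (10, 20):
        full = stored_electrons(c_ff * 1e-15, supply)
        # The sense amplifier only sees the difference from the Vcc/2
        # reference, so the usable signal is half the full-swing charge.
        half = stored_electrons(c_ff * 1e-15, supply / 2)
        print(f"{c_ff} fF: {full:,.0f} electrons full swing, {half:,.0f} vs Vcc/2")
```

The full-swing and half-swing figures bracket the 40,000 to 100,000 range, which is why both numbers show up in the literature.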
Reading the cell is destructive. To read, the memory controller precharges the bit line to halfway between 0 and Vcc (typically Vcc/2), then asserts the word line. The access transistor turns on, and the capacitor's charge is shared with the much larger capacitance of the bit line. If the capacitor held a 1, the bit line voltage rises very slightly (a few tens of millivolts). If it held a 0, it falls very slightly. Either way, the change is tiny, because the bit line capacitance is several times to an order of magnitude larger than the cell capacitance.
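The size of that voltage change comes from charge conservation between the two capacitors. A minimal sketch, assuming a 15 fF cell, a 150 fF bit line, and a 1.1 V supply; all three values are illustrative, not taken from a datasheet:

```python
def bitline_swing(v_cell: float, v_precharge: float,
                  c_cell: float, c_bitline: float) -> float:
    """Change in bit line voltage after the cell shares its charge with it.

    Total charge is conserved, so the final voltage is the
    capacitance-weighted average of the two starting voltages.
    """
    v_final = (c_cell * v_cell + c_bitline * v_precharge) / (c_cell + c_bitline)
    return v_final - v_precharge

if __name__ == "__main__":
    vcc = 1.1                             # assumed supply voltage
    c_cell, c_bitline = 15e-15, 150e-15   # assumed capacitances
    for label, v_cell in (("stored 1", vcc), ("stored 0", 0.0)):
        dv = bitline_swing(v_cell, vcc / 2, c_cell, c_bitline)
        print(f"{label}: bit line moves by {dv * 1000:+.1f} mV")
```

The swing shrinks as the bit line gets longer (more capacitance) or the cell gets smaller, which is the basic tension that limits how many cells can hang off one bit line.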
A sense amplifier at the end of the bit line compares the bit line voltage to a reference (typically the precharged Vcc/2 level) and drives the bit line to the full Vcc or 0 V based on the comparison. This does two things at once. It amplifies the tiny voltage change into a full logic level, so the external world can read the bit. And it drives the bit line back to a full voltage level, which rewrites the original charge back into the capacitor, because the access transistor is still on during the sense operation. Reading therefore restores the cell's charge, which is necessary because the sharing step discharged the capacitor.
Writing is simpler. The memory controller drives the bit line to the desired voltage (0 or Vcc) while the word line is asserted. The access transistor passes the voltage to the capacitor, and the new charge is stored. When the word line is de-asserted, the capacitor is isolated and the value is committed.
All of this happens on every memory access, which is why DRAM access times are dominated by the sense operation and the timing budget around it. A single cell read is not an instantaneous lookup; it is a precisely choreographed sequence of voltage transitions that takes tens of nanoseconds from start to finish.
Organising Cells: Rows, Columns, And Banks
A single cell is nothing. What matters is how you organise billions of them into a useful memory. The basic structure is a two-dimensional grid, where word lines run horizontally and bit lines run vertically. A word line asserts a full row of cells onto their respective bit lines, and the bit lines are all sensed in parallel by an array of sense amplifiers sitting at the bottom of the grid.
That array of sense amplifiers is called the row buffer, and it is the key performance structure in DRAM. A single row on a modern DIMM is typically 8 KiB wide (or 16 KiB for some configurations), which means 65,536 cells or so per row. When the memory controller issues an ACTIVATE command for a particular row, the row's word line is asserted, every cell in the row dumps its charge onto a bit line, the sense amplifiers fire in parallel, and the whole 8 KiB of row contents ends up sitting in the row buffer. Reads can then pull out specific columns (typically 64 or 128 bits at a time) from the row buffer at much lower latency than the initial activate.
The difference between a read that hits the current row buffer and one that requires a new activate is about three to four times in latency. Hitting the row buffer ("row buffer hit") takes maybe 15 ns for a CAS command. Missing the row buffer because a different row is currently open (a "row buffer miss", more precisely a row conflict) costs a precharge (closing the currently-open row, which disconnects it from the sense amplifiers once its contents have been restored into the cells and then resets the bit lines), an activate (opening the new row), and finally the column access. Total latency: around 45 to 60 ns.
This is the single most important performance fact about DRAM. Access patterns that touch the same row repeatedly are fast. Access patterns that jump between rows on the same bank are slow. The memory controller can hide some of this by predicting row closes or by leaving rows open speculatively, but the underlying physics is unforgiving.
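The hit, miss, and conflict cases can be put into a toy model of a single bank. This is a sketch under stated assumptions: the 15 ns column access is the figure from the text, the activate and precharge are assumed to cost the same, and addresses are assumed to map linearly onto 8 KiB rows (real controllers use more elaborate mappings).

```python
T_CAS_NS = 15          # column access on an open row (figure from the text)
T_RCD_NS = 15          # activate; assumed equal to the CAS figure
T_RP_NS = 15           # precharge; assumed equal to the CAS figure
ROW_BYTES = 8 * 1024   # one row buffer's worth of data

def access_latency_ns(addresses) -> int:
    """Total latency of back-to-back accesses to one bank, rows left open."""
    open_row = None
    total = 0
    for addr in addresses:
        row = addr // ROW_BYTES
        if row == open_row:
            total += T_CAS_NS                        # row buffer hit
        elif open_row is None:
            total += T_RCD_NS + T_CAS_NS             # bank idle: activate, then read
        else:
            total += T_RP_NS + T_RCD_NS + T_CAS_NS   # row conflict
        open_row = row
    return total

if __name__ == "__main__":
    n = 1024
    sequential = range(0, n * 64, 64)               # consecutive 64-byte lines
    row_hopping = range(0, n * ROW_BYTES, ROW_BYTES)  # a new row every access
    print("sequential :", access_latency_ns(sequential) / n, "ns per access")
    print("row-hopping:", access_latency_ns(row_hopping) / n, "ns per access")
```

The model ignores bank parallelism and command overlap, so it overstates absolute latencies, but the ratio between the two patterns is the point.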
Banks are the answer to the serialisation problem. A single DIMM has multiple banks, each of which has its own row buffer and can hold a different row open simultaneously. DDR4 has 16 banks per rank (organised as 4 bank groups of 4 banks each). DDR5 has 32 banks per rank (8 bank groups of 4 banks). The memory controller can interleave accesses across banks so that while one bank is waiting for a precharge or an activate, another bank is streaming data. This gives the illusion of parallelism even though each individual bank is strictly serial.
Bank groups were introduced in DDR4 to work around a specific timing problem. Bank-to-bank timing within the same bank group has to respect a longer tCCD_L (column-to-column delay, long) because they share some internal circuitry, while bank-to-bank between different groups uses the shorter tCCD_S. Memory controllers that can spread access across bank groups get higher sustained bandwidth, and high-performance workloads go out of their way to arrange data so that consecutive accesses hit different bank groups.
Ranks are another layer up. A rank is a set of chips on a DIMM that share the same chip select line. A DIMM can have one or two ranks (occasionally four on older server hardware). Accessing a different rank requires switching the chip select, which introduces a small timing penalty but also allows the controller to hide some operations. When you see a "2R" or "1Rx4" label on a DIMM, the first number is the number of ranks.
Channels are the top of the hierarchy. A channel is an independent 64-bit (or 72-bit with ECC) data bus between the memory controller and one or more DIMMs. Modern desktop CPUs have 2 or 4 channels, server CPUs have 8 or 12 channels, and each channel can sustain reads and writes independently of the others. The total memory bandwidth of a system is the sum of the channels' individual bandwidths, which is why DDR5-6400 on a 12-channel Xeon can reach 600 GB/s of aggregate bandwidth while the same DIMM in a dual-channel desktop delivers only 100 GB/s.
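The aggregate figures above are just transfer rate times bus width times channel count. A minimal sketch of that calculation; the function name is illustrative, and the result is a theoretical peak that real workloads do not reach:

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float,
                        bus_width_bits: int = 64,
                        channels: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s (decimal gigabytes).

    ECC bits are excluded: a 72-bit ECC channel still carries 64 data bits.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9

if __name__ == "__main__":
    print("DDR5-6400, 12 channels:", peak_bandwidth_gb_s(6400, channels=12), "GB/s")
    print("DDR5-6400,  2 channels:", peak_bandwidth_gb_s(6400, channels=2), "GB/s")
    print("DDR4-3200,  1 channel :", peak_bandwidth_gb_s(3200), "GB/s")
```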
The Memory Controller: Commands, Timing, And State
The CPU does not talk to DRAM directly. It talks to a memory controller, which is integrated into the CPU in modern systems (since roughly 2008 for Intel and 2003 for AMD) and sits on the same die as the cores. The memory controller accepts load and store requests from the cache hierarchy, reorders them, issues the appropriate DRAM commands, and funnels the data back to the caches.
The command set DDR4 and DDR5 expose is small but strict:
- ACTIVATE (ACT): open a row in a specific bank by asserting its word line and sensing its contents into the row buffer. Must be preceded by a PRECHARGE if a different row is currently open in that bank.
- READ (RD) and WRITE (WR): read or write a column within the currently-open row. Each command moves a burst of 8 (DDR4) or 16 (DDR5) transfers, each transfer carrying the full channel width of data.
- PRECHARGE (PRE): close the currently-open row in a bank by driving the bit lines back to their precharge voltage. Discards the row buffer and prepares the bank for a new activate.
- REFRESH (REF): the memory controller issues a refresh command periodically, and the DRAM chip internally refreshes a batch of rows by reactivating them and letting the sense amplifiers restore their charges. Typical rate: one refresh command every 7.8 microseconds, refreshing roughly 1/8192 of the rows per command, so every row is refreshed once every 64 milliseconds (the "retention time").
- AUTOPRECHARGE: a variant of READ or WRITE that automatically issues a PRECHARGE after the data transfer completes, saving the controller a separate command.
Between these commands are timing constraints that the controller must respect on pain of corrupted data. The JEDEC DDR4 spec defines around two dozen timing parameters, with names like tRCD (row address to column address delay), tCAS (column address strobe latency, usually written CL), tRP (row precharge time), tRAS (row active time), tRFC (refresh cycle time), and tWR (write recovery time). For a typical DDR4-3200 DIMM, most of these land in the range of 10 to 35 nanoseconds each (tRFC, at several hundred nanoseconds, is the notable exception), and they must all be satisfied for every access. The memory controller contains a substantial piece of logic dedicated to tracking the state of every bank (open row, precharged, refreshing) and scheduling commands so that no constraint is violated.
The famous CAS latency number on a DIMM (the "16" in "CL16") is tCAS measured in bus clock cycles. DDR4-3200 performs 3,200 million transfers per second on a 1,600 MHz clock, so one bus clock is 0.625 ns and CL16 is 10 ns from the READ command being issued to the first data beat appearing on the bus. Add tRCD for the activate (if the row was not already open) and tRP for the precharge (if a different row was open), and a full random-access latency on DDR4-3200 CL16-16-16 is about 30 ns at the DRAM pins; the latency a core observes is higher, because the controller's queues and the on-chip fabric add their own delays. Once the burst starts, subsequent beats come every half-clock (312.5 picoseconds on DDR4-3200, because data is transferred on both edges of the clock), so an 8-beat burst takes 2.5 ns total.
This asymmetry, between the roughly 30 ns of setup and the roughly 2.5 ns of actual data transfer, is why DRAM bandwidth is much higher than DRAM random-access speed. For sequential access, the controller can overlap setup for the next burst with data transfer of the current one, and the bandwidth approaches the theoretical maximum of the channel (25.6 GB/s per channel at DDR4-3200). For truly random access, where every load misses the row buffer and conflicts with a different row, each bank completes at most one access per 30 ns or so (fewer once tRAS is accounted for), which is roughly 20 to 30 million accesses per second per bank, far below the channel bandwidth.
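Converting cycle counts to nanoseconds is easy to get wrong by a factor of two, because a DDR interface transfers data on both clock edges and the clock therefore runs at half the MT/s rating. A minimal sketch of the conversion, with illustrative function names:

```python
def clock_period_ns(mega_transfers_per_s: float) -> float:
    """Bus clock period in ns; the clock in MHz is half the MT/s rating."""
    clock_mhz = mega_transfers_per_s / 2
    return 1000.0 / clock_mhz

def cycles_to_ns(cycles: int, mega_transfers_per_s: float) -> float:
    """Convert a timing parameter given in clock cycles to nanoseconds."""
    return cycles * clock_period_ns(mega_transfers_per_s)

def burst_time_ns(beats: int, mega_transfers_per_s: float) -> float:
    """Duration of a data burst; each beat lasts half a clock period."""
    return beats * clock_period_ns(mega_transfers_per_s) / 2

def row_conflict_ns(cl: int, t_rcd: int, t_rp: int,
                    mega_transfers_per_s: float) -> float:
    """Precharge + activate + column access, all given in cycles."""
    return cycles_to_ns(t_rp + t_rcd + cl, mega_transfers_per_s)

if __name__ == "__main__":
    rate = 3200  # DDR4-3200
    print("clock period :", clock_period_ns(rate), "ns")
    print("CL16         :", cycles_to_ns(16, rate), "ns")
    print("row conflict :", row_conflict_ns(16, 16, 16, rate), "ns")
    print("8-beat burst :", burst_time_ns(8, rate), "ns")
```

The same functions work for any speed grade, which makes it easy to see that a higher CL number at a higher transfer rate can still mean the same latency in nanoseconds.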
Refresh: The Correctness Tax
The capacitor leakage problem is managed through refresh. Every row in the DRAM must be reopened and rewritten periodically to refill its capacitors before the charge drops below the sense amplifier's threshold. The JEDEC spec defines a refresh window, based on the retention time every cell is guaranteed to meet, and sets it at 64 ms for standard DDR4 operation across the commercial temperature range, with a tighter 32 ms at the elevated temperatures found inside a hot server (above 85 degrees Celsius). DDR5 halves both figures, to a nominal 32 ms and 16 ms above 85 degrees Celsius, and adds a more granular temperature-controlled refresh mechanism.
The memory controller fulfils the refresh obligation by issuing a REFRESH command once every tREFI. On DDR4 this is 7.8 microseconds. Each REFRESH command causes the DRAM chip to refresh a batch of rows internally, picking the next batch from an internal counter. The refresh window is divided into 8,192 such commands, each covering 1/8192 of the rows, so the controller has to issue 8,192 refreshes over 64 ms, which works out to one every 7.8 microseconds. The arithmetic is calibrated so that the entire memory is refreshed inside the retention window.
The refresh command is not free. While the chip is refreshing, the banks being refreshed are unavailable to the memory controller. The time a refresh blocks is called tRFC and is typically around 350 ns on DDR4 for an 8 Gb chip. During that 350 ns, the controller cannot issue any new commands to the affected banks. If you divide tRFC by the refresh interval, you get the fraction of time lost to refresh: 350 ns every 7.8 microseconds is about 4.5 percent. On modern high-density chips (32 Gb DDR5), tRFC is longer, and the refresh overhead climbs to 6 or 7 percent. That is a meaningful tax, and a lot of DRAM research is about reducing it.
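Both refresh numbers, the interval between commands and the share of time lost, drop out of two divisions. A minimal sketch using the figures from the text; the function names are illustrative:

```python
def refresh_interval_us(window_ms: float = 64.0,
                        commands_per_window: int = 8192) -> float:
    """Average spacing between REFRESH commands (tREFI) in microseconds."""
    return window_ms * 1000.0 / commands_per_window

def refresh_overhead(t_rfc_ns: float, t_refi_us: float) -> float:
    """Fraction of time the refreshed banks are unavailable."""
    return t_rfc_ns / (t_refi_us * 1000.0)

if __name__ == "__main__":
    t_refi = refresh_interval_us()           # 64 ms window, 8,192 commands
    print(f"tREFI    = {t_refi:.4f} us")
    print(f"overhead = {refresh_overhead(350, t_refi):.2%}")
    # Halving the window (hot operation) doubles the overhead.
    hot = refresh_interval_us(window_ms=32.0)
    print(f"overhead at 32 ms = {refresh_overhead(350, hot):.2%}")
```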
DDR4 introduced Fine Granularity Refresh, which allows the controller to issue smaller refreshes more often, and DDR5 builds on it with same-bank refresh. Instead of pausing every bank in a rank for the full tRFC once every tREFI, the controller can pause one bank in each bank group for a shorter period more frequently, which leaves the remaining banks available and reduces the worst-case latency spike that a refresh causes. It is a small improvement, but on latency-sensitive workloads it matters.
There is also Rowhammer mitigation to consider. We will come back to this in a later section, but it forces the controller to refresh certain "aggressor-adjacent" rows more often than the nominal 64 ms, which adds to the refresh overhead beyond the basic retention requirement.
The Row Buffer Is A Cache, Almost
If you squint, the row buffer looks a lot like a cache: it holds a subset of the DRAM contents in fast-access form, and you get a hit if the next access is to the same row, a miss if it is to a different row. And because memory controllers can be configured to leave rows open after a read or close them after a read, there is a policy decision to be made ("open row policy" versus "closed row policy") that mirrors cache replacement policies.
Open row policy keeps the row buffer holding its current row after the access completes, on the assumption that a subsequent access might hit the same row. This is the right choice for workloads with locality, like array traversals or database scans. Closed row policy precharges the bank immediately after the access, on the assumption that the next access will be to a different row and it is better to hide the precharge latency now than to pay it later. This is the right choice for random-access workloads.
Most memory controllers use an adaptive policy that leaves the row open for a short timeout after each access, and closes it if no new access arrives within the timeout. This tries to get the best of both worlds, and in practice it works well for mixed workloads.
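The three policies differ only in how long the row stays open after an access, so they can be compared with one function and a timeout parameter. This is a toy single-bank model, not how any particular controller is implemented; the 15 ns timings and the request traces are assumptions for illustration.

```python
import math

T_CAS_NS, T_RCD_NS, T_RP_NS = 15, 15, 15   # assumed, illustrative timings

def total_latency_ns(requests, timeout_ns: float) -> float:
    """Sum of request latencies for one bank.

    requests   -- (arrival_time_ns, row) pairs, sorted by arrival time
    timeout_ns -- 0 models a closed-row policy, math.inf an open-row policy,
                  anything in between an adaptive timeout
    """
    open_row = None      # row currently held in the row buffer, if any
    bank_free_at = 0.0   # time at which the previous access finished
    total = 0.0
    for arrival, row in requests:
        start = max(arrival, bank_free_at)
        close_at = bank_free_at + timeout_ns
        if open_row is not None and start >= close_at:
            # Timeout expired: the precharge already ran in the background.
            start = max(start, close_at + T_RP_NS)
            open_row = None
        if open_row == row:
            service = T_CAS_NS                         # hit
        elif open_row is None:
            service = T_RCD_NS + T_CAS_NS              # bank precharged
        else:
            service = T_RP_NS + T_RCD_NS + T_CAS_NS    # conflict
        bank_free_at = start + service
        total += bank_free_at - arrival
        open_row = row
    return total

if __name__ == "__main__":
    # A short burst to one row, then a jump to another row much later.
    trace = [(0, 1), (40, 1), (200, 2)]
    for name, timeout in (("open", math.inf), ("closed", 0), ("adaptive 50 ns", 50)):
        print(f"{name:15s}: {total_latency_ns(trace, timeout):.0f} ns")
```

On this particular trace the adaptive timeout beats both fixed policies, because it keeps the row open through the burst and hides the precharge during the idle gap. Other traces will favour one of the fixed policies.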
Software can exploit the row buffer explicitly by arranging data structures so that related fields fall on the same row. This is called row affinity, and it is one of the reasons that column-oriented databases (where all values of a single column are stored contiguously) can outperform row-oriented databases on analytical queries: the column layout makes row buffer hits common, while the row layout scatters the interesting values across many different rows.
ECC: Catching Bit Flips Before They Become Bugs
DRAM bits flip. Sometimes it is because an alpha particle from natural radioactivity in the chip's packaging hit a storage capacitor and dumped charge into it. Sometimes it is because a neutron from cosmic ray showers penetrated the packaging and triggered a secondary cascade in the silicon. Sometimes it is because a weak cell has drifted out of spec and loses charge faster than expected. Sometimes it is because of a capacitively-coupled disturbance from a neighbouring row. The exact failure rate depends on process, altitude, temperature, and age, but Google's famous 2009 paper ("DRAM Errors in the Wild") measured rates of 25,000 to 70,000 correctable errors per billion device-hours per megabit in their production fleet, with more than 8 percent of DIMMs seeing at least one correctable error in a year. The errors are far from evenly spread: most DIMMs see none, and a minority see many.
ECC (Error Correcting Code) memory uses redundant storage to detect and correct these errors. The most common scheme is a single-error correct, double-error detect (SECDED) Hamming code that adds 8 parity bits to every 64 data bits. The memory controller computes the parity when writing and recomputes it when reading, and the comparison between the stored and computed parity lets it either correct a single-bit error (by identifying which bit is wrong and flipping it back) or detect a double-bit error (without being able to correct it).
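The single-correct, double-detect behaviour can be shown with a textbook extended Hamming code. The sketch below is illustrative only: real memory controllers typically use a different check-bit layout (Hsiao-style codes are common) and do this in hardware, but the arithmetic gives the same 64 + 8 = 72 bit word.

```python
def encode(data_bits):
    """Extended Hamming encode. Index 0 of the result is the overall parity
    bit; power-of-two indices are Hamming check bits; the rest are data."""
    k = len(data_bits)
    r = 0
    while (1 << r) < k + r + 1:        # smallest r with 2^r >= k + r + 1
        r += 1
    n = k + r
    code = [0] * (n + 1)
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):            # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i                     # check bit p covers positions with bit i set
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code) % 2            # makes the parity of the whole word even
    return code

def decode(code):
    """Return (status, data_bits); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos            # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome <= n:
        code[syndrome] ^= 1            # syndrome names the flipped position
        status = "corrected"
    else:
        return "uncorrectable", None   # even parity but non-zero syndrome
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return status, data

if __name__ == "__main__":
    word = [(i * 7 + 3) % 5 % 2 for i in range(64)]   # arbitrary 64-bit pattern
    stored = encode(word)
    print("codeword length:", len(stored))
    stored[17] ^= 1                                    # simulate one bit flip
    print(decode(stored)[0])
    stored[40] ^= 1                                    # and a second one
    print(decode(stored)[0])
```

A triple-bit error can alias to a valid single-bit correction in this scheme, which is the undetected case the next paragraph refers to.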
SECDED catches the common case and lets servers run for months or years without a memory-induced crash. It does not catch every failure; triple-bit errors can go undetected, and some failure modes affect entire chips, which looks like a burst error that SECDED cannot handle. For those, server hardware uses Chipkill, a scheme where the ECC is designed so that a complete failure of an entire chip on the DIMM can still be corrected. Chipkill achieves this by spreading the data of a single ECC word across multiple chips, so the loss of any one chip is equivalent to the loss of some bits within each word, which the ECC can handle if it is strong enough. Chipkill is IBM's trademark for the technique; Intel calls its version Single Device Data Correction (SDDC), and AMD documentation refers to x4 or x8 chipkill.
DDR5 took an unusual step: every DDR5 DIMM, even non-ECC desktop memory, has on-die ECC. This is different from the system-level ECC used on servers. On-die ECC is internal to the DRAM chip, protecting the array against single-cell failures that arise from aggressive scaling, and the results are hidden from the outside world (the chip presents its contents as if they were perfect). It is essentially a manufacturing reliability feature: the chip can be built with cells that sometimes fail, and the on-die ECC corrects the failures before they are visible. Server-grade DDR5 still adds system-level ECC on top, so a DDR5 RDIMM has both layers of protection.
Rowhammer: The Attack That Should Not Exist
In 2014, Yoongu Kim and colleagues at Carnegie Mellon published a paper titled "Flipping Bits in Memory Without Accessing Them", which demonstrated that by rapidly activating and precharging specific rows on commodity DDR3 DIMMs, you could cause bit flips in physically adjacent rows that you never touched. The effect was reproducible, it worked on almost all DIMMs they tested, and it opened a new category of memory attacks.
The mechanism is a subtle side effect of DRAM scaling. As process sizes shrink, rows are packed closer together, and the electromagnetic coupling between adjacent word lines becomes stronger. Every time a word line is asserted, it disturbs the neighbouring rows, causing a small amount of charge to leak out of the cells attached to them. In normal operation, the disturbance is too small to matter: a cell loses only a minuscule fraction of its charge each time a neighbouring row is activated, and the refresh cycle will restore the charge before it crosses the threshold. Hammer the same row hundreds of thousands of times in a single refresh window, however, and the cumulative charge loss can exceed the threshold, flipping a bit in a neighbouring row you do not own.
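Whether hundreds of thousands of activations fit inside one refresh window is a one-line calculation. A minimal sketch, assuming the 45 to 60 ns row-conflict cost quoted earlier as the time per activation and ignoring the time refresh itself consumes:

```python
def max_activations(refresh_window_ms: float = 64.0,
                    row_cycle_ns: float = 45.0) -> int:
    """Upper bound on activations of one bank between two refreshes of a row."""
    return int(refresh_window_ms * 1e6 // row_cycle_ns)

if __name__ == "__main__":
    for cycle in (45.0, 60.0):   # assumed per-activation cost in ns
        print(f"{cycle:.0f} ns per activation: "
              f"{max_activations(row_cycle_ns=cycle):,} activations per 64 ms")
```

Even the slower figure leaves room for more than a million activations, so an attacker who can keep one bank busy has a budget several times larger than the attack needs.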
The attack is devastating because it breaks the DRAM's isolation property. The memory controller's job is to present each address as an independent value. Rowhammer shows that accesses to one address can corrupt another address, which violates the fundamental assumption on which every memory-based security mechanism is built. If you can flip a bit in a page table entry, you can point your user-space process at a kernel page and read arbitrary kernel memory. If you can flip a bit in a JavaScript garbage collector's type tag, you can trick the engine into treating an object as a different type and break out of the sandbox. Google's Project Zero team demonstrated the page-table attack, together with a Native Client sandbox escape, in March 2015, and browser-based attacks driven from JavaScript followed from academic groups soon after.
DRAM vendors have been playing whack-a-mole with Rowhammer ever since. DDR4 added Target Row Refresh (TRR), a vendor-specific mechanism where the DRAM chip tracks which rows are being hammered and issues extra refreshes to their neighbours. TRR worked for a while, but researchers have repeatedly shown that specific hammering patterns can evade TRR on specific DIMMs, including the TRRespass attack in 2020 that systematically tested activation patterns to find ones TRR missed. DDR5 includes Refresh Management (RFM), a more structured version where the memory controller can signal the DRAM chip to do targeted refreshes, and per-row activation counters that enable smarter tracking. RFM is more robust than TRR in principle, but the cat and mouse continues: in 2023 and 2024 researchers showed Rowhammer variants that work on some DDR5 DIMMs under the right conditions.
The complication is that Rowhammer is fundamentally a physics problem. The only way to eliminate it entirely would be to space the rows further apart or to increase the cell capacitance, both of which directly cut into density and cost. Vendors are not willing to give up density, so the mitigation will always be an adversarial game.
The Rest Of The Stack: DIMMs, LRDIMMs, And HBM
Everything so far has been about the DRAM chips. The DIMMs (Dual Inline Memory Modules) they live on add another layer of complexity. A standard UDIMM (Unbuffered DIMM) wires the chips directly to the memory channel, with a serial presence detect (SPD) EEPROM on board that tells the BIOS how the module is configured. An RDIMM (Registered DIMM) adds a register chip between the address and command lines and the DRAM chips, to buffer the electrical load and allow more chips per channel. An LRDIMM (Load-Reduced DIMM) goes further and adds a buffer to the data lines as well, at the cost of a small latency penalty.
Why does this matter? Because the number of DIMMs you can put on a single channel is limited by the electrical load the channel can drive, and buffering lets you put more. A typical UDIMM channel can drive two DIMMs; registered channels have historically supported up to three, and LRDIMMs raise the number of ranks, and therefore the capacity, that each slot can carry. Server workloads that need terabytes of memory per socket depend on this to reach their capacity targets.
High Bandwidth Memory (HBM) is a different approach entirely. Instead of DIMMs on a bus, HBM stacks DRAM dies directly on top of a logic die, connected by through-silicon vias (TSVs), and the stack is placed on a silicon interposer next to the CPU or GPU. The interface to each stack is 1024 bits wide, split into many independent channels, and it runs at a relatively modest clock rate, but the total bandwidth is enormous: HBM3e (shipping in 2024) reaches 1.2 TB/s per stack. AI accelerators and data-centre GPUs rely on HBM, because they are bandwidth-bound and can afford the higher cost per byte. Intel briefly shipped HBM on the Xeon Max CPUs (Sapphire Rapids) for HPC workloads, and the results were impressive on memory-bound benchmarks like STREAM.
The trade-off is clear: HBM gives you enormous bandwidth in a compact package, at the cost of much higher cost per gigabyte and a much lower total capacity per socket. Regular DDR5 gives you more GB for less money, at the cost of lower bandwidth. Access latency is broadly similar, because both are built from the same kind of DRAM cell. Both will coexist for the foreseeable future.
What You See From Userspace
For an application programmer, all of this complexity is hidden behind a flat pointer interface. You read and write bytes at arbitrary addresses, and the hardware takes care of turning your accesses into the right sequence of activates, column accesses, and precharges. But the underlying structure leaks through in observable performance effects, and understanding them is the difference between "my program runs at 10 GB/s" and "my program runs at 60 GB/s".
The most important effect is row buffer locality. Sequential access to memory is fast because almost every access hits the same row as the previous one. Random access is slow because almost every access misses. A linked list traversal where each node is in a different row incurs the full tRCD + tCAS + tRP cost on every step, which is why linked lists are usually much slower than arrays even when the algorithmic complexity is the same.
The next effect is bank parallelism. A memory controller can have multiple operations in flight on different banks simultaneously, which means that code that deliberately accesses memory in a bank-interleaved pattern (by striding by the bank size) can get higher sustained throughput than code that hammers a single bank. High-performance libraries like BLAS implementations go out of their way to choose tile sizes that maximise bank parallelism on the target hardware.
Channel parallelism is similar but at a coarser grain. A system with 8 memory channels can handle 8 independent access streams at full speed if the streams are mapped to different channels. The memory controller usually interleaves physical addresses across channels at a fine grain (every 256 bytes is common), so this mostly happens automatically, but NUMA-aware code has to be careful to keep allocations on the channels closest to the CPUs that use them.
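How a physical address turns into a channel, bank, row, and column can be illustrated with a simple bit-slicing decoder. The layout below is entirely hypothetical: it interleaves channels every 256 bytes as described above, but real controllers use vendor-specific and often undocumented mappings, frequently XOR-hashing several address bits together.

```python
# Hypothetical field layout, least significant bits first: (name, width in bits).
LAYOUT = [
    ("byte_in_line", 6),   # 64-byte cache line
    ("column_low", 2),     # 4 lines = 256 contiguous bytes per channel
    ("channel", 1),        # 2 channels, interleaved every 256 bytes
    ("bank_group", 2),     # 4 bank groups
    ("bank", 2),           # 4 banks per group
    ("column_high", 5),    # rest of the column address
    ("row", 16),           # 65,536 rows per bank
]

def decode_address(addr: int) -> dict:
    """Split a physical address into DRAM coordinates under LAYOUT."""
    fields = {}
    for name, width in LAYOUT:
        fields[name] = addr & ((1 << width) - 1)
        addr >>= width
    return fields

if __name__ == "__main__":
    for addr in (0x0000, 0x0100, 0x0200, 0x2000, 0x40000):
        f = decode_address(addr)
        print(f"{addr:#08x}: ch={f['channel']} bg={f['bank_group']} "
              f"bank={f['bank']} row={f['row']}")
```

Putting the bank-group and bank bits below the row bits is a deliberate choice in this sketch: a linear sweep through memory then rotates across channels and banks before it ever has to open a second row in the same bank.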
And then there is the refresh tax. A workload that is extremely sensitive to latency tail (like a trading system or a soft real-time renderer) will see occasional spikes of 350 ns when a refresh command is issued while it needed an access to the refreshing bank. You cannot avoid the refresh, but you can mitigate the tails by avoiding single-bank hot spots: spreading accesses across more banks makes it less likely that any given access lands on a bank that is mid-refresh.
LPDDR: The DRAM In Your Phone
Everything so far has been framed around DDR4 and DDR5 DIMMs, which dominate desktops and servers. Mobile and laptop silicon uses a different family called LPDDR (Low-Power DDR), and the differences are informative. A phone SoC in 2026 typically carries LPDDR5X soldered directly to the package substrate, with no socket, no DIMM, and no SPD EEPROM. The DRAM chips are placed next to the SoC (or stacked on top of it in a package-on-package arrangement) and wire-bonded or flip-chip attached through a tiny ball grid array.
The power envelope is the defining constraint. A desktop DIMM can burn several watts without anyone caring; a phone has a total power budget of a handful of watts for the entire SoC, and memory cannot consume more than a fraction of that. LPDDR achieves low power in several ways. It runs at a lower voltage (an I/O voltage of 0.5 V for LPDDR5X versus 1.1 V for DDR5), which cuts dynamic power roughly quadratically. It uses a narrower bus (16 or 32 bits per channel instead of 64), which reduces the number of I/O drivers that have to switch on every transfer. It supports deep self-refresh modes where the chip maintains its contents on its own while the SoC sleeps, at power levels measured in milliwatts. And it uses temperature-compensated self-refresh (TCSR), where the refresh rate is adjusted based on the die temperature so that cold DRAM refreshes less often, saving background power during idle.
LPDDR also has partial array self-refresh (PASR), which lets the operating system tell the DRAM to only refresh the portion of the array that currently holds valid data. When an Android phone is sitting idle with most of its RAM empty, the kernel can tell the memory controller to stop refreshing the unused banks, saving measurable standby power. Over a 24-hour day, these savings add up to meaningful battery life, which is why LPDDR dominates every device that runs on a battery and why you will never see a standard DDR5 DIMM in a phone no matter how cheap it gets.
The trade-off is bandwidth per channel. LPDDR5X tops out at around 8533 MT/s as of 2024, with a per-channel bandwidth of about 17 GB/s on a 16-bit channel. Phone SoCs combine four or eight of these channels to reach 68 to 136 GB/s of aggregate bandwidth, which is impressive for a device that runs on battery and comparable to a dual-channel desktop DDR5 system, though well short of a server. Apple's M-series SoCs push this harder: the M3 Max uses 512-bit LPDDR5 and hits around 400 GB/s, approaching server-class bandwidth in a laptop package. That is possible because Apple controls both the SoC and the memory topology and can afford to route very wide buses at the cost of package complexity.
GDDR: Bandwidth At Any Cost
GPUs live in the opposite regime. Graphics workloads are almost always bandwidth-bound (pixel shaders chewing through texture lookups, compute kernels streaming large matrices), and the cost of a GPU board can absorb a lot of memory expense. The result is GDDR, a family of DRAM that sacrifices capacity and latency for sheer transfer rate. GDDR6 runs at 16 to 21 Gbps per pin, and GDDR6X pushes past 24 Gbps per pin using PAM4 signalling, where each symbol encodes two bits of information. A modern GPU pairs GDDR6X with a 256-bit or 384-bit memory bus and reaches bandwidths of 1 TB/s or more.
The electrical engineering behind this is serious. At 24 Gbps per pin, a single bit time is about 42 picoseconds, which is shorter than the flight time of a signal across a few centimetres of PCB. Signal integrity becomes the dominant design problem, and the DRAM chips have to sit within a few centimetres of the GPU die on a carefully impedance-controlled board. The PAM4 modulation in GDDR6X halves the symbol rate for a given bit rate, making signal integrity tractable at the cost of reduced noise margin. Even so, GDDR6X boards run hot enough that the memory chips need active cooling alongside the GPU die.
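The bit time and the effect of PAM4 on the symbol rate are simple reciprocals. A minimal sketch; the 15 cm/ns propagation speed is an assumption (roughly half the speed of light, typical of common PCB dielectrics), and the function names are illustrative:

```python
def bit_time_ps(gbps_per_pin: float) -> float:
    """Duration of one bit in picoseconds at the given per-pin data rate."""
    return 1000.0 / gbps_per_pin

def symbol_rate_gbaud(gbps_per_pin: float, bits_per_symbol: int) -> float:
    """Symbols per second (in GBaud); PAM4 carries 2 bits per symbol."""
    return gbps_per_pin / bits_per_symbol

def distance_per_bit_cm(gbps_per_pin: float,
                        speed_cm_per_ns: float = 15.0) -> float:
    """How far a signal edge travels along a trace during one bit time."""
    return bit_time_ps(gbps_per_pin) * speed_cm_per_ns / 1000.0

if __name__ == "__main__":
    rate = 24.0  # Gbps per pin
    print(f"bit time        : {bit_time_ps(rate):.1f} ps")
    print(f"NRZ symbol rate : {symbol_rate_gbaud(rate, 1):.0f} GBaud")
    print(f"PAM4 symbol rate: {symbol_rate_gbaud(rate, 2):.0f} GBaud")
    print(f"trace per bit   : {distance_per_bit_cm(rate):.3f} cm")
```

Under that assumed propagation speed, several bits are in flight on a trace only a few centimetres long, which is why trace length matching and impedance control dominate GDDR board design.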
GDDR memory has the same basic cell structure as DDR and LPDDR; the differences are all at the interface. The internal banks, rows, row buffers, refresh cycles, and all the timing constraints we discussed earlier apply equally to GDDR. What changes is that the chip is tuned for much higher per-pin throughput, at the cost of somewhat higher access latency and lower density per chip. GDDR6 chips max out around 2 GB (16 Gb) each, versus up to 4 GB (32 Gb) for DDR5 dies, partly because a large share of the die area goes to the I/O circuitry needed to push those multi-gigabit data rates.
Memory Training: How The System Learns Its Own Timing
When a computer boots, the memory controller does not yet know what is plugged into its channels. It does not know the capacity, it does not know the speed rating, and it does not know the precise electrical characteristics of the board it is sitting on. All of this has to be discovered and calibrated before the first cache line of kernel code can be loaded. The process is called memory training, and it is one of the longest individual steps in a modern boot sequence, measured in hundreds of milliseconds to several seconds depending on the memory topology.
Training starts with the SPD EEPROM on each DIMM (or the equivalent configuration ROM on soldered LPDDR). The SPD contains the JEDEC-standard parameters: size, organisation, speed bin, tCL, tRCD, tRP, tRAS, vendor, part number, XMP or EXPO profiles, and so on. The BIOS or UEFI firmware reads the SPD over an I2C bus (the SMBus), picks a target frequency and voltage, and configures the memory controller's PHY (physical layer interface) accordingly.
Then the real work begins. The PHY has to be calibrated for the specific electrical environment: the trace lengths to each chip, the termination impedances, the reference voltages for the I/O receivers, the timing of when data is expected to appear on each data line relative to the clock. For DDR5 this involves training dozens of parameters per byte lane, and the training procedure systematically sweeps values, tests for correctness with known patterns, and picks the centre of the working window for each parameter. Read training, write training, DQS gating, Vref training, and duty cycle adjustment all run in sequence. On a high-capacity server system with many DIMMs per channel, the training can take several seconds and becomes the single slowest step in the boot sequence.
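The sweep-and-centre idea can be sketched in a few lines of Python. This is an illustration only, not how any particular firmware does it: `passes` is a hypothetical stand-in for "program this setting into the PHY, run the test pattern, and report whether the data came back intact", and real training works on hardware registers and often sweeps two parameters at once.

```python
from typing import Callable, Optional


def widest_passing_window(results: list[bool]) -> Optional[tuple[int, int]]:
    """Return (first, last) indices of the longest run of passing settings, or None."""
    best: Optional[tuple[int, int]] = None
    start: Optional[int] = None
    # A trailing False acts as a sentinel so a run that reaches the end is closed.
    for i, ok in enumerate(results + [False]):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


def train_parameter(passes: Callable[[int], bool], num_settings: int) -> Optional[int]:
    """Sweep every setting, then pick the centre of the widest working window."""
    results = [passes(setting) for setting in range(num_settings)]
    window = widest_passing_window(results)
    if window is None:
        return None  # nothing worked: firmware would fall back to a lower speed
    first, last = window
    return (first + last) // 2


if __name__ == "__main__":
    # Simulated byte lane whose read delay works for taps 22 through 41 inclusive.
    chosen = train_parameter(lambda tap: 22 <= tap <= 41, num_settings=64)
    print("chosen delay tap:", chosen)
```

Picking the centre rather than the first passing value is the point of the exercise: it leaves equal margin on both sides, so small shifts in temperature or voltage after boot do not push the setting out of the working window.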
The training results are sometimes cached across reboots. The firmware can hash the SPD contents and reuse a previously-validated set of timings if the memory topology has not changed, dropping boot time by seconds. The downside is that if the cached timings are wrong (for example because a DIMM was swapped and the hash matched by accident, or because temperature changed enough that the margins shifted), the system can hang during the first memory access with no useful error message. For this reason, server firmware usually retrains on a fresh boot and only uses cached values after a warm reboot.
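A minimal sketch of that caching scheme, assuming the firmware keys its saved results on a hash of every DIMM's SPD image. The class and function names here are hypothetical, the timing names are made up for illustration, and the in-memory dictionary stands in for whatever non-volatile storage real firmware would use.

```python
import hashlib
from typing import Optional


def spd_fingerprint(spd_images: list[bytes]) -> str:
    """Hash the SPD contents of every slot, in slot order, into one key."""
    h = hashlib.sha256()
    for image in spd_images:
        # Length prefix keeps the boundary between slots unambiguous.
        h.update(len(image).to_bytes(4, "little"))
        h.update(image)
    return h.hexdigest()


class TrainingCache:
    """Maps a memory topology fingerprint to previously validated timings."""

    def __init__(self) -> None:
        self._store: dict[str, dict[str, int]] = {}

    def lookup(self, spd_images: list[bytes]) -> Optional[dict[str, int]]:
        return self._store.get(spd_fingerprint(spd_images))

    def save(self, spd_images: list[bytes], timings: dict[str, int]) -> None:
        self._store[spd_fingerprint(spd_images)] = dict(timings)


if __name__ == "__main__":
    cache = TrainingCache()
    dimms = [b"\x01" * 512, b"\x02" * 512]  # placeholder SPD images
    if cache.lookup(dimms) is None:
        cache.save(dimms, {"read_delay_tap": 31, "vref_step": 12})  # after full training
    print(cache.lookup(dimms))
```

Note what the key does not capture: temperature, ageing, or a replacement DIMM whose SPD contents are identical. Those are exactly the failure cases described above, which is why a fingerprint match is a hint that retraining can be skipped rather than proof that the old timings still have margin.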
Overclockers discover memory training the hard way when they push frequencies past the SPD spec. The memory controller will try to train at the requested frequency, fail because the margins are too tight, and either fall back to a lower frequency, post an error, or hang entirely. Advanced motherboards expose dozens of manual timing parameters that let you override the training results, and enthusiasts spend hours finding the combinations that work. It is one of the last areas of PC tuning where you can measurably improve performance by hand.
NUMA: When Memory Has A Home
On a multi-socket server, memory stops being uniform. Each CPU socket has its own memory controller and its own set of DIMMs. A load issued by a core on socket 0 to an address that lives on socket 1 has to travel over the inter-socket interconnect (Intel UPI, AMD Infinity Fabric), which adds significant latency and consumes limited cross-socket bandwidth. This is NUMA, Non-Uniform Memory Access, and it changes the performance model fundamentally.
A local DRAM access on a modern server takes about 90 to 100 nanoseconds from the core's perspective. A remote DRAM access (memory on the other socket) takes 140 to 200 nanoseconds, depending on the interconnect generation and load. The bandwidth is also asymmetric: a core can pull data from its local memory at the full channel bandwidth, but cross-socket accesses are limited by the interconnect, which is typically a fraction of the aggregate memory bandwidth on each side.
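To get a feel for what these numbers mean for a real program, here is a rough model in Python. It assumes latency is the only cost and blends local and remote accesses linearly; the default values are simply the midpoints of the ranges quoted above, and the function name is hypothetical.

```python
def average_latency_ns(remote_fraction: float,
                       local_ns: float = 95.0,
                       remote_ns: float = 170.0) -> float:
    """Expected DRAM latency when a given fraction of accesses go to the other socket."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    for fraction in (0.0, 0.25, 0.5, 1.0):
        print(f"{fraction:.0%} remote: {average_latency_ns(fraction):.1f} ns average")
```

With memory interleaved evenly across two sockets, half of all accesses are remote and the average comes out around 132 ns, roughly 40% worse than the all-local case. The model ignores interconnect congestion, so under heavy cross-socket traffic the real penalty is larger.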
Software that ignores NUMA pays for it. A naive multi-threaded program that allocates all of its memory on the first socket and then runs worker threads on both sockets will see the threads on the second socket run noticeably slower than the threads on the first, approaching half the speed for memory-bound work, because every one of their memory accesses crosses the interconnect. NUMA-aware software asks the kernel for memory allocations that match the thread's home node. On Linux, the libnuma library provides numa_alloc_onnode for placing an allocation on a chosen node, the numactl command sets the placement policy for a whole process, and modern allocators like jemalloc and tcmalloc have NUMA-aware caches.
On single-socket systems with monolithic CPUs, NUMA is not a concern. On single-socket systems with chiplet architectures (AMD EPYC and Ryzen, where the package physically contains multiple chiplets connected by an on-package interconnect), NUMA effects can show up even within a socket, because different cores are closer to different memory controllers. AMD's chiplet topology makes this visible on benchmarks, and tuning can recover some of the latency.
A Concrete Latency Breakdown
To make the numbers tangible, here is a full accounting of a single DDR5-6400 random-access load on a modern Intel desktop, from the core's perspective:
- L1 data cache lookup: 1.2 ns (5 cycles at 4 GHz). The lookup misses.
- L2 cache lookup: 3.6 ns (additional). Misses.
- L3 cache lookup: 9 ns (additional). Misses.
- The load is sent to the memory controller via the ring or mesh interconnect: 5 ns.
- The memory controller queues the request, schedules it, and issues a PRECHARGE to close the currently-open row on the target bank: 3 ns controller latency.
- tRP (row precharge time): 14 ns.
- ACTIVATE command for the new row: 1 bus clock.
- tRCD (row-to-column delay): 14 ns.
- READ command: 1 bus clock.
- tCAS (CAS latency at DDR5-6400 CL32): 10 ns.
- Data burst (16 beats on DDR5): 2.5 ns at 6400 MT/s.
- Data travels back through the memory controller, the interconnect, and the cache hierarchy to the core: 7 ns.
- Total: approximately 70 nanoseconds from load instruction issue to data in a register.
That is for a random access that misses every cache and hits a row conflict. If the load had hit an open row (no precharge, no activate), it would have been about 30 ns faster. If it had hit L3, it would have been about 56 ns faster. If it had hit L1, it would have been about 69 ns faster. The gap between L1 and a DRAM miss is roughly 60-fold, which is why cache behaviour dominates the performance of almost every workload that cares about speed.
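The DRAM-side portion of that breakdown depends entirely on the state of the target bank. The sketch below, in Python with hypothetical names, converts the CL32 rating into nanoseconds and compares the three cases using the tRP, tRCD, and CAS figures from the list above. It covers only the DRAM timing parameters, not the cache lookups, interconnect, or controller queueing.

```python
def cycles_to_ns(cycles: int, data_rate_mt_s: int) -> float:
    """Convert a timing in clock cycles to ns. DDR moves two transfers per clock,
    so the clock in MHz is half the data rate in MT/s."""
    clock_mhz = data_rate_mt_s / 2
    return cycles * 1000.0 / clock_mhz


T_RP_NS = 14.0                      # precharge: close the currently open row
T_RCD_NS = 14.0                     # activate: open the new row
T_CAS_NS = cycles_to_ns(32, 6400)   # CL32 at DDR5-6400


def dram_core_latency_ns(row_state: str) -> float:
    """Latency from first DRAM command to first data, by bank state."""
    if row_state == "hit":        # the right row is already open
        return T_CAS_NS
    if row_state == "empty":      # bank is precharged, no row open
        return T_RCD_NS + T_CAS_NS
    if row_state == "conflict":   # a different row is open
        return T_RP_NS + T_RCD_NS + T_CAS_NS
    raise ValueError(f"unknown row state: {row_state}")


if __name__ == "__main__":
    for state in ("hit", "empty", "conflict"):
        print(f"row {state}: {dram_core_latency_ns(state):.1f} ns")
```

The 28 ns difference between a row hit and a row conflict accounts for the bulk of the roughly 30 ns saving quoted above, and it is the reason memory controllers work so hard to reorder requests so that accesses to an already-open row are served first.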
The practical consequence is that memory latency is the single most important performance parameter for a lot of software, and the CPU architects know it. Prefetchers, out-of-order execution, non-blocking caches, memory-level parallelism, and the entire apparatus of modern CPU design exist largely to hide DRAM latency by keeping useful work flowing while the memory controller is busy. When those mechanisms fail (because the access pattern is unpredictable or the working set is too large), the CPU stalls and performance collapses to a small fraction of its potential.
CXL And The Future Of Memory Attach
The next structural change in how main memory connects to CPUs is Compute Express Link (CXL), a PCIe-based protocol that lets memory live on a separate device instead of on dedicated DDR channels. CXL.mem exposes a pool of DRAM (or other media) through a cache-coherent load/store interface, so a CPU can treat a CXL-attached memory device as if it were a slower, further-away NUMA node. The latency penalty is real, about 150 to 250 ns added on top of the underlying media, because the request has to traverse PCIe PHYs and switches before reaching the memory controller on the CXL device. But in exchange, you get things that DDR channels cannot offer: hot-pluggable memory, memory pooling across multiple hosts in a rack, tiered memory where cold pages get pushed to cheaper media, and disaggregated memory architectures that decouple compute capacity from memory capacity.
The first generation of CXL memory expanders shipped in 2023 and 2024 alongside Sapphire Rapids and Genoa. They look like PCIe cards with DDR5 DIMMs on them and a small controller chip bridging PCIe and DDR5. The operating system sees them as an extra NUMA node with high latency and reasonable bandwidth, and Linux has learned to page cold data into them and hot data out of them using the same mechanisms it already used for persistent memory. It is still early, but the trajectory is clear: the days when all of a server's memory lived on dedicated DDR channels behind the CPU are ending, and the hierarchy is getting one more tier above the SSD and below the DDR socket.
Why None Of This Will Get Simpler
The obvious question after all of this is why DRAM is so complicated and whether something simpler could replace it. The answer is that every alternative has been tried and every one has failed on the density-cost-latency triangle. SRAM is faster and needs no refresh but uses six transistors per bit instead of one transistor and one capacitor, making it roughly ten times less dense and therefore roughly ten times more expensive per gigabyte. Flash is denser and non-volatile but is orders of magnitude slower and wears out with writes. Phase-change memory (3D XPoint, shipped by Intel as Optane) briefly looked like it could bridge the gap, with speeds closer to DRAM and non-volatility like flash, but the economics never worked and Intel shut the product line down in 2022. MRAM and ReRAM are active research topics but have not displaced DRAM at scale. Every year someone predicts the death of DRAM and every year DRAM quietly ships another few exabytes to data centres and phones.
DRAM's one-transistor-one-capacitor cell is almost magically efficient. A modern 32 Gb die fits 32 billion of them on a square centimetre of silicon, runs at nanosecond latencies, consumes a few watts under full load, and costs a few dollars. Replacing it with anything non-leaky would require giving up density, and giving up density means giving up capacity per dollar, and nobody wants to do that. So we live with the leakage, we live with the refresh, we live with Rowhammer, and we build ever more sophisticated controllers and ECC codes to paper over the cracks. It is a compromise, but it is the best compromise we have found in fifty years of trying, and it is likely to stay that way for the next decade at least.
Put all of this together, and the flat pointer interface is honest but not complete. The physical DRAM beneath is a hierarchical, parallel, state-machine-driven system where every access travels through multiple queues, a scheduling policy, and a layered timing model before becoming a set of voltages on a bit line. It is remarkable that it works at all. It is even more remarkable that it works so reliably that we can usually forget about it and just pretend main memory is a flat array, right up until Rowhammer or a latency spike or a correctable ECC error reminds us that there are billions of leaky capacitors down there, each one a single bit flip away from ruining our day.