How Large Language Models Actually Work
Large language models look mysterious from the outside because the interface is plain text and the output often feels fluent, contextual, and oddly flexible. Under the hood the system is far more mechanical. A model does not read a sentence the way a person reads it, and it does not reason in prose first and only then produce words. It turns text into tokens, turns tokens into vectors, applies a large stack of matrix operations, and then predicts one more token. That prediction is appended to the context and the same process repeats.
That loop is simple enough to state in one sentence, but the details matter. Why do tokens exist instead of raw characters? What does attention actually compute? Why does memory usage explode with long prompts? Why do inference systems care so much about KV cache, quantisation, and batching? Why can one model fit on a laptop with 4 bit weights while another needs several racks of GPUs? These are engineering questions, not marketing questions, and they are where the real system lives.
This article stays in that layer. We will cover the transformer architecture, tokenisation, embeddings, positional information, attention score calculation, feed forward blocks, training objective, inference loop, sampling, KV cache, quantisation, and the operational constraints of serving these models at scale. The target is not "AI for everyone". The target is the reader who wants to know what a model server in Frankfurt or Amsterdam is actually doing between receiving a prompt and streaming a reply back to the client.
The Core Job Is Next Token Prediction
An LLM is trained to estimate the probability distribution of the next token given the tokens that came before it. If the current token sequence is:
    The packet reached the router because the ARP cache already
the model produces a probability distribution over the vocabulary for the next position. Tokens such as contained, held, had, or stored might get meaningful probability mass. A token such as elephant should get almost none.
This sounds underwhelming until you scale it. A model with billions of parameters has enough capacity to compress statistical patterns from code, prose, configuration files, forum posts, RFCs, manuals, transcripts, and textbooks into weight matrices. The training objective is still just next token prediction, but the patterns required to do that well include syntax, style, long range dependency tracking, common algorithmic structures, frequent factual associations, and many domain specific conventions.
The model never sees "truth" directly. It sees examples of token sequences and adjusts its weights so that the token that actually followed in the training corpus gets higher probability next time. This matters because it explains both the power and the failure mode. The power comes from compressing vast regularities of human generated text. The failure mode comes from the same place: a model can generate text that looks statistically plausible even when the underlying claim is false.
At inference time, everything reduces to a repeated loop:
- Take the prompt tokens.
- Run them through the network.
- Compute probabilities for the next token.
- Select one token according to a sampling rule.
- Append it and continue.
Everything else in the architecture exists to make that loop expressive enough to model language and fast enough to run economically.
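The loop above can be sketched in a few lines of Python. This is only an illustration: `toy_next_token_probs` is a hypothetical stand-in for a real network, and the five word vocabulary is made up. What matters is the shape of the loop, not the "model" inside it.

```python
import random

VOCAB = ["the", "packet", "reached", "router", "."]

def toy_next_token_probs(context):
    """Hypothetical stand-in for a real model: one probability per vocabulary entry."""
    # Favour the token that follows the last one in VOCAB order (wrapping around).
    favoured = (context[-1] + 1) % len(VOCAB)
    return [0.9 if i == favoured else 0.1 / (len(VOCAB) - 1) for i in range(len(VOCAB))]

def generate(prompt_ids, max_new_tokens, rng):
    context = list(prompt_ids)                    # take the prompt tokens
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(context)     # run the "network", get probabilities
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]  # sample one token
        context.append(next_id)                   # append it and continue
    return context

ids = generate([0], max_new_tokens=4, rng=random.Random(0))
print(" ".join(VOCAB[i] for i in ids))
```

A real server replaces the toy function with a full forward pass, but the control flow around it stays this simple.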
Tokenisation Is The First Compression Layer
Models do not usually operate on raw Unicode characters. The input text is split into tokens, which are discrete IDs from a vocabulary learned during tokenizer training. Modern LLMs commonly use byte pair encoding variants, unigram tokenisation, or similar subword schemes.
The reason is practical. Character level models make sequences too long. Word level models explode the vocabulary and struggle with rare words, identifiers, and morphology. Subword tokenisation lands in the middle. Frequent pieces of text such as the, ing, tion, router, printf, or https become reusable units. Rare words are decomposed into smaller pieces instead of becoming unknown symbols.
Consider this example:
    Katerina configured the load balancer in Athens.
A tokenizer might break it roughly like this:
    ["Kat", "erina", " configured", " the", " load", " bal", "ancer", " in", " Athens", "."]
The exact split depends on the tokenizer, but the principle is stable. The system is looking for pieces that are common enough to deserve their own symbol while still allowing rare strings to be represented without inventing a new vocabulary entry for every possible word.
Tokenisation affects almost every practical property of a model:
- Prompt length limits are measured in tokens, not characters.
- Billing in hosted APIs is usually by token count.
- Source code tokenises differently from prose, which is one reason code models are often trained with code heavy vocabularies.
- Languages with rich morphology or no whitespace separation can suffer if the tokenizer was poorly fit for them.
- Long identifiers, URLs, and hexadecimal blobs can consume many tokens and therefore more memory and latency.
If a prompt that looks short in plain text becomes a 9,000 token monster after tokenisation, inference cost follows the tokens, not the human eye.
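To make the idea of subword pieces concrete, here is a minimal sketch of a greedy longest match tokenizer. It is a simplification, not how byte pair encoding or unigram tokenizers actually work (those apply learned merges or learned piece probabilities), and the vocabulary is hypothetical, chosen only to reproduce the Katerina split from earlier.

```python
def tokenize(text, vocab):
    """Greedy longest-match split; falls back to single characters for unknown text."""
    tokens = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first so frequent long pieces win.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary, hand-written for this example.
vocab = {"Kat", "erina", " configured", " the", " load", " bal", "ancer", " in", " Athens", "."}

text = "Katerina configured the load balancer in Athens."
pieces = tokenize(text, vocab)
print(pieces)
print(len(text), "characters ->", len(pieces), "tokens")
```

The last line is the number that matters operationally: context limits, billing, and memory all follow the token count on the right, not the character count on the left.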
Token Boundaries Shape What The Model Can Notice Easily
Tokenisation is not just a preprocessing detail. It changes which patterns are cheap for the model to learn and which patterns are awkward. A frequent programming keyword such as return may be one token. A long identifier such as customer_subscription_renewal_deadline may be broken into many pieces. After tokenisation the transformer no longer sees the raw source string directly. It sees the chosen pieces.
That matters for practical reasons:
- very long identifiers increase sequence length
- rare words may be split into fragments with weaker standalone meaning
- whitespace handling affects code formatting behaviour
- punctuation heavy data such as JSON, stack traces, and URLs can consume surprising context budget
Take two prompts that look similar to a person:
    Explain why the PostgreSQL planner ignored my index.
    Explain why the PostgreSQL query planner ignored my multicolumn covering index on created_at and status.
The second prompt is not just longer in characters. It probably expands into noticeably more tokens, because technical terms such as multicolumn, covering, created_at, and status may each split into several pieces. That larger token footprint means more embedding lookups, more attention work in prefill, and more KV cache memory for the retained context.
This is also why prompt engineers pay attention to tables, logs, and pasted code. A stack trace with 250 lines may carry less useful information than a carefully summarised 20 line description, but the model pays for all the tokens either way.
For multilingual behaviour the tokenizer is even more consequential. A tokenizer trained mostly on English may represent Greek, Finnish, or code mixed text less efficiently. The model can still process it, but longer token sequences mean more compute and potentially less clean statistical reuse. When people say a model feels stronger in one language than another, the training corpus matters most, but tokenizer fit is part of the story as well.
Embeddings Turn Token IDs Into Dense Vectors
A token ID alone is just an integer. The network needs a dense numerical representation, so each token is mapped through an embedding table. If the vocabulary size is V and the hidden dimension is d_model, then the embedding matrix has shape V x d_model.
For a token ID t, the model looks up row E[t]:
    x_0 = E[t]
This vector is not hand designed. It is learned during training. Tokens that appear in similar contexts often end up in related regions of the embedding space, though the geometry is more complicated than simple "synonym closeness". The vector space is serving the downstream network, not human intuition.
Embeddings are one reason model weights are so large. A vocabulary of 100,000 tokens and a hidden size of 4,096 already yields more than 400 million embedding parameters before counting the rest of the transformer. For larger models the numbers grow quickly.
There is usually a matching output projection at the end of the network called the language modelling head. In many models this projection is tied to the embedding matrix, which saves parameters and often improves quality. Tied weights mean the same table used to convert token IDs into vectors is also used, transposed, to convert the final hidden state back into logits over the vocabulary.
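The lookup and the tied output projection can be sketched with plain Python lists. The sizes here are toy values and the weights are random, so the numbers themselves mean nothing. The sketch only shows that the same table `E` is read by row on the way in and used as a matrix on the way out.

```python
import random

V, d_model = 6, 4                      # toy sizes; the text quotes V=100,000 and d_model=4,096
rng = random.Random(0)
E = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(V)]   # embedding table, V x d_model

def embed(token_id):
    """Embedding lookup: row E[t]."""
    return E[token_id]

def tied_logits(hidden):
    """Tied language modelling head: one dot product with each row of E per vocabulary entry."""
    return [sum(e * h for e, h in zip(row, hidden)) for row in E]

x0 = embed(3)
logits = tied_logits(x0)
print(len(x0), len(logits))

# Parameter count for the sizes quoted in the text.
embedding_params = 100_000 * 4_096
print(f"{embedding_params:,}")   # 409,600,000
```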
Order Must Be Added Explicitly
A plain attention mechanism has no built in notion of order. Strictly, it is permutation equivariant: if you feed it the same token vectors in a different order, the outputs are simply reordered the same way, and nothing in the computation registers that the order changed. Language obviously depends on order, so transformers add positional information.
The earliest transformer papers used sinusoidal positional encodings. Many modern LLMs use rotary positional embeddings, usually abbreviated RoPE. The idea is to modify the query and key vectors so relative position is encoded in how those vectors rotate across dimensions.
Why relative position matters becomes obvious with examples:
    the router dropped the packet
    the packet dropped the router
The token set is identical, but the meaning is not. Positional information is what lets the model represent that distinction.
RoPE became popular because it handles relative position elegantly and extrapolates better than some older approaches. It is not magic. Very long context lengths still stress the model because attention cost grows with sequence length and because a model trained mostly on shorter contexts may not generalise cleanly to very long ones. Still, positional encoding is the answer to the basic question "how does the model know which token came first".
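The relative position property of RoPE can be checked numerically. The sketch below rotates consecutive pairs of dimensions, which is one common layout (some implementations pair dimensions differently), and uses the conventional base of 10000. The query and key vectors are made up.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by an angle proportional to the position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)     # earlier pairs rotate faster
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 0.3, 0.8]
k = [1.0, 0.2, -0.4, 0.6]

# Same offset (3 positions apart) at two different absolute locations.
near_start = dot(rope(q, 5), rope(k, 2))
far_along = dot(rope(q, 105), rope(k, 102))
different_offset = dot(rope(q, 5), rope(k, 4))

print(abs(near_start - far_along) < 1e-9)
print(round(near_start, 4), round(different_offset, 4))
```

The attention score between a rotated query and a rotated key depends on how far apart the two positions are, not on where they sit in the sequence, which is the behaviour the paragraph above describes.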
Attention Is Weighted Information Retrieval
Attention is the heart of the transformer. Every token representation at a layer is projected into three new vectors:
- query Q
- key K
- value V
For a sequence of token states arranged into a matrix X, these are computed as:
    Q = X W_Q
    K = X W_K
    V = X W_V
where W_Q, W_K, and W_V are learned weight matrices.
The attention score from token i to token j is the dot product between the query for i and the key for j, scaled by the square root of the head dimension:
    score(i, j) = (Q_i · K_j) / sqrt(d_k)
For autoregressive language models, a causal mask blocks access to future tokens. Token 10 can attend to tokens 1 through 10, but not 11 onward. After masking, the model applies softmax across the allowed positions:
    alpha(i, j) = softmax_j(score(i, j))
These alpha values are the attention weights. The output for token i is then the weighted sum of value vectors:
    output_i = Σ_j alpha(i, j) V_j
That sentence sounds abstract until you read it operationally. Token i is asking: which earlier positions matter for me right now, and how much information should I pull from each one? The query is the question, the keys are address labels, and the values are the content retrieved once the relevant positions are chosen.
Consider a short sequence:
    Sofia opened the archive because she needed the log.
When processing she, one attention head may put high weight on Sofia. Another head may focus on the clause boundary around because. Another may track the object archive. Multi head attention works because different heads can specialise in different relationships.
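The score, mask, softmax, and weighted sum steps fit in a short function. This is a minimal single head sketch over plain Python lists with made-up Q, K, and V values. Real implementations do the same arithmetic as batched matrix multiplies on an accelerator.

```python
import math

def softmax(scores):
    m = max(scores)                                  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head attention over lists of vectors; position i only sees positions 0..i."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Causal mask: only score keys at positions j <= i.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k) for j in range(i + 1)]
        alpha = softmax(scores)
        out = [sum(alpha[j] * V[j][c] for j in range(i + 1)) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(alpha)
    return outputs, weights

# Toy Q, K, V for three tokens with d_k = d_v = 2 (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention weights. The rows get longer because later tokens are allowed to look at more positions.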
Multi Head Attention Lets Different Patterns Coexist
A transformer block does not use one attention map. It uses many heads in parallel. If the hidden dimension is 4,096 and there are 32 heads, each head might operate on 128 dimensional query, key, and value vectors. The outputs of all heads are concatenated and projected back to the model dimension.
This allows the layer to model several kinds of dependency at once:
- local syntax
- long range reference resolution
- delimiter tracking
- code indentation and bracket matching
- list structure
- frequent phrase completion patterns
Not every head learns something clean or human interpretable. Some appear redundant. Some become highly specialised. Some mostly help in a narrow slice of contexts. The important point is that the model does not rely on one universal notion of relevance. It learns many partial notions and combines them.
This is also where compute cost starts to dominate. For sequence length n, attention requires n x n score interactions per head. Double the context length and the attention matrix grows by roughly four times. That quadratic growth is why very long prompts are expensive and why so much research has gone into sparse attention, chunking, linear attention variants, and better KV cache reuse.
A Small Attention Example Makes The Math Less Abstract
The equations are easier to trust once you see what they imply on a tiny sequence. Imagine the model is processing four tokens:
    [The] [switch] [forwards] [frames]
At one layer, each token state is projected to query, key, and value vectors. Suppose the query for forwards scores the earlier tokens like this after scaling:
    The: 0.2
    switch: 2.4
    forwards: 1.1
    frames: blocked by causal mask
Softmax converts those scores into weights:
    The: 0.08
    switch: 0.72
    forwards: 0.20
The output state for forwards becomes:
    0.08 * V(The) + 0.72 * V(switch) + 0.20 * V(forwards)
Read that mechanically. This head has learned that the token forwards should mostly pull information from switch, somewhat from itself, and barely from The. Another head in the same layer may focus on something else entirely, perhaps grammatical structure or nearby collocations.
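The weights in this example can be reproduced by applying softmax to the three unmasked scores. The snippet uses only the numbers given above.

```python
import math

scores = {"The": 0.2, "switch": 2.4, "forwards": 1.1}   # "frames" is masked out

exps = {tok: math.exp(s) for tok, s in scores.items()}
total = sum(exps.values())
weights = {tok: e / total for tok, e in exps.items()}

for tok, w in weights.items():
    print(f"{tok}: {w:.2f}")
# The: 0.08
# switch: 0.72
# forwards: 0.20
```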
In matrix form, one head computes:
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
If the sequence length is n, then Q has shape n x d_k, K has shape n x d_k, Q K^T becomes n x n, and multiplying by V produces n x d_v. That n x n score matrix is the expensive part. It is where every token compares itself with every earlier token. For a 16,000 token prompt, one head is not building a neat conceptual graph. It is performing a very large dense similarity calculation. Multiply that across dozens of heads and dozens of layers and the scale of the problem becomes obvious.
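A quick count makes the scale concrete. The head and layer counts below (32 of each) are assumptions picked for illustration, and the count is for the full n x n matrix. A causal mask means roughly half of those entries are actually needed.

```python
def score_entries(n_tokens, n_heads=1, n_layers=1):
    """Number of entries in the n x n score matrix, across heads and layers."""
    return n_tokens * n_tokens * n_heads * n_layers

print(f"{score_entries(16_000):,}")                    # 256,000,000
print(f"{score_entries(16_000, 32, 32):,}")            # 262,144,000,000
print(score_entries(32_000) / score_entries(16_000))   # 4.0
```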
Feed Forward Blocks Do The Heavy Transformation Work
People often focus on attention, but each transformer layer also contains a position wise feed forward network, sometimes written as an MLP block. After attention updates the token representations, the model passes each position independently through a small neural network:
    FFN(x) = W_2 sigma(W_1 x)
In many modern models this becomes a gated variant such as SwiGLU:
    FFN(x) = W_o (silu(W_a x) * (W_b x))
The feed forward block is where much of the parameter count lives. Attention mixes information across tokens. The feed forward network performs a nonlinear transformation within each token's representation. A rough mental model is:
- attention decides where to gather information from
- feed forward decides how to reshape that gathered information
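The SwiGLU formula above translates almost directly into code. This sketch applies it to a single position's vector using toy weight matrices. The sizes (model width 2, hidden width 3) and all the numbers are made up.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def silu(z):
    return z / (1.0 + math.exp(-z))     # z * sigmoid(z)

def swiglu_ffn(x, W_a, W_b, W_o):
    """FFN(x) = W_o (silu(W_a x) * (W_b x)), applied to one position's vector."""
    gate = [silu(z) for z in matvec(W_a, x)]
    up = matvec(W_b, x)
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise gating
    return matvec(W_o, hidden)

# Toy weights: d_model = 2, hidden width = 3.
W_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_b = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
W_o = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

y = swiglu_ffn([1.0, 2.0], W_a, W_b, W_o)
print(y)
```

Note that the function takes one vector and never looks at neighbouring positions. That is the "position wise" part: all mixing across tokens has already happened in attention.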
Residual connections and layer normalisation keep the signal stable across many layers:
    h1 = x + Attention(LN(x))
    h2 = h1 + FFN(LN(h1))
Without residual paths, very deep transformers would be much harder to train. With them, layers can incrementally refine representations instead of rebuilding them from scratch every time.
Normalisation And Residual Paths Are What Let Depth Work At All
It is tempting to describe a transformer as attention plus an MLP repeated many times, but the stabilising machinery around those blocks is what makes modern depth practical. Without it, gradients become noisy, activation scales drift, and later layers become hard to train.
Layer normalisation rescales activations so each token representation stays in a manageable numerical range. Residual connections then add the block output back to the incoming signal:
    h_next = h_current + block(LN(h_current))
That gives the model two useful properties. A layer can make a small corrective update rather than invent a whole new representation. Information from early layers can still flow forward even if some later block is only partially useful.
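The residual update can be sketched as follows. The layer norm here omits the learned scale and shift that real layers carry (and many recent models use RMSNorm instead), and `small_update` is a hypothetical stand-in for an attention or feed forward block.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(h, block):
    """h_next = h + block(LN(h)): the block only contributes an adjustment."""
    update = block(layer_norm(h))
    return [a + b for a, b in zip(h, update)]

# Hypothetical block that proposes a small correction.
small_update = lambda x: [0.1 * v for v in x]

h = [100.0, 102.0, 98.0, 100.0]
h_next = residual_block(h, small_update)
print([round(v, 3) for v in layer_norm(h)])
print([round(v, 3) for v in h_next])
```

The block sees a well scaled input regardless of how large the residual stream has grown, and its output nudges the stream without replacing it.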
This is one reason transformer stacks can be so deep. A layer that learns something valuable about bracket structure in code, list continuation in prose, or token level entity linkage does not need to carry the whole burden of meaning alone. It contributes an adjustment. The residual stream carries the evolving representation forward and later blocks keep refining it.
From a systems perspective, normalisation also matters because mixed precision arithmetic is everywhere in modern training and inference. When weights and activations are stored in reduced precision formats, keeping values numerically well behaved is not just nice for theory. It is necessary for stable high throughput kernels on real hardware.
Depth Matters Because Meaning Is Built Gradually
A single transformer layer is not enough. Modern LLMs stack dozens of layers, and the largest go past a hundred. Early layers often capture local relationships. Middle layers assemble broader semantic patterns. Later layers shape the representation for the output token distribution.
This is not a rigid pipeline with sharp boundaries, but depth matters because language requires multiple passes of composition. To predict the right next token in a technical paragraph, the model may need to:
- recognise the current sentence structure
- track what entity pronouns refer to
- maintain topic continuity across several earlier paragraphs
- understand whether the context is prose, JSON, SQL, or Python
- infer the most plausible continuation under that mode
No single matrix multiplication does all of that. The stack builds it progressively.
This is also why the shape of a larger model matters and not only its parameter count. More width allows richer representations. More depth allows more rounds of transformation. The lesson is not simply "bigger is better", but scale has historically bought tangible quality gains because the tasks encoded in natural language are extremely varied.
Training Data Pipelines Matter Almost As Much As Model Shape
The network architecture gets most of the public discussion, but a large fraction of model quality is determined before optimisation even begins. Training corpora have to be gathered, deduplicated, filtered, mixed, and packed into token sequences. If that data pipeline is poor, no amount of clean matrix multiplication rescues the final model.
Real corpora contain problems everywhere:
- duplicated web pages
- near duplicate mirrored documents
- machine generated spam
- OCR noise
- broken markup
- low quality forum reposts
- licence and provenance constraints
If a corpus is heavily duplicated, the model can overfit repeated phrasing instead of learning broad patterns. If code is mixed badly with prose, the model may lose syntax fidelity. If multilingual balance is poor, some languages receive far less effective training signal. Data mixture is therefore not a clerical step. It is part of the model design.
Packing strategy matters too. During training, examples are often concatenated into long token blocks to keep hardware busy. Boundary markers tell the model where one document ends and another begins. Sequence packing improves throughput, but it also changes how often the model sees transitions between unrelated texts. That affects what it learns about document structure and context continuity.
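A minimal sketch of sequence packing, assuming documents are already tokenised into lists of integer IDs. The boundary token ID and the choice to drop the ragged tail are assumptions for illustration. Real pipelines differ on both.

```python
def pack(documents, block_size, eos_id):
    """Concatenate tokenised documents with a boundary marker, then cut fixed-size blocks."""
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(eos_id)          # marks where one document ends
    # Drop the ragged tail so every block is full.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

EOS = 0                                 # hypothetical boundary token ID
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14], [15]]
blocks = pack(docs, block_size=4, eos_id=EOS)
for block in blocks:
    print(block)
```

The second block contains the end of one document and the start of another, which is exactly the kind of transition between unrelated texts that packing exposes the model to.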
This is why two organisations can train similarly sized transformer backbones and still get noticeably different results. One may have better data curation, stronger deduplication, cleaner code extraction, or better filtering of low quality text. The public conversation often compresses all of that into the phrase "better training data", but operationally it is a long pipeline of engineering decisions.
Training Large Models Is A Distributed Systems Problem
Once models reach billions of parameters, training stops being something one accelerator can do alone. The work is split across many GPUs or other accelerators, and that introduces a second challenge beyond pure arithmetic: coordination.
Real training runs combine several forms of parallelism:
- data parallelism so different workers process different batches
- tensor parallelism so large matrix operations are split across devices
- pipeline parallelism so different layer groups sit on different machines
- optimiser state sharding so parameter moments do not all live in one place
Communication then becomes a primary bottleneck. Activations, gradients, and optimiser state all have to move across high bandwidth links. A cluster with weak interconnect can spend huge amounts of time waiting for synchronisation even if the raw compute hardware looks impressive on paper.
This is why serious training clusters care about topology, checkpoint recovery, and storage throughput. A run that spans weeks must tolerate node failure, network noise, and periodic checkpoint writes of very large model states. The transformer architecture gets most of the public attention, but the difference between a research scale run and a frontier scale run is often hidden in the orchestration layer.
Training Is Massive Cross Entropy Optimisation
During pretraining, the model sees a huge corpus of token sequences. For each position it produces logits over the vocabulary. Those logits become probabilities through softmax. The loss compares the predicted distribution against the actual next token from the corpus, usually with cross entropy.
If the true next token is y and the predicted probability for that token is p(y), then the loss contribution is:
    L = -log p(y)
Backpropagation computes gradients for all the weights and an optimiser such as AdamW updates them. This happens over trillions of token examples across many GPUs.
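The loss for a single position can be computed directly from the logits. The sketch below folds the softmax and the negative log into one numerically stable expression. The logits are made-up scores over a four token vocabulary.

```python
import math

def cross_entropy(logits, target_id):
    """L = -log p(y), with p computed by a numerically stable softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_id]      # equals -log softmax(logits)[target_id]

logits = [2.0, 0.5, -1.0, 0.1]

confident_and_right = cross_entropy(logits, 0)   # true token had the highest logit
confident_and_wrong = cross_entropy(logits, 2)   # true token had the lowest logit
uniform_guess = cross_entropy([0.0] * 4, 1)      # no preference at all: log(4)

print(round(confident_and_right, 3), round(confident_and_wrong, 3), round(uniform_guess, 3))
```

The loss is small when the model put high probability on the token that actually followed, and large when it did not. Training pushes the average of this quantity down.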
A simplified training loop looks like this:
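    # PyTorch-style sketch; model, dataloader, and optimizer are assumed to exist.
    from torch.nn.functional import cross_entropy

    for batch in dataloader:
        logits = model(batch.tokens[:, :-1])      # predictions for every position but the last
        targets = batch.tokens[:, 1:]             # the tokens that actually came next
        # cross_entropy expects (N, vocab) logits and (N,) targets, so flatten batch and time.
        loss = cross_entropy(logits.flatten(0, 1), targets.flatten())
        loss.backward()                           # gradients for every weight
        optimizer.step()                          # e.g. an AdamW update
        optimizer.zero_grad()                     # reset gradients for the next batch
The real system is much more complicated: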
- data sharding across many workers
- mixed precision arithmetic
- gradient accumulation
- tensor parallelism and pipeline parallelism
- checkpointing to survive machine failure
- learning rate schedules
- careful curation and filtering of the corpus
Still, the optimisation target remains simple. The model keeps adjusting its internal representation until it gets better at predicting the token that historically came next.
Batch Size, Learning Rate, And Optimiser State Decide Whether Training Stays Stable
Once the data is prepared, the next practical problem is updating the weights at scale without diverging. Frontier training runs do not simply pick a random batch size and hope. They tune optimiser hyperparameters carefully because the parameter count, sequence length, precision format, and cluster size all interact.
AdamW remains common because it keeps moving averages of gradients and squared gradients, which helps training stay stable across very large parameter spaces. The price is memory. For each trainable weight, the system may need to store the parameter itself, the gradient, a first moment estimate, and a second moment estimate. Optimiser state sharding matters so much on large runs for that reason. The raw model weights are only part of the memory footprint.
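A rough memory estimate shows why the optimiser state matters. The sketch assumes every value is stored in 32 bits and uses a hypothetical 7 billion parameter model. Real mixed precision setups store some copies in 16 bits and may keep an extra master copy of the weights, so treat the result as an order of magnitude.

```python
def training_state_gib(n_params, bytes_per_value=4):
    """Memory for weights, gradients, and the two AdamW moment estimates."""
    copies = {"weights": 1, "gradients": 1, "first_moment": 1, "second_moment": 1}
    gib = 1024 ** 3
    return {name: n * n_params * bytes_per_value / gib for name, n in copies.items()}

state = training_state_gib(7_000_000_000)
for name, size in state.items():
    print(f"{name:>13}: {size:6.1f} GiB")
print(f"{'total':>13}: {sum(state.values()):6.1f} GiB")
```

Under these assumptions the weights are only a quarter of the total, before counting activations.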
Learning rate schedules are equally important. A run often begins with warmup so very large initial updates do not destabilise early training, then decays the learning rate over time as the model approaches a better region of parameter space. A schedule that is too aggressive can cause loss spikes. One that is too timid can waste enormous compute budgets converging too slowly.
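One common shape for such a schedule is linear warmup followed by cosine decay. The cosine form, the peak rate of 3e-4, and the step counts below are assumptions for illustration. The text above does not prescribe a specific curve.

```python
import math

def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 500, 999, 1_000, 50_000, 100_000):
    print(step, f"{learning_rate(step, 3e-4, 1_000, 100_000):.2e}")
```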
Batching introduces another trade off. Larger global batch sizes improve hardware utilisation and smooth the gradient estimate, but they also change optimisation dynamics. The cluster in effect sees fewer parameter updates per token processed. That can be good or bad depending on the regime. Scaling laws and empirical tuning help decide where the training run remains compute efficient without ruining convergence.
Post Training Changes Behaviour More Than Most Users Realise
A pretrained base model is often not what users interact with. Post training stages usually include supervised fine tuning and preference optimisation. The goal is to make the raw next token predictor more useful as an assistant.
Supervised fine tuning trains on prompt response pairs. This teaches instruction following, dialogue formatting, and task specific styles. Preference tuning then shifts behaviour further by ranking outputs that humans or ranking models prefer.
Operationally this means two models with the same transformer backbone can feel very different:
- one may be terse and literal
- one may be more chatty
- one may refuse more categories of content
- one may format code more reliably
The core architecture is unchanged. What changes is the distribution the model has been nudged toward after pretraining.
Inference Is A Different Problem From Training
Training is about changing weights. Inference freezes weights and focuses on serving requests efficiently. This changes the bottlenecks completely.
At inference time, the expensive part for a long prompt is often the initial "prefill" phase. The model must process every prompt token through all layers and compute the attention keys and values for each position. Once that is done, token generation becomes an incremental decode loop. Each new token only needs attention against the existing cache rather than recomputing the entire prompt from scratch.
This is where the KV cache appears.
For each layer and attention head, the model stores the key and value tensors for prior tokens. When generating token t+1, it computes the new query for that position and attends against cached keys and values from positions 1..t. Without KV cache, each next token would require recomputing all previous keys and values again, which would be far slower.
The trade off is memory. KV cache size grows with:
- number of layers
- number of heads
- head dimension
- sequence length
- batch size
An LLM service with many concurrent long context sessions can become memory bound long before it becomes compute bound.
Prefill And Decode Stress Hardware In Different Ways
Inference is often described as one continuous generation process, but serving systems usually divide it mentally into two phases because the hardware bottlenecks are different.
Prefill is the first pass over the prompt. Every token in the input sequence must pass through every layer so the model can build the hidden states and KV cache. Prefill has a lot of parallel work across prompt positions, so it can utilise accelerator matrix units well. The problem is sheer volume. A 20,000 token prompt forces the server to do a large amount of dense work before the first output token appears.
Decode is different. Once the cache exists, each next token only introduces one new position. The server is now generating incrementally, often one token per user per step. Decode can become memory bandwidth bound because the system repeatedly reads cached keys and values from many layers while doing relatively smaller fresh matrix multiplies for the new token.
That distinction explains many real product behaviours:
- long prompts create large time to first token because prefill dominates
- long responses create sustained GPU occupancy because decode keeps running
- prompt caching helps repeated system prompts because it reduces repeated prefill work
- speculative decoding can speed generation by reducing how often the large model must be consulted
Speculative decoding is a good example. A smaller draft model predicts several candidate tokens quickly. The larger model then verifies them in chunks. If the draft guesses well, the service emits more output per expensive large model pass. This is not changing the transformer mathematics. It is changing the serving strategy around it.
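The accept and reject logic can be sketched for the greedy case. Both "models" here are hypothetical toy functions over integer tokens, and the verification loop calls the target once per position for readability. A real server scores all proposed positions in one batched pass, which is where the saving comes from.

```python
def speculative_step(context, draft_next, target_next, k):
    """One round of greedy speculative decoding. Returns the tokens accepted this round."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. The target model checks each proposed position in order.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)      # first disagreement: keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))      # everything matched: the check yields one bonus token
    return accepted

# Toy models: the target counts upward; the draft agrees except right after a multiple of 4.
target_next = lambda ctx: ctx[-1] + 1
draft_next = lambda ctx: ctx[-1] + 1 if ctx[-1] % 4 else ctx[-1] + 2

good_round = speculative_step([1], draft_next, target_next, k=4)
bad_round = speculative_step([4], draft_next, target_next, k=4)
print(good_round, bad_round)
```

Either way the emitted tokens are exactly what the target model would have produced on its own. A good draft only changes how many of them arrive per expensive pass.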
KV Cache Is Why Long Context Costs Real Money
Suppose a model has 32 layers, hidden dimension 4,096, and uses half precision values. The cached keys and values for a 16,000 token context can easily consume several gigabytes per active session. Multiply that by many users and memory planning becomes one of the central problems of inference engineering.
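The "several gigabytes" figure can be checked with a short calculation. The sketch assumes plain multi head attention, where the cached keys and values for one token in one layer each have as many elements as the hidden dimension. Models that share key and value heads across query heads cache less. The 20 session count is an arbitrary example.

```python
def kv_cache_bytes(n_layers, hidden_dim, seq_len, bytes_per_value=2, batch_size=1):
    """Keys plus values for every layer and position, assuming full multi head attention."""
    per_token = 2 * n_layers * hidden_dim * bytes_per_value   # 2 = one key + one value
    return per_token * seq_len * batch_size

size = kv_cache_bytes(n_layers=32, hidden_dim=4_096, seq_len=16_000)
print(f"{size / 1e9:.2f} GB per session")          # 8.39 GB per session

many = kv_cache_bytes(32, 4_096, 16_000, batch_size=20)
print(f"{many / 1e9:.0f} GB for 20 sessions")      # 168 GB for 20 sessions
```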
This affects design choices everywhere:
- hosted APIs charge more for long prompts because they occupy memory and compute longer
- chat systems truncate or summarise old history
- some inference engines page KV cache between GPU and CPU memory
- continuous batching systems merge decode steps from many users to improve GPU utilisation
The model quality discussion is often public. The cache management discussion is usually what determines whether the service is economically viable.
Sampling Turns Logits Into Actual Text
The network output at each step is a vector of logits over the vocabulary. To emit a token, the system must convert those logits into a sampling decision.
Greedy decoding simply picks the highest probability token every time. This is deterministic and often useful for narrow tasks, but it can become repetitive or brittle. Sampling methods reshape the distribution first:
- temperature scales logits, flattening or sharpening the distribution
- top k keeps only the highest k candidates
- top p keeps the smallest token set whose cumulative probability exceeds a threshold
A typical decode loop is:
    logits = model.next_token_logits(context)
    logits = logits / temperature
    probs = softmax(logits)
    probs = nucleus_filter(probs, p=0.9)
    token = sample(probs)
A lower temperature tends to make answers more predictable. A higher temperature increases diversity but also error rate. This is not "creativity" in a human sense. It is simply how much randomness you allow when drawing from the next token distribution.
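A self-contained version of temperature, top k, and top p sampling might look like the sketch below. The function name, its parameters, and the logits are illustrative choices, not the API of any particular inference library.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Temperature, then optional top k and top p filtering, then one random draw."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token IDs from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:       # smallest set whose mass reaches the threshold
                break
        ranked = kept

    # random.choices renormalises the surviving weights.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]

logits = [3.0, 2.5, 0.5, -1.0, -2.0]     # made-up scores for a 5 token vocabulary
rng = random.Random(0)
draws = [sample_next(logits, temperature=0.8, top_p=0.9, rng=rng) for _ in range(10)]
print(draws)
print(sample_next(logits, top_k=1))      # top_k=1 is greedy decoding: always token 0
```

With these logits, a temperature of 0.8 and top p of 0.9 leave only the two most likely tokens in play, so every draw is token 0 or token 1.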
Quantisation Trades Numerical Precision For Speed And Fit
Unquantised weights are expensive. A model with 70 billion parameters stored at 16 bits per parameter needs about 140 gigabytes just for the raw weights, before accounting for cache and runtime overhead. That is too large for many deployment targets.
Quantisation reduces the number of bits used to store weights, and sometimes activations. Common formats include 8 bit, 6 bit, 4 bit, and even lower in specialised schemes. The practical result is that a model that needed several accelerators at FP16 may fit on one GPU or even run partly on consumer hardware when quantised.
The trade off is approximation error. Good quantisation schemes group weights, preserve scale information, and minimise accuracy loss, but the loss is never zero. Some layers are more sensitive than others. Some tasks degrade more than others.
For local inference this matters enormously. A model that is too large to fit in memory is unusable. A slightly weaker quantised model that fits and runs at 20 tokens per second is often the better system.
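The simplest form of weight quantisation stores one floating point scale per group of weights and a small integer per weight. The sketch below uses symmetric absmax scaling, which is one basic scheme among many and is not claimed to be what any specific format uses. The weights are made up.

```python
def quantise(weights, bits=4):
    """Symmetric absmax quantisation of one group of weights to signed integers."""
    q_max = 2 ** (bits - 1) - 1                     # 7 for 4 bit, 127 for 8 bit
    scale = max(abs(w) for w in weights) / q_max    # one scale stored per group
    if scale == 0.0:
        return [0] * len(weights), 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(q_weights, scale):
    return [q * scale for q in q_weights]

group = [0.12, -0.70, 0.33, 0.05, -0.21, 0.64, -0.02, 0.41]
q, scale = quantise(group, bits=4)
restored = dequantise(q, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(q)
print([round(r, 2) for r in restored])
print("max error:", round(max_error, 3))

# Raw weight storage for the 70 billion parameter example in the text.
for bits in (16, 8, 4):
    print(bits, "bit:", 70e9 * bits / 8 / 1e9, "GB")
```

The restored values are close to the originals but not identical, which is the approximation error described above. The last loop shows what is bought in return: each halving of bit width halves the raw weight footprint.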
Mixture Of Experts Changes Compute Economics
Some modern LLMs use mixture of experts architectures rather than dense feed forward layers everywhere. In an MoE block, a router selects a small subset of expert networks for each token instead of activating every expert on every pass.
This changes the economics:
- total parameter count can grow very large
- per token active compute can stay lower than a dense model of similar total size
- routing and load balancing become important engineering problems
An MoE model might have hundreds of billions of total parameters but only activate a fraction for each token. That makes comparisons such as "parameter count equals capability" even less reliable than before.
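Top k routing for a single token can be sketched as follows. The experts are hypothetical functions that just scale their input, and the router scores are hard-coded where a real model would compute them from the token's hidden state. Normalising the weights over only the selected experts is one common choice, and implementations vary.

```python
import math

def route(router_logits, top_k=2):
    """Pick the top_k experts for one token and softmax their scores into mixing weights."""
    chosen = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)[:top_k]
    m = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - m) for e in chosen]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(chosen, exps)]

def moe_layer(x, router_logits, experts, top_k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    out = [0.0] * len(x)
    for expert_id, weight in route(router_logits, top_k):
        y = experts[expert_id](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Hypothetical experts: each scales its input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 2.0]

routing = route(router_logits)
mixed = moe_layer([1.0, -1.0], router_logits, experts)
print(routing)
print(mixed)
```

Four experts exist, but only two are evaluated for this token. That gap between total parameters and active compute is the whole point of the design.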
Long Context Is Constrained By Memory And Signal Quality
Long context windows are marketed as if they were pure upside. In reality they are bounded by two different limits.
The first limit is physical. More context means more prefill compute and more KV cache memory. A model serving long prompts to many users at once can become memory bound long before it runs out of theoretical floating point capacity.
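The memory side of that limit can be estimated directly. The sketch below assumes a standard decoder that caches one key vector and one value vector per layer, per KV head, per position. The model shape is hypothetical and chosen only to give round numbers.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch=1):
    """Bytes of KV cache for `batch` sequences of `seq_len` tokens each."""
    # Factor 2: one key vector and one value vector per layer, head, and position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch


# Hypothetical model shape, 16 bit cache entries.
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **SHAPE)
one_long_prompt = kv_cache_bytes(32_000, **SHAPE)
sixteen_users = kv_cache_bytes(32_000, batch=16, **SHAPE)

print(f"per token:          {per_token / 1e6:.2f} MB")
print(f"one 32k prompt:     {one_long_prompt / 1e9:.2f} GB")
print(f"sixteen 32k users:  {sixteen_users / 1e9:.2f} GB")
```

The cost grows linearly with both sequence length and concurrent sessions, so a handful of long prompts can consume more memory than the quantised weights themselves.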
The second limit is qualitative. A model can technically accept a very long prompt without using every part of it equally well. Important facts buried deep inside a long noisy prompt may still be ignored if more recent or more repeated cues dominate attention patterns. The context window is the maximum available workspace, not a promise that every token will influence the answer equally.
This is why prompt construction and retrieval quality still matter. A system in Paris that dumps every potentially relevant document chunk into the prompt may stay under the token limit and still answer badly because too much irrelevant text is competing for attention. Good systems select and compress context rather than treating the window as infinite.
Serving At Scale Is Mostly A Systems Problem
Once a model is trained, the next challenge is running it cheaply and reliably. Production inference servers handle:
- request admission control
- prompt tokenisation
- prefill scheduling
- continuous batching of decode steps
- KV cache allocation and eviction
- streaming partial tokens back to clients
- fallback if a worker fails
- model versioning and rollout
A cluster in London or Frankfurt serving interactive traffic cannot just queue one request at a time. It tries to keep the accelerators busy by grouping compatible work. If one user asks for a 30 token reply and another asks for a 2,000 token answer, the scheduler has to balance latency against throughput. Continuous batching helps because many sessions can share one decode cycle, but cache fragmentation and sequence length imbalance complicate it.
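A toy simulation shows why continuous batching helps. It assumes every decode step costs one time unit regardless of batch composition, ignores prefill and memory limits entirely, and uses made up request lengths. Real schedulers are far more involved.

```python
from collections import deque


def static_batching(requests, max_batch):
    """Run fixed batches; the next batch starts only when the slowest member ends."""
    finished_at, start = {}, 0
    for i in range(0, len(requests), max_batch):
        batch = requests[i:i + max_batch]
        for request_id, tokens in batch:
            finished_at[request_id] = start + tokens
        start += max(tokens for _, tokens in batch)
    return finished_at


def continuous_batching(requests, max_batch):
    """Refill free slots before every decode step instead of waiting for the batch."""
    waiting = deque(requests)
    running = {}      # request id -> tokens still to generate
    finished_at = {}  # request id -> decode step at which it completed
    step = 0
    while waiting or running:
        while waiting and len(running) < max_batch:
            request_id, tokens = waiting.popleft()
            running[request_id] = tokens
        step += 1
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] == 0:
                finished_at[request_id] = step
                del running[request_id]
    return finished_at


# (request id, output tokens); each request generates at least one token.
requests = [("a", 3), ("b", 10), ("c", 2), ("d", 4)]
print(static_batching(requests, max_batch=2))
print(continuous_batching(requests, max_batch=2))
```

With two slots, the short requests `c` and `d` finish at steps 5 and 9 under continuous batching instead of steps 12 and 14 under static batching, because they no longer wait for the long request `b` to drain.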
This is why specialised inference engines exist. The raw transformer math is only part of the product. Efficient paging, kernel fusion, tensor parallel layout, and batching policy are what make the difference between a demo and a service.
Retrieval And Tool Use Add Information Outside The Weights
Many users talk about models as if every answer comes from the parameters alone. In deployed systems that is often false. Retrieval augmented generation and tool use change the effective system architecture even when the underlying transformer remains the same.
With retrieval, the application first searches a document set, vector index, or hybrid ranking system. It then inserts selected passages into the prompt. The model still performs next token prediction, but now it is conditioning on external facts fetched at request time. This is one of the main ways teams reduce hallucinations for domain specific tasks.
Tool use goes further. The model may decide to call a web search tool, a SQL query engine, a calculator, a code execution sandbox, or an internal API. The returned results are fed back into the context, and generation continues. To the user this can feel like the model "knows" the answer. Mechanically the model is orchestrating a loop in which external systems provide up to date state and exact computation. The transformer is still the controller that interprets context and predicts the next action or token, but the overall product is no longer just a static language model.
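The control loop around the model can be sketched as follows. Everything here is a stand in: `scripted_model` is a hard coded placeholder for a real model call, the action tuple format is invented for this example, and `multiply` is a deliberately tiny tool.

```python
def multiply(expression):
    """Toy tool: exact integer multiplication of an expression like '17 * 23'."""
    left, right = expression.split("*")
    return int(left) * int(right)


def scripted_model(context):
    """Placeholder for a model call: asks for a tool once, then answers from its result."""
    if len(context) == 1:
        return ("call", "multiply", "17 * 23")
    tool_result = context[-1].rsplit(": ", 1)[1]
    return ("answer", f"The result is {tool_result}.")


def run_with_tools(model_step, tools, prompt, max_turns=5):
    """Alternate between model steps and tool calls until the model gives an answer."""
    context = [prompt]
    for _ in range(max_turns):
        action = model_step(context)
        if action[0] == "answer":
            return action[1], context
        _, name, argument = action
        result = tools[name](argument)
        # The tool output becomes ordinary context for the next model step.
        context.append(f"tool {name} returned: {result}")
    raise RuntimeError("no final answer within max_turns")


answer, context = run_with_tools(scripted_model, {"multiply": multiply},
                                 "What is 17 * 23?")
print(answer)
```

The exact product comes from the tool, not from the weights. The model's part is deciding to call it and then conditioning on the returned text.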
This distinction matters because it clarifies what the weights really store. The weights contain compressed statistical structure learned during training. They do not contain a continuously updated live database of every current fact. Retrieval and tools are how deployed systems bridge that gap.
Why Hallucinations Happen In Mechanical Terms
Hallucination is often discussed as if it were a separate module that sometimes switches on. In practice it is a direct consequence of the training objective. The model is rewarded for predicting likely continuations, not for checking a live database of facts before each token.
If the context strongly suggests a pattern such as a citation format, a likely looking function name, or a plausible sounding RFC section, the model may generate one even when it does not correspond to a real source. From the model's perspective the token sequence is statistically coherent. From the user's perspective it is fabricated.
Retrieval augmented generation changes this by injecting external facts into the prompt. Tool use changes it by letting the system call search, code execution, or databases. But the base model itself is still a token predictor, not a truth oracle.
Fine Tuning Usually Changes A Small Fraction Of The Full System Budget
People often talk about "training a model" and "fine tuning a model" as if they were equally large operations. They are not. Pretraining is where the model absorbs the vast bulk of its statistical structure. Fine tuning is usually a much narrower stage that shifts behaviour in a targeted direction.
The simplest fine tune updates all model weights on a smaller dataset. That works, but it is expensive for very large models. In practice many teams use parameter efficient methods such as adapters or LoRA style low rank updates. Instead of rewriting every weight matrix, they learn smaller additional matrices that alter the effective transformation at selected layers.
This changes deployment options:
- one base model can support multiple specialised adapters
- memory cost is lower than storing full fine tuned copies
- switching behaviour becomes cheaper operationally
For example, a team might keep one dense base model for general language competence and then maintain separate lightweight adapters for customer support tone, financial document extraction, or code review guidance. The base statistical machinery remains the same. The adapter nudges the computation in a specific domain direction.
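A low rank update of this kind can be sketched with plain lists. The shapes follow the usual LoRA convention, with a frozen matrix `W` of shape d_out by d_in and two small trainable matrices `A` (rank by d_in) and `B` (d_out by rank). The adapter names and values are invented for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Parameters in the full matrix versus parameters in the low rank pair."""
    return d_out * d_in, rank * (d_in + d_out)


# One shared base matrix, two hypothetical rank 1 adapters.
W = [[1.0, 0.0], [0.0, 1.0]]
adapters = {
    "support_tone": ([[1.0, 1.0]], [[0.5], [-0.5]]),
    "code_review": ([[1.0, -1.0]], [[0.25], [0.25]]),
}

x = [2.0, 3.0]
for name, (A, B) in adapters.items():
    print(name, lora_forward(W, A, B, x))

full, low_rank = lora_param_counts(4096, 4096, rank=8)
print(f"full matrix: {full:,} parameters, rank 8 adapter: {low_rank:,}")
```

Switching adapters changes only the small `A` and `B` matrices. For a 4096 by 4096 layer at rank 8, the adapter holds 65,536 parameters against 16,777,216 in the full matrix.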
This is also why many open weight communities can iterate quickly. Once a strong pretrained backbone exists, a much smaller training budget can create useful downstream variants. That does not make the process trivial, but it means the marginal cost of a specialised derivative model is far lower than the cost of the original pretraining run.
Context Windows Are A Product Feature Built On Several Different Tricks
When a model advertises a 128,000 token or 1,000,000 token context window, that number is not coming from one switch being flipped. It is the result of several interacting design choices:
- positional encoding behaviour at long ranges
- memory management in the inference engine
- training exposure to longer sequences
- sometimes architectural tricks such as grouped query attention or cache compression
A model may technically accept a long prompt because the serving stack can allocate the KV cache and because the positional method still produces coherent rotations or embeddings at that length. But actual usefulness depends on whether training taught the model to use that much context intelligently.
This is why long context evaluations are harder than they first appear. A system may pass a needle in a haystack benchmark, where one exact fact is planted somewhere in a long prompt and then retrieved later, but still perform poorly on realistic long documents full of competing detail, duplicated phrases, and distracting structure. Good long context behaviour is not simply "the model did not crash". It is the model continuing to prioritise the right information as the prompt grows.
Operationally, long context support also changes economics. If an organisation in Amsterdam offers very long context by default, it must provision enough memory headroom that a few large requests do not degrade latency for every smaller interactive user sharing the same accelerator pool. Context windows are therefore both a modelling choice and a capacity planning choice.
A Full Inference Request Is A Queueing Problem As Much As A Math Problem
From the user side, one chat completion request looks simple. A prompt is sent, tokens stream back, and the session ends. Inside a production serving stack the same request usually passes through several distinct stages, each with its own bottleneck.
A simplified path looks like this:
- the API layer authenticates the request and enforces quotas
- the tokenizer converts text into token IDs
- the scheduler decides which worker or worker group will handle the request
- the prefill phase builds the KV cache for the prompt
- the decode loop begins producing output tokens
- sampling logic chooses emitted tokens
- partial results stream back to the client while the session state remains resident
- the cache is released or retained according to product policy
At small scale this pipeline is easy to hide. At large scale every stage matters. A cluster can be compute efficient and still feel slow if requests queue too long before prefill begins. It can have fast prefill and still produce poor tail latency if a few giant sessions occupy too much memory for too long. It can have good raw throughput and still waste money if the scheduler does a poor job of grouping compatible decode steps together.
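A rough latency model shows how those stages add up. It assumes the prefill pass produces the first output token and every later token costs one decode step of fixed duration. The timings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    queue_wait_s: float        # time spent waiting before prefill starts
    prefill_s: float           # time to process the prompt and emit the first token
    decode_s_per_token: float  # time per subsequent output token
    output_tokens: int

    def time_to_first_token(self) -> float:
        return self.queue_wait_s + self.prefill_s

    def total_latency(self) -> float:
        remaining = max(self.output_tokens - 1, 0)
        return self.time_to_first_token() + remaining * self.decode_s_per_token


# Same model speed, different queueing: only the wait before prefill changes.
idle_cluster = RequestTiming(0.0, 0.5, 0.25, output_tokens=5)
busy_cluster = RequestTiming(2.0, 0.5, 0.25, output_tokens=5)

for label, timing in (("idle", idle_cluster), ("busy", busy_cluster)):
    print(label, timing.time_to_first_token(), timing.total_latency())
```

The busy cluster runs the forward pass just as fast, yet the user waits five times longer for the first token. Queueing, not arithmetic, dominates what the reply feels like.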
This is why LLM serving feels like systems engineering rather than only machine learning. The transformer forward pass is the core computation, but user experience depends on admission control, batching policy, prompt cache hit rate, and how aggressively the service isolates long context requests from short interactive ones.
Weight Count, Memory Footprint, And Throughput Are Related But Not Identical
Users often collapse model deployment questions into one simple dimension: how many parameters does the model have. Parameter count matters, but it is not the only useful number. A deployment engineer usually cares about at least three related but different quantities:
- total parameter count
- resident memory footprint after quantisation and runtime overhead
- achievable tokens per second at a given batch and context size
A 70 billion parameter model may be too large for one accelerator in FP16, practical in 4 bit form across a smaller number of devices, and still disappointingly slow if the workload uses long prompts that dominate memory bandwidth. A smaller dense model may fit comfortably and deliver better real user latency even if its raw benchmark score is lower.
This is why local inference communities spend so much time on quantisation formats, GPU memory maps, offload policies, and attention kernel choices. The question is not just "can the model run". The question is "can it run at acceptable speed with acceptable quality on the hardware and prompt lengths that actually exist".
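A common rule of thumb helps with that question. It assumes single sequence decoding is limited by memory bandwidth, so each generated token requires streaming the weights and the current KV cache from memory once. The sketch treats that as an upper bound and ignores compute, kernel efficiency, and batching. The bandwidth and cache sizes are hypothetical.

```python
def decode_tokens_per_second_bound(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Upper bound on tokens per second if each token reads weights and cache once."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_cache_bytes)


weights_4bit = 70_000_000_000 * 4 / 8  # 35 GB of raw 4 bit weights
bandwidth = 1e12                       # hypothetical 1 TB/s of memory bandwidth

short_prompt = decode_tokens_per_second_bound(weights_4bit, 0.1e9, bandwidth)
long_prompt = decode_tokens_per_second_bound(weights_4bit, 10e9, bandwidth)

print(f"short prompt bound: {short_prompt:.1f} tokens/s")
print(f"long prompt bound:  {long_prompt:.1f} tokens/s")
```

Under these assumptions the same weights on the same device lose roughly a fifth of their ceiling once a long prompt adds 10 gigabytes of cache to every decode step.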
For hosted services the same logic appears in a different form. The provider is balancing:
- quality expectations from the product team
- GPU cost per hour
- context window promises
- concurrency targets
- latency targets
The best model on a benchmark chart may be the wrong deployment choice if its memory and throughput profile makes the product uneconomic. This is one reason model selection is often less glamorous inside companies than it looks from the outside. It is a capacity planning decision dressed up as an AI decision.
The Right Mental Model
The most useful mental model is not "the model understands exactly like a human" and not "the model is just autocomplete in a trivial sense". Both are too crude.
An LLM is a very large conditional sequence model built from transformer blocks. It maps token histories to probability distributions over the next token. Attention gives it a powerful content based lookup mechanism over prior context. Deep stacks of residual blocks let it compose many layers of representation. Training on vast corpora compresses many regularities of text into weight matrices. Inference engineering then turns those matrices into a service by solving memory, scheduling, and latency problems.
That is enough to explain most of what users see:
- fluent continuation
- sensitivity to prompt wording
- strong code and prose pattern completion
- large context costs
- improvement from retrieval and tools
- failure modes that look confident because the generated sequence is locally plausible
The architecture is not mystical. It is also not shallow. A modern LLM is a dense concentration of statistics, linear algebra, optimisation, and systems engineering. If you keep next token prediction at the centre, everything else around it becomes easier to reason about.
The tokeniser decides the symbols. The embeddings map those symbols into vectors. Positional encoding preserves order. Attention decides what prior context matters. Feed forward blocks transform representations. Layer stacks deepen the computation. Cross entropy training tunes the weights. KV cache makes autoregressive decoding practical. Quantisation and batching make deployment affordable.
That is the machine.