← Back to Logs

How Search Engines Rank Pages: From Crawling to Neural Retrieval

Try the interactive lab for this articleTake the quiz (6 questions · ~5 min)

Every time you type a query into a search engine, something extraordinary happens behind the scenes. A system that indexes hundreds of billions of web pages identifies, scores, and returns the ten most relevant results in under 200 milliseconds. That is not a database lookup. It is a distributed systems problem layered with an information retrieval problem layered with a machine learning problem, all running simultaneously across thousands of machines in data centres from Dublin to Singapore.

This post walks through the full pipeline: how pages are discovered, how they are indexed, how they are scored, and how modern neural models are changing the entire retrieval paradigm. If you have ever wondered what actually happens between pressing Enter and seeing results, this is the technical answer.

1. The Scale of the Problem

Google's index contains hundreds of billions of pages. Bing's is smaller but still measures in the tens of billions. The raw text alone, before any metadata or link graph, runs into petabytes. And this corpus is not static. Pages are created, updated, and deleted constantly. News articles appear every second. E-commerce sites change prices multiple times per day. Spam pages are generated by the millions.

A search engine must do three things with this corpus:

  1. Discover pages by crawling the web.
  2. Organize pages into an index that supports millisecond lookups.
  3. Rank pages for any given query using a combination of text matching, link analysis, user signals, and learned models.

Each stage is hard on its own. The combination is what makes web search one of the most demanding engineering problems ever built.

Consider the constraints. A user in Barcelona types "best sourdough starter recipe" and expects results in under 200ms. In that window, the search engine must:

  • Parse and understand the query.
  • Retrieve candidate documents from an index distributed across thousands of shards.
  • Score each candidate against hundreds of features.
  • Re-rank the top results with a heavy ML model.
  • Generate snippets.
  • Apply personalization.
  • Return the response.

All of this over a network that spans continents. The latency budget for each stage is measured in single-digit milliseconds.

2. Crawling: Discovering the Web

Before you can rank pages, you need to find them. That is the job of the crawler.

The Crawl Frontier

A crawler maintains a URL queue, commonly called the crawl frontier. The process is straightforward in concept:

  1. Pick a URL from the frontier.
  2. Fetch the page.
  3. Parse the HTML.
  4. Extract all outgoing links.
  5. Add new links to the frontier.
  6. Repeat.

In practice, this simple loop hides enormous complexity. Google's crawler (Googlebot) and Microsoft's (bingbot) operate at a scale of billions of fetches per day. The frontier is not a simple FIFO queue. It is a priority queue with scheduling policies that consider:

  • Page importance: Pages with more inbound links get crawled more often.
  • Update frequency: A news site's front page might be re-crawled every few minutes. A static academic paper might be re-crawled once a month.
  • Crawl budget: For any given domain, the crawler allocates a limited number of fetches per unit of time. A site with 50 million pages will not have all of them crawled every day. The site owner can influence priorities via XML sitemaps.

Politeness and robots.txt

Crawlers must be polite. Hammering a server with thousands of concurrent requests would amount to a denial-of-service attack. The standard mechanism is robots.txt, a file at the root of every domain that declares which paths crawlers may or may not visit, and optionally specifies a crawl delay.

User-agent: Googlebot
Disallow: /admin/
Disallow: /tmp/
Crawl-delay: 2
 
User-agent: *
Disallow: /private/

The Crawl-delay directive tells the crawler to wait at least 2 seconds between consecutive requests to this host. Not all crawlers respect it, but the major ones do.

Near-Duplicate Detection

The web is full of duplicates. Content scrapers, syndication networks, printer-friendly versions, HTTP vs HTTPS variants, www vs non-www. A search engine that indexes every duplicate wastes storage and risks showing users the same content multiple times.

Near-duplicate detection typically uses locality-sensitive hashing algorithms:

  • SimHash: Developed at Google, SimHash produces a fingerprint for each document. Documents with fingerprints that differ in only a few bits are considered near-duplicates. The Hamming distance between two SimHash values correlates with the cosine similarity of the underlying documents.
  • MinHash: Based on Jaccard similarity of shingle sets (short overlapping substrings of the document). MinHash estimates the probability that two documents share the same shingles. It is commonly used with Locality-Sensitive Hashing (LSH) to find candidate pairs efficiently.

Both techniques allow duplicate detection at web scale without computing all-pairs similarity, which would be computationally infeasible.

The Discovered Web vs the Deep Web

Crawlers can only follow links. Content behind login walls, dynamically generated pages without inbound links, and databases that expose results only through search forms remain invisible. This is the deep web, and it is estimated to be many times larger than the surface web that search engines index. Some search engines attempt limited deep web crawling by submitting queries to known forms, but the coverage is inherently incomplete.

3. The Inverted Index

Once pages are crawled and parsed, they must be stored in a structure that supports fast retrieval. The fundamental data structure of search is the inverted index.

Forward Index vs Inverted Index

A forward index maps each document to the terms it contains:

Document 1 → ["search", "engine", "ranking", "algorithm"]
Document 2 → ["search", "query", "processing", "latency"]
Document 3 → ["ranking", "neural", "transformer", "BERT"]

This is how you would naturally think about documents. But answering the query "search ranking" requires scanning every document to check if it contains those terms. With billions of documents, that is too slow.

An inverted index flips the mapping. It maps each term to the list of documents that contain it:

"search"      → [Doc1, Doc2]
"engine"      → [Doc1]
"ranking"     → [Doc1, Doc3]
"algorithm"   → [Doc1]
"query"       → [Doc2]
"processing"  → [Doc2]
"latency"     → [Doc2]
"neural"      → [Doc3]
"transformer" → [Doc3]
"BERT"        → [Doc3]

Now answering "search ranking" is fast: intersect the postings lists for "search" and "ranking" to get [Doc1].

Postings List Structure

Each entry in a postings list is more than just a document ID. A typical posting includes:

Field Description
Document ID Unique identifier for the document
Term frequency How many times the term appears in this document
Field Where the term appeared (title, body, URL, anchor text)
Positions The exact positions of the term within the document

Positions enable phrase queries ("search engine ranking" as an exact phrase) and proximity scoring (terms that appear near each other are more likely to be relevant).

Compression

At web scale, the inverted index is enormous. Compression is essential. Two key techniques:

Delta encoding: Instead of storing absolute document IDs, store the difference between consecutive IDs in a postings list. If the list is [101, 105, 106, 120], store [101, 4, 1, 14]. The deltas are smaller numbers that compress better.

Variable-byte encoding (VByte): Represent each integer using a variable number of bytes. Small numbers use one byte. Larger numbers use more. The high bit of each byte indicates whether more bytes follow.

Value: 5      → 1 byte:  10000101
Value: 300    → 2 bytes: 00000010 10101100
Value: 100000 → 3 bytes: 00000110 00001101 10100000

These techniques reduce index size by 4x to 10x while still allowing fast decoding during query processing.

Index Segments and Merging

Systems like Apache Lucene (which underlies Elasticsearch and Solr) use a segmented index architecture. New documents are first written to an in-memory buffer. When the buffer fills, it is flushed to disk as an immutable index segment. Periodically, segments are merged into larger segments.

This design has several advantages:

  • Writes are fast because they go to memory first.
  • Segments are immutable, so reads do not need locks.
  • Merging amortizes the cost of reorganising the index.

The tradeoff is that queries must search across multiple segments and merge results, but in practice this overhead is small compared to the benefits.

4. TF-IDF: The Classic Scoring Model

With the inverted index in place, we need a way to score how relevant each document is to a query. The oldest and most intuitive approach is TF-IDF.

Term Frequency (TF)

Term frequency measures how often a query term appears in a document. The intuition is simple: if a document mentions "PageRank" 15 times, it is probably more about PageRank than a document that mentions it once.

Raw term frequency can be used directly, but it is common to dampen it with a logarithm to prevent long documents from dominating:

tf(t, d) = 1 + log(f(t, d))    if f(t, d) > 0
tf(t, d) = 0                    otherwise

where f(t, d) is the raw count of term t in document d.

Inverse Document Frequency (IDF)

Not all terms are equally informative. The word "the" appears in nearly every English document. The word "Fourier" appears in very few. IDF captures this:

idf(t) = log(N / df(t))

where N is the total number of documents in the corpus and df(t) is the number of documents containing term t.

A term that appears in every document has df(t) = N, giving idf = log(1) = 0. A term that appears in only 10 out of 10 million documents has idf = log(1,000,000) = 6. Rare terms get high IDF; common terms get low IDF.

The TF-IDF Score

The combined score for a term in a document is:

TF-IDF(t, d) = tf(t, d) × idf(t)

For a multi-term query, the document's total score is typically the sum across all query terms:

score(q, d) = Σ TF-IDF(t, d)  for each term t in query q

Worked Example

Suppose we have a corpus of 10,000 documents and we query for "Fourier transforms."

Term Document frequency (df) IDF = log(10000/df)
Fourier 50 log(200) = 5.30
transforms 200 log(50) = 3.91

For Document A, which contains "Fourier" 8 times and "transforms" 3 times:

tf("Fourier", A)    = 1 + log(8)  = 1 + 2.08 = 3.08
tf("transforms", A) = 1 + log(3)  = 1 + 1.10 = 2.10
 
TF-IDF("Fourier", A)    = 3.08 × 5.30 = 16.32
TF-IDF("transforms", A) = 2.10 × 3.91 = 8.21
 
Total score(A) = 16.32 + 8.21 = 24.53

For Document B, which contains "Fourier" once and "transforms" once:

tf("Fourier", B)    = 1 + log(1) = 1.00
tf("transforms", B) = 1 + log(1) = 1.00
 
TF-IDF("Fourier", B)    = 1.00 × 5.30 = 5.30
TF-IDF("transforms", B) = 1.00 × 3.91 = 3.91
 
Total score(B) = 5.30 + 3.91 = 9.21

Document A scores 24.53 vs Document B's 9.21. The document that discusses Fourier transforms extensively ranks higher.

Limitations of TF-IDF

TF-IDF treats documents as bags of words. It has no understanding of:

  • Word order ("dog bites man" vs "man bites dog")
  • Synonyms ("car" vs "automobile")
  • Semantics ("bank" the financial institution vs "bank" the river bank)
  • Document length (a 10,000-word document naturally accumulates higher term frequencies than a 200-word document)

These limitations motivated the development of BM25.

5. BM25: The Better Scoring Function

Okapi BM25 (Best Matching 25), developed at City University London in the 1990s, is the workhorse of modern text retrieval. It addresses the key weaknesses of raw TF-IDF while remaining computationally cheap enough for first-stage retrieval over billions of documents.

The BM25 Formula

For a query Q containing terms q1, q2, ..., qn, the BM25 score of a document D is:

score(D, Q) = Σ IDF(qi) × [ f(qi, D) × (k1 + 1) ] / [ f(qi, D) + k1 × (1 - b + b × |D| / avgdl) ]

where:

Symbol Meaning
f(qi, D) Frequency of term qi in document D
|D| Length of document D (in terms)
avgdl Average document length across the corpus
k1 Term frequency saturation parameter (typically 1.2 to 2.0)
b Length normalization parameter (typically 0.75)
IDF(qi) Inverse document frequency of term qi

The IDF component in BM25 is usually computed as:

IDF(qi) = log( (N - n(qi) + 0.5) / (n(qi) + 0.5) + 1 )

where N is the total number of documents and n(qi) is the number of documents containing qi.

Why BM25 Beats TF-IDF

Two critical improvements:

1. Term frequency saturation. In TF-IDF, doubling the term frequency doubles the contribution. In BM25, the contribution plateaus. The k1 parameter controls how quickly saturation occurs. With k1 = 1.2, the difference between a term appearing 10 times vs 100 times is small. This prevents keyword-stuffed documents from dominating.

The saturation curve looks like this:

BM25 TF component

 2.2 ┤                    _______________
     │                ___/
 1.8 ┤            ___/
     │         __/
 1.4 ┤       _/
     │     _/
 1.0 ┤   _/
     │  /
 0.6 ┤ /
     │/
 0.2 ┤
     └──────────────────────────────────→
     0    5    10   15   20   25   30
                Term Frequency

2. Document length normalization. The b parameter controls how much document length affects scoring. With b = 0, there is no length normalization. With b = 1, full normalization. The typical value of b = 0.75 penalizes long documents slightly, on the assumption that a term appearing in a short, focused document is more significant than the same term appearing in a long, rambling one. But the penalty is not so severe that long, genuinely thorough documents are unfairly punished.

Worked Example

Let us score a query "distributed consensus" against two documents.

Corpus statistics:

  • N = 1,000,000 documents
  • avgdl = 500 terms
  • n("distributed") = 12,000
  • n("consensus") = 3,000
  • k1 = 1.2, b = 0.75

IDF calculations:

IDF("distributed") = log((1000000 - 12000 + 0.5) / (12000 + 0.5) + 1)
                   = log(988000.5 / 12000.5 + 1)
                   = log(82.33 + 1)
                   = log(83.33)
                   = 4.42
 
IDF("consensus")   = log((1000000 - 3000 + 0.5) / (3000 + 0.5) + 1)
                   = log(997000.5 / 3000.5 + 1)
                   = log(332.28 + 1)
                   = log(333.28)
                   = 5.81

Document A: 300 terms, "distributed" appears 5 times, "consensus" appears 8 times.

BM25("distributed", A) = 4.42 × (5 × 2.2) / (5 + 1.2 × (1 - 0.75 + 0.75 × 300/500))
                       = 4.42 × 11.0 / (5 + 1.2 × (0.25 + 0.45))
                       = 4.42 × 11.0 / (5 + 1.2 × 0.70)
                       = 4.42 × 11.0 / (5 + 0.84)
                       = 4.42 × 11.0 / 5.84
                       = 4.42 × 1.883
                       = 8.32
 
BM25("consensus", A)   = 5.81 × (8 × 2.2) / (8 + 1.2 × 0.70)
                       = 5.81 × 17.6 / (8 + 0.84)
                       = 5.81 × 17.6 / 8.84
                       = 5.81 × 1.991
                       = 11.57
 
Total BM25(A) = 8.32 + 11.57 = 19.89

Document B: 2,000 terms, "distributed" appears 15 times, "consensus" appears 3 times.

Length factor = 1 - 0.75 + 0.75 × 2000/500 = 0.25 + 3.0 = 3.25
 
BM25("distributed", B) = 4.42 × (15 × 2.2) / (15 + 1.2 × 3.25)
                       = 4.42 × 33.0 / (15 + 3.9)
                       = 4.42 × 33.0 / 18.9
                       = 4.42 × 1.746
                       = 7.72
 
BM25("consensus", B)   = 5.81 × (3 × 2.2) / (3 + 1.2 × 3.25)
                       = 5.81 × 6.6 / (3 + 3.9)
                       = 5.81 × 6.6 / 6.9
                       = 5.81 × 0.957
                       = 5.56
 
Total BM25(B) = 7.72 + 5.56 = 13.28

Document A (shorter, focused, high term frequency for "consensus") scores 19.89 vs Document B's 13.28. Despite Document B mentioning "distributed" three times as often, the length normalization and saturation effects of BM25 correctly favour the more focused document.

6. PageRank: Link Analysis

In the late 1990s, text-based scoring alone could not solve the quality problem. A spammer could stuff a page with keywords and rank for anything. Larry Page and Sergey Brin's insight was that links encode human editorial judgment. If many reputable pages link to a page, that page is probably valuable.

The Random Surfer Model

PageRank models a hypothetical random surfer who starts on a random page and follows links. At each step, the surfer either:

  • Follows a random outgoing link from the current page (with probability d), or
  • Jumps to a completely random page (with probability 1 - d).

The damping factor d is typically set to 0.85. The PageRank of a page is the long-run probability that the random surfer lands on that page.

The Formula

The PageRank of page i is defined recursively:

PR(i) = (1 - d) / N + d × Σ PR(j) / L(j)

where the sum is over all pages j that link to page i, L(j) is the number of outgoing links from page j, and N is the total number of pages.

In matrix form, if M is the column-stochastic adjacency matrix (each column sums to 1), PageRank is the stationary distribution of the Markov chain:

PR = (1 - d) / N × 1 + d × M × PR

This is an eigenvector problem. The PageRank vector is the principal eigenvector of the modified adjacency matrix.

Power Iteration

Computing PageRank for billions of pages uses the power iteration method:

import numpy as np
 
def pagerank(M, d=0.85, epsilon=1e-8, max_iter=100):
    """
    M: adjacency matrix (M[i][j] = 1/L(j) if j links to i, else 0)
    d: damping factor
    """
    N = M.shape[0]
    pr = np.ones(N) / N  # uniform initial distribution
 
    for iteration in range(max_iter):
        pr_new = (1 - d) / N + d * M @ pr
 
        # Check convergence
        if np.abs(pr_new - pr).sum() < epsilon:
            print(f"Converged after {iteration + 1} iterations")
            return pr_new
 
        pr = pr_new
 
    return pr

For the full web graph, this computation is distributed across thousands of machines using MapReduce or similar frameworks. Each iteration requires a full pass over the link graph. Convergence typically occurs within 50 to 100 iterations.

A Small Example

Consider four pages with the following link structure:

A → B, C
B → C
C → A
D → A, B, C

Page A has links from C and D. Page B has links from A and D. Page C has links from A, B, and D. Page D has no incoming links (except the random jump).

After running power iteration with d = 0.85:

Page PageRank
A 0.31
B 0.20
C 0.34
D 0.15

Page C has the highest PageRank because it receives links from three other pages. Page D has the lowest because no one links to it; its only PageRank comes from the random jump component.

Why PageRank Alone Is Not Enough

PageRank is query-independent. A page has a single PageRank value regardless of what the user is searching for. This means:

  • A high-PageRank page about cooking will rank well for cooking queries, which is correct.
  • But that same cooking page might also rank for completely unrelated queries if it happens to contain matching terms.

Additionally, PageRank is vulnerable to manipulation:

  • Link farms: Networks of pages that link to each other to inflate PageRank.
  • Paid links: Buying links from high-PageRank sites.
  • Spam blogs: Auto-generated content with strategic outbound links.

Modern search engines use PageRank as one signal among hundreds, not as the sole ranking factor. They also analyse link diversity (links from many different domains are worth more than many links from one domain), anchor text (the clickable text of a link provides context about the target page), and various spam detection heuristics.

7. Query Processing Pipeline

When a user in Berlin types "how does HTTPS work" and presses Enter, the following pipeline executes in roughly 200 milliseconds.

Step 1: Query Parsing and Understanding

The raw query string is tokenized and analysed:

  • Spell correction: "hwo does HTPS work" is corrected to "how does HTTPS work." This uses a combination of language models, query logs, and edit distance algorithms.
  • Query expansion: Synonyms and related terms may be added. "HTTPS" might expand to include "TLS" and "SSL."
  • Intent classification: Is the user looking for a tutorial, a Wikipedia article, a news story? The intent affects which documents and features are prioritized.
  • Entity recognition: "HTTPS" is recognized as a protocol. "Berlin" in a local query is recognized as a city.

Step 2: Retrieval (First Stage)

The expanded query is run against the inverted index using BM25 (or a similar fast scoring function). This stage is designed for recall, not precision. The goal is to retrieve the top 1,000 to 10,000 candidate documents quickly.

The inverted index is sharded across many machines. Each shard processes the query independently and returns its local top-K results. A coordinator merges these partial results.

Query: "how does HTTPS work"
 
Shard 1: [Doc_4521 (score 18.3), Doc_9102 (score 16.1), ...]
Shard 2: [Doc_2310 (score 19.7), Doc_7788 (score 15.4), ...]
Shard 3: [Doc_1145 (score 17.9), Doc_5566 (score 14.8), ...]
...
 
Coordinator merges → Top 1000 candidates globally

Step 3: Re-ranking (Second Stage)

The top candidates from stage one are now scored by a much more expensive model. This model has access to hundreds of features:

  • BM25 score from stage one
  • PageRank of the document
  • Click-through rate for this query-document pair
  • Document freshness (when was it last updated?)
  • Content quality signals
  • Whether the query terms appear in the title, URL, or headings
  • User location and language
  • Mobile-friendliness of the page

In modern systems, this re-ranker is often a neural network (more on this in Section 9).

Step 4: Snippet Generation

For each result on the first page, the search engine must generate a snippet: a short excerpt that shows the user why this document is relevant. The snippet is typically the passage from the document that best matches the query, with query terms highlighted.

Step 5: Personalization

Results may be adjusted based on the user's search history, location, language, and device. A user in Athens searching for "football" expects results about association football, not American football.

The Two-Stage Architecture

This retrieve-then-rerank pattern is the dominant architecture in modern search:

┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Query       │────→│  BM25         │────→│  ML Re-ranker │────→ Results
│  Processing  │     │  Retrieval    │     │  (Top-K)      │
│              │     │  (Fast/Coarse)│     │  (Slow/Precise)│
└─────────────┘     └──────────────┘     └──────────────┘
    ~10ms               ~50ms                ~100ms

The first stage prioritizes speed and recall. The second stage prioritizes precision. This separation lets you use simple, fast algorithms on the full index and expensive, accurate models on a small candidate set.

8. Learning to Rank

Before neural networks took over re-ranking, the dominant approach was Learning to Rank (LTR): train a traditional ML model to predict relevance using hand-crafted features.

Feature Engineering

A typical LTR system uses 200 to 500 features per query-document pair. Some examples:

Feature Category Examples
Text match BM25 score (title), BM25 score (body), BM25 score (URL), query term coverage
Link analysis PageRank, number of inbound links, link diversity, domain authority
User signals Click-through rate, dwell time, bounce rate, pogo-sticking rate
Freshness Page age, last modified date, content change frequency
Quality Spelling errors, reading level, ad density, page load speed
Query-document Query length, document length, query-title overlap, exact match in URL

Training Data

LTR models are trained on human relevance judgments. Professional raters evaluate query-document pairs on a scale:

Grade Meaning
0 Irrelevant
1 Slightly relevant
2 Relevant
3 Highly relevant
4 Perfect match / navigational

Thousands of queries are judged, each with dozens of candidate documents. This produces training sets of millions of labeled examples. Search engines also use implicit feedback from user clicks, though click data is noisy (position bias, attractiveness bias).

Three Approaches to LTR

Pointwise: Treat each query-document pair independently. Train a regression model to predict the relevance grade. Simple, but ignores the fact that search is a ranking problem, not a regression problem.

Pairwise: Train on pairs of documents for the same query. The model learns that Document A should rank above Document B. RankSVM and RankNet are classic pairwise methods.

Listwise: Optimise a ranking metric directly over the full list of documents. LambdaMART, based on gradient boosted decision trees, is the most successful listwise approach and was the standard in production search systems for over a decade.

LambdaMART

LambdaMART combines two ideas:

  1. MART (Multiple Additive Regression Trees): An ensemble of decision trees where each tree corrects the errors of the previous trees.
  2. Lambda gradients: Instead of optimising a surrogate loss, compute gradients that directly reflect changes in NDCG (see Section 10).

The algorithm builds trees iteratively. At each step, it computes for every document pair (i, j) where document i is ranked below j but should be ranked above it: how much would swapping i and j improve NDCG? This improvement becomes the gradient signal for the next tree.

# Pseudocode for LambdaMART training
for round in range(num_trees):
    # For each query
    for query in training_queries:
        documents = get_documents(query)
        current_scores = model.predict(documents)
        current_ranking = argsort(current_scores, descending=True)
 
        # Compute lambda gradients for each document pair
        for i, j in pairs(documents):
            if label[i] > label[j]:  # i should rank above j
                delta_ndcg = compute_ndcg_delta(i, j, current_ranking)
                lambda_ij = sigmoid(score[j] - score[i]) * abs(delta_ndcg)
                gradient[i] += lambda_ij
                gradient[j] -= lambda_ij
 
    # Fit a regression tree to the lambda gradients
    tree = fit_tree(features, gradients)
    model.add_tree(tree, learning_rate=0.1)

LambdaMART is fast at inference time (evaluating a few hundred trees takes microseconds), interpretable (you can inspect which features the trees split on), and performs well with tabular features. It dominated competitions like the Yahoo Learning to Rank Challenge and was used in production at Microsoft, Yahoo, and other search engines for years.

9. Neural Retrieval: Transformers in Search

Starting around 2019, transformer-based models began to outperform traditional LTR systems. The key breakthrough was BERT (Bidirectional Encoder Representations from Transformers), which could capture deep semantic relationships between query and document text.

Cross-Encoders for Re-ranking

The simplest application of BERT in search is as a re-ranker. Concatenate the query and document, pass them through BERT, and use the [CLS] token output to predict relevance:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
 
tokenizer = AutoTokenizer.from_pretrained("cross-encoder/ms-marco-MiniLM-L-12-v2")
model = AutoModelForSequenceClassification.from_pretrained(
    "cross-encoder/ms-marco-MiniLM-L-12-v2"
)
 
query = "how does HTTPS work"
document = "HTTPS uses TLS to encrypt HTTP traffic between client and server..."
 
inputs = tokenizer(
    query, document,
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding=True
)
 
with torch.no_grad():
    logits = model(**inputs).logits
    relevance_score = logits.squeeze().item()
 
print(f"Relevance score: {relevance_score:.4f}")

Cross-encoders are accurate because they allow full attention between query and document tokens. The query word "HTTPS" can attend to the document word "TLS" and learn that they are related. But they are slow. Each query-document pair requires a full forward pass through BERT. Scoring 1,000 candidates takes seconds, not milliseconds.

Bi-Encoders for Dense Retrieval

Bi-encoders solve the latency problem by encoding queries and documents independently:

from sentence_transformers import SentenceTransformer
import numpy as np
 
model = SentenceTransformer("msmarco-distilbert-base-v4")
 
# Encode documents offline (once)
documents = [
    "HTTPS uses TLS to encrypt HTTP traffic...",
    "The Python programming language was created by...",
    "Transport Layer Security provides authentication and encryption...",
]
doc_embeddings = model.encode(documents)
 
# Encode query at search time
query = "how does HTTPS work"
query_embedding = model.encode([query])
 
# Compute cosine similarity
similarities = np.dot(doc_embeddings, query_embedding.T).squeeze()
 
for doc, sim in sorted(zip(documents, similarities), key=lambda x: -x[1]):
    print(f"{sim:.4f}: {doc[:60]}...")

Documents are encoded once and stored as dense vectors. At query time, only the query needs encoding (a few milliseconds), and retrieval becomes a nearest-neighbor search in vector space.

Approximate Nearest Neighbor (ANN) Search

With billions of document vectors, exact nearest-neighbor search is too slow. ANN algorithms trade a small amount of accuracy for massive speed improvements:

  • HNSW (Hierarchical Navigable Small World): A graph-based approach where vectors are connected in a navigable graph. Search starts at a random entry point and greedily moves toward the query vector. Used in FAISS and many production systems.
  • IVF (Inverted File Index): Partition the vector space into clusters using k-means. At query time, search only the nearest clusters.
  • Product Quantization: Compress vectors by splitting them into sub-vectors and quantizing each independently. Reduces memory by 10x to 100x while enabling fast distance computation.

A system like FAISS can search one billion 768-dimensional vectors in under 10 milliseconds on a single machine.

ColBERT: Late Interaction

ColBERT (Contextualized Late Interaction over BERT) is a compromise between cross-encoders and bi-encoders. It encodes queries and documents independently (like a bi-encoder) but retains all token embeddings instead of compressing to a single vector. At search time, it computes a lightweight interaction between query token embeddings and document token embeddings:

score(Q, D) = Σ max_j (Q_i · D_j)    for each query token i

Each query token finds its best-matching document token, and the scores are summed. This captures fine-grained term-level interactions without the cost of full cross-attention.

Cross-Encoder vs Bi-Encoder vs ColBERT

Property Cross-Encoder Bi-Encoder ColBERT
Query-document interaction Full attention None (dot product) Late interaction
Accuracy Highest Lower Between
Latency per document ~10ms ~0.01ms ~0.1ms
Can be used for retrieval No (too slow) Yes Yes
Typical use Re-ranking First-stage retrieval First-stage or re-ranking

The Shift from Term Matching to Semantic Understanding

The fundamental change is this: BM25 matches terms. If the query says "car" and the document says "automobile," BM25 gives no credit. Neural models understand that "car" and "automobile" are synonyms. They understand that "how to fix a leaky faucet" is relevant to a document titled "plumbing repair guide" even though the exact terms do not match.

Google's BERT integration (announced in 2019) was reported to affect 10% of English queries. For queries like "can you get medicine for someone pharmacy," BERT understood that "for someone" was the key phrase, meaning the user wanted to know if they could pick up a prescription on behalf of another person. A term-matching system would miss this nuance entirely.

MUM: Multitask Unified Model

Google's MUM, announced in 2021, takes the idea further. MUM is a T5-based model that is 1,000 times more powerful than BERT (by Google's own claim). It is multimodal (text and images), multilingual (75 languages), and multitask (it can answer, summarise, translate, and reason). MUM represents the direction search is heading: not just matching documents to queries, but understanding information needs at a deeper level.

10. Evaluation: How Do You Know the Results Are Good?

Building a ranking system is only half the problem. The other half is measuring whether it works.

Precision and Recall

The most basic metrics:

Precision = (relevant documents retrieved) / (total documents retrieved)
Recall    = (relevant documents retrieved) / (total relevant documents)

Precision tells you how many of your results are good. Recall tells you how many of the good results you found. There is usually a tension between the two: casting a wider net (higher recall) often means including more noise (lower precision).

Precision@K

In web search, users rarely look past the first page. Precision@10 measures the fraction of relevant documents in the top 10 results:

Precision@10 = (relevant documents in top 10) / 10

If 7 out of 10 results are relevant, Precision@10 = 0.7.

NDCG: Normalized Discounted Cumulative Gain

NDCG is the standard evaluation metric for ranked retrieval. It accounts for two things that Precision@K ignores:

  1. Graded relevance: Documents are not just "relevant" or "not relevant." A highly relevant document is worth more than a marginally relevant one.
  2. Position discount: A relevant document at position 1 is worth more than the same document at position 10.

Cumulative Gain (CG):

CG@K = Σ rel(i)    for i = 1 to K

where rel(i) is the relevance grade of the document at position i.

Discounted Cumulative Gain (DCG):

DCG@K = Σ rel(i) / log2(i + 1)    for i = 1 to K

The logarithmic discount means that position 1 gets full credit, position 2 gets credit divided by log2(3) = 1.58, position 3 gets credit divided by log2(4) = 2.0, and so on.

Normalized DCG:

NDCG@K = DCG@K / IDCG@K

where IDCG@K is the DCG of the ideal ranking (all documents sorted by relevance, best first). NDCG ranges from 0 to 1, where 1 means the ranking is perfect.

Worked Example

Suppose for a query, we have 5 results with these relevance grades (on a 0-3 scale):

Position Document Relevance Discount (1/log2(i+1)) Discounted Gain
1 Doc A 3 1.000 3.000
2 Doc B 1 0.631 0.631
3 Doc C 2 0.500 1.000
4 Doc D 0 0.431 0.000
5 Doc E 3 0.387 1.161
DCG@5 = 3.000 + 0.631 + 1.000 + 0.000 + 1.161 = 5.792

The ideal ranking would be: relevance [3, 3, 2, 1, 0]:

IDCG@5 = 3/1.000 + 3/0.631 + 2/0.500 + 1/0.431 + 0/0.387

Wait, let me recompute that correctly. The discount is 1/log2(i+1), so the discounted gain is rel(i)/log2(i+1):

IDCG@5 = 3/1.000 + 3/1.585 + 2/2.000 + 1/2.322 + 0/2.585
       = 3.000 + 1.893 + 1.000 + 0.431 + 0.000
       = 6.324
NDCG@5 = 5.792 / 6.324 = 0.916

An NDCG of 0.916 indicates a good ranking, but not perfect. The main issue is that Doc E (relevance 3) is at position 5 instead of position 2, and Doc B (relevance 1) is at position 2 instead of position 4.

Click-Through Rate and Implicit Feedback

Human relevance judgments are expensive. Search engines supplement them with implicit signals from user behavior:

  • Click-through rate (CTR): If users consistently click the third result instead of the first, the third result might deserve a higher rank.
  • Dwell time: Clicking a result and staying for 5 minutes suggests satisfaction. Clicking and immediately returning (pogo-sticking) suggests dissatisfaction.
  • Reformulation: If a user immediately searches for a rephrased query, the original results were probably not satisfying.

These signals are noisy. Position bias means users click higher results simply because they see them first. Attractive snippets can inflate CTR for poor documents. But at scale, these biases can be modeled and corrected.

Interleaving Experiments

The gold standard for evaluating search quality is interleaving: show users results from two different ranking systems mixed together, then observe which system's results get more clicks. This is more sensitive than traditional A/B testing because both systems are compared on the same queries for the same users, reducing variance.

The TREC Evaluation Conferences

The Text REtrieval Conference (TREC), organized by NIST, has been running annual evaluation campaigns since 1992. Researchers submit their systems, which are evaluated on shared test collections with pooled relevance judgments. TREC tracks have covered web search, question answering, medical search, deep learning-based retrieval, and many other domains. Much of what we know about retrieval effectiveness comes from TREC evaluations.

11. Building a Search Engine with Elasticsearch

Let us put the theory into practice. Elasticsearch, built on Apache Lucene, is the most widely deployed open-source search engine. It uses BM25 by default, supports distributed indexing and querying, and provides a rich query DSL.

Creating an Index

First, define an index with custom mappings and analyzer settings:

PUT /articles
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "boost": 2.0
      },
      "body": {
        "type": "text",
        "analyzer": "custom_analyzer"
      },
      "url": {
        "type": "keyword"
      },
      "published_date": {
        "type": "date"
      },
      "domain_authority": {
        "type": "float"
      },
      "tags": {
        "type": "keyword"
      }
    }
  }
}

The custom_analyzer applies three token filters: lowercase (normalizes case), stop (removes common words like "the," "is," "at"), and snowball (stems words so "running" and "runs" match "run"). The title field gets a boost of 2.0, meaning a match in the title is worth twice as much as a match in the body.

Indexing Documents

POST /articles/_doc/1
{
  "title": "Understanding TLS 1.3 and HTTPS",
  "body": "Transport Layer Security version 1.3 is the latest version of the protocol that secures HTTPS connections. It reduces handshake latency from two round trips to one, removes support for legacy cipher suites, and encrypts more of the handshake itself...",
  "url": "https://example.com/tls-1-3-explained",
  "published_date": "2026-03-15",
  "domain_authority": 72.5,
  "tags": ["security", "networking", "protocols"]
}
 
POST /articles/_doc/2
{
  "title": "A Beginner's Guide to Web Security",
  "body": "Web security encompasses many topics including HTTPS, content security policy, cross-site scripting prevention, and authentication best practices...",
  "url": "https://example.com/web-security-guide",
  "published_date": "2026-01-20",
  "domain_authority": 45.0,
  "tags": ["security", "web"]
}

Basic BM25 Query

GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "how does HTTPS work",
      "fields": ["title^2", "body"],
      "type": "best_fields"
    }
  }
}

The multi_match query searches across both title and body. The ^2 on title means title matches are boosted by a factor of 2. The best_fields type uses the score from the best-matching field (rather than summing across fields).

Combining Text Search with Filters and Function Scores

In practice, you rarely use BM25 alone. You combine it with filters (to restrict results to a subset) and function scores (to incorporate non-text signals):

GET /articles/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "multi_match": {
              "query": "HTTPS TLS encryption",
              "fields": ["title^2", "body"]
            }
          },
          "filter": [
            { "term": { "tags": "security" } },
            { "range": { "published_date": { "gte": "2025-01-01" } } }
          ]
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "domain_authority",
            "factor": 0.1,
            "modifier": "log1p",
            "missing": 1
          }
        },
        {
          "gauss": {
            "published_date": {
              "origin": "2026-04-04",
              "scale": "90d",
              "decay": 0.5
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}

This query does several things:

  1. Text matching: BM25 on "HTTPS TLS encryption" across title and body.
  2. Filtering: Only documents tagged "security" and published after 2025-01-01.
  3. Domain authority boost: Uses log1p of the domain_authority field as a multiplier. A document with domain_authority 80 gets a bigger boost than one with 20.
  4. Freshness decay: A Gaussian decay function centered on today's date with a 90-day scale. Documents published recently score higher than older ones. A document from 90 days ago gets half the freshness boost of one published today.

The score_mode: multiply means the two function scores are multiplied together. The boost_mode: multiply means the combined function score is multiplied with the BM25 text score.

Faceted Search

Elasticsearch also supports aggregations, which power faceted navigation (the sidebar filters you see on e-commerce sites):

GET /articles/_search
{
  "query": {
    "match": { "body": "search engine ranking" }
  },
  "aggs": {
    "by_tag": {
      "terms": {
        "field": "tags",
        "size": 10
      }
    },
    "by_year": {
      "date_histogram": {
        "field": "published_date",
        "calendar_interval": "year"
      }
    }
  }
}

This returns both the search results and aggregation counts: how many matching documents exist for each tag and each publication year. The user can then click a tag to narrow their search.

BM25 Tuning in Elasticsearch

Elasticsearch lets you customize the BM25 parameters per field:

PUT /articles
{
  "settings": {
    "index": {
      "similarity": {
        "custom_bm25": {
          "type": "BM25",
          "k1": 1.5,
          "b": 0.80
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "similarity": "custom_bm25"
      }
    }
  }
}

Increasing k1 from the default 1.2 to 1.5 means term frequency saturates more slowly (more credit for repeated terms). Increasing b from 0.75 to 0.80 means stronger length normalization (longer documents are penalized more). The right values depend on your corpus. Short, uniform-length documents (like tweets) benefit from low b. Heterogeneous collections (like web pages ranging from 100 to 100,000 words) benefit from higher b.

12. Putting It All Together: The Modern Search Stack

A modern search engine is not one algorithm. It is a stack of systems, each solving a different part of the problem:

┌─────────────────────────────────────────────────┐
│                   User Query                     │
├─────────────────────────────────────────────────┤
│  Query Understanding                             │
│  (spell check, expansion, intent, NER)           │
├─────────────────────────────────────────────────┤
│  Stage 1: Sparse Retrieval (BM25)                │
│  + Stage 1b: Dense Retrieval (bi-encoder ANN)    │
│  → Candidate set: ~1000 documents                │
├─────────────────────────────────────────────────┤
│  Stage 2: Re-ranking (cross-encoder or LTR)      │
│  → Re-ranked top ~100 documents                  │
├─────────────────────────────────────────────────┤
│  Stage 3: Business logic                         │
│  (diversity, freshness, dedup, personalization)   │
├─────────────────────────────────────────────────┤
│  Snippet generation + result rendering            │
├─────────────────────────────────────────────────┤
│                Search Results Page                │
└─────────────────────────────────────────────────┘

The trend in 2025 and 2026 is toward hybrid retrieval: running both sparse (BM25) and dense (vector) retrieval in parallel, merging the candidate sets, then re-ranking with a transformer model. This captures the best of both worlds. BM25 excels at exact term matching (product names, error codes, proper nouns). Dense retrieval excels at semantic matching (synonyms, paraphrases, conceptual similarity).

Scoring Approaches Compared

TF-IDF BM25 LambdaMART Neural (Cross-Encoder)
Type Unsupervised Unsupervised Supervised Supervised
Input Term frequencies Term frequencies 200+ features Raw text
Semantic understanding None None Indirect (via features) Deep
Speed Very fast Very fast Fast Slow
Training data needed None None Millions of judgments Millions of pairs
Handles synonyms No No Only via features Yes, natively
Length normalization Weak Strong Via features Implicit
Typical use Legacy systems First-stage retrieval Second-stage re-ranking Final re-ranking

What Google Actually Does

Google does not publish its full ranking system, but from papers, patents, and public statements, we know the broad strokes:

  1. Crawling: Googlebot, operating from data centres worldwide, crawls billions of pages daily. Priority is determined by PageRank, update frequency, and other signals.
  2. Indexing: The Google Search index is built on a custom system (not Lucene). It uses a distributed inverted index called Caffeine, which supports near-real-time indexing.
  3. Retrieval: BM25-like sparse retrieval combined with dense retrieval using neural embeddings.
  4. Ranking: A multi-stage pipeline. The first stage uses fast, simple models. Subsequent stages use increasingly expensive models, culminating in large transformer-based re-rankers.
  5. BERT integration: Since 2019, BERT models have been used to understand query intent and document relevance, particularly for long, conversational queries.
  6. MUM: Deployed for specific tasks like identifying information gaps across languages and understanding complex, multi-part queries.
  7. RankBrain: A machine learning system (predating BERT) that helps interpret unfamiliar queries by relating them to known query patterns.
  8. SpamBrain: A neural system for detecting spam and link manipulation, replacing many of the older rule-based spam filters.

The exact number of ranking signals is unknown. Google has mentioned "hundreds" of signals. These include text match scores, link analysis, user engagement, content quality, page experience (Core Web Vitals), mobile-friendliness, HTTPS usage, and many more.

Conclusion

Search engine ranking is one of those problems that looks simple on the surface and reveals extraordinary depth once you start digging. The journey from a basic TF-IDF bag-of-words model to a hybrid sparse-dense retrieval system with transformer re-rankers spans decades of research in information retrieval, distributed systems, and machine learning.

The core ideas are durable. Inverted indexes, first described in the 1960s, remain the backbone of text retrieval. BM25, published in the 1990s, is still the default first-stage scoring function in most search systems. PageRank, from 1998, established link analysis as a fundamental ranking signal. These are not legacy technologies waiting to be replaced. They are proven foundations that neural methods build on top of.

What has changed is the depth of understanding that search engines bring to both queries and documents. BM25 matches terms. BERT understands meaning. The gap between those two capabilities is the gap between a user searching for "how to fix a broken screen" and matching only documents containing those exact words, versus understanding that a document about "smartphone display repair guide" is equally relevant.

If you are building a search system today, the practical advice is: start with Elasticsearch and BM25, get your data pipeline right, measure with NDCG, then layer on dense retrieval and neural re-ranking where the metrics justify the complexity. The fundamentals will carry you further than you expect, and the fancy models only shine when the foundation is solid.

The web keeps growing. Queries keep getting more complex. The ranking problem is not solved. It is just getting more interesting.