← Back to Logs

How Database Indexes Actually Work

Try the interactive lab for this articleTake the quiz (6 questions · ~5 min)

Database indexes are often taught as a simple idea: they are like the index at the back of a book, and they make queries faster. That metaphor is useful for about thirty seconds. Real database indexes are mutable disk backed data structures living under concurrency, crash recovery, page splits, vacuum or purge, write ahead logging, and a cost based planner that sometimes chooses not to use them at all. If you want to understand why an index helps one query and hurts another, the book metaphor is no longer enough.

In practice an index is a separate structure that lets the database find rows without scanning the whole table. The structure is usually a B tree or one of its close relatives because B trees are built for block storage. They keep the tree shallow, pack many keys per page, and support ordered traversal as well as point lookup. PostgreSQL, InnoDB, SQLite, SQL Server, and Oracle therefore all rely heavily on B tree family structures.

This article explains how those indexes are laid out, how lookups work, how inserts trigger splits, what clustered versus secondary indexes really mean, why index only scans are not always truly heap free, and why every extra index makes writes slower. The point is to get from "add an index" to "understand what storage and planner behaviour you are buying when you add one".

The Problem A Table Scan Has To Solve

Imagine a table of customer invoices:

CREATE TABLE invoices (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    status TEXT NOT NULL,
    total_cents BIGINT NOT NULL
);

If you run:

SELECT * FROM invoices WHERE customer_id = 48291;

and there is no index on customer_id, the database may need to inspect every row. On a small table this is fine. On a table with hundreds of millions of rows, it means reading many pages from disk or cache just to discard almost all of them.

The core problem is not comparing one value. Comparing is cheap. Finding where to compare is expensive. Storage is organised into pages, and rows are spread across those pages according to insertion history, clustering policy, and vacuum side effects. Without a supporting structure, the database does not know which subset of pages contain the target rows.

An index exists to answer that routing problem.

Why B Trees Match Disk And SSD Storage Well

A binary search tree in memory branches on each comparison and can grow deep. That is fine for pointers in RAM. On storage it is awful because each level may require another page access.

B trees solve this by using wide nodes. Instead of two children per node, a page may contain hundreds of keys and pointers. That keeps the tree shallow even for very large tables.

Suppose a page can hold 300 separator keys. Then a three level tree can address roughly:

300 * 300 * 300 = 27,000,000

leaf positions, and a four level tree can address billions. This is why real database indexes stay shallow. Even large tables often need only a handful of page reads from root to leaf.

The structure typically has:

  • a root page
  • zero or more internal pages
  • many leaf pages

Internal pages hold separator keys and child pointers. Leaf pages hold actual key entries and row references, or the rows themselves in clustered designs.

A Page Is The Unit That Really Matters

Databases do not manipulate indexes entry by entry at the storage level. They manipulate pages, often 8 KiB or 16 KiB. The page is the unit of IO, locking granularity in some engines, and write ahead logging.

A simplified leaf page contains:

  • page header
  • slot directory
  • key entries
  • free space region
  • sibling pointer for leaf to leaf navigation

The slot directory lets entries move inside the page without changing logical slot references. This matters because variable length keys and row pointers rarely fit a perfectly rigid layout.

A highly simplified view:

+----------------------+
| page header          |
| slot dir             |
| slot 1 -> key A      |
| slot 2 -> key B      |
| slot 3 -> key C      |
| free space           |
| key A payload        |
| key B payload        |
| key C payload        |
+----------------------+

Understanding that pages are real units of storage explains several later behaviours:

  • page splits are expensive because they rewrite structural metadata
  • random inserts fragment page free space
  • fillfactor settings trade density for future insert headroom
  • index bloat is page level slack accumulating over time

The page model also explains why engine settings matter. A larger page can hold more separator keys, which can reduce tree depth, but it also means each split moves more data and each cache miss pulls in a larger chunk. Compression changes fan out again. Prefix compression in internal nodes can make the tree shallower for long textual keys. These are storage design choices, not small implementation details.

Heap Pages And Row Identifiers Are The Other Half Of The Lookup

An index only makes sense relative to the base table storage. In a heap organised table, the table itself is stored as data pages containing tuples in whatever physical order history has produced. The index therefore does not usually point to "row number 48291". It points to a physical location such as a tuple identifier or record pointer.

That indirection is why heap lookups can become random IO heavy. If the index says matching tuples live on heap pages 8, 104, 900, and 1402, the executor may need to jump all over the relation to fetch them. When those heap pages are already in cache this is manageable. When they are not, latency grows quickly.

This also explains why clustering a table on an index can help certain workloads. If rows with nearby index keys are also physically nearby in the heap, the follow up table reads become much less scattered. Databases are not only comparing keys. They are navigating physical storage.

Point Lookup Walks Root To Leaf

If there is an index on customer_id, a lookup roughly works like this:

  1. start at the root page
  2. compare the target key against separator keys
  3. choose the child pointer whose key range contains the target
  4. repeat until a leaf page is reached
  5. search inside the leaf for matching entries

Because pages contain many sorted keys, each step narrows the search range dramatically. Once at the leaf, the engine either:

  • follows row pointers to the heap or base table
  • or returns the row directly if the index is clustered and stores row data in leaf order

That point lookup is the basic win of the index. Instead of touching thousands of table pages, the engine may touch one root page, one or two internal pages, one leaf page, and then a small number of data pages.

Range Scans Are Why Order Matters

B tree indexes are not only for equality lookups. Their leaves are ordered, often with sibling links between adjacent leaf pages. That makes range scans efficient:

SELECT id, created_at
FROM invoices
WHERE created_at >= '2026-04-01'
  AND created_at <  '2026-05-01'
ORDER BY created_at;

If there is an index on created_at, the database can:

  1. find the first matching leaf entry
  2. walk forward through leaf pages in order
  3. stop when the upper bound is passed

This is much better than scanning and sorting the whole table. It is also why index key order matters. An index on (status, created_at) helps some queries and not others depending on whether the search predicates constrain the leftmost prefix.

Composite Indexes Follow Left To Right Prefix Rules

Suppose we create:

CREATE INDEX idx_invoices_status_created
ON invoices (status, created_at);

The leaf keys are ordered first by status, then by created_at within each status bucket. This helps:

WHERE status = 'paid'
  AND created_at >= ...

because the engine can jump into the paid region and then run a range scan on time.

It is much less useful for:

WHERE created_at >= ...

because the rows are not globally ordered by created_at. They are grouped by status first. This left prefix rule is one of the most important practical index design constraints, and it falls directly out of how keys are physically ordered in the B tree.

It also explains why adding one more column to an index is not automatically harmless. That extra column may make one report query faster while making the whole structure wider and less cache friendly for all other workloads that touch it.

Cardinality Estimates Decide Whether The Planner Trusts The Index

An index can exist, be structurally healthy, and still lose to a table scan because the planner believes too many rows will match. That belief comes from statistics.

Typical statistics include:

  • estimated distinct value count
  • most common values
  • histogram buckets
  • null fraction
  • correlation with physical table order

If status = 'paid' matches 92 per cent of the table, an index on status may be close to useless for that predicate alone. If customer_id = 48291 matches three rows, the same style of lookup is highly selective and the index becomes very attractive.

When statistics are stale, plan quality suffers. The planner is a cost estimator, not a mind reader. If yesterday the paid status covered 10 per cent of rows and today it covers 90 per cent, the old estimate may keep pointing the optimiser toward a bad access path until statistics are refreshed.

Clustered And Secondary Indexes Change What The Leaf Holds

Different engines store different things in leaf pages.

In InnoDB, the primary key is clustered. The leaf pages of the primary key B tree contain the full row. Secondary indexes store their indexed columns plus the primary key value, not a physical heap pointer. A secondary lookup therefore often means:

  1. search the secondary index
  2. read the primary key from the secondary leaf entry
  3. search the clustered primary index to fetch the row

In PostgreSQL, tables are heaps and B tree indexes store key values plus tuple identifiers pointing into heap pages. That means a normal index scan often touches:

  1. index pages
  2. heap pages

This difference matters. InnoDB's clustered design can make primary key range scans locality friendly. PostgreSQL's heap plus separate index design allows more flexible physical row storage and HOT updates, but ordinary index lookups may need extra heap access.

Bitmap And Combined Index Plans Sit Between Pure Scan Types

The planner is not limited to choosing either one simple index scan or one sequential table scan. Some engines, especially PostgreSQL, can combine multiple index results into a bitmap structure and then visit heap pages in a more ordered way.

For example, if there are separate indexes on customer_id and status, a query with both predicates may:

  1. scan each index separately
  2. build a bitmap of matching row locations
  3. visit the heap pages in page order

This can outperform a naive index scan when many rows match but the engine still wants to avoid a full table sweep. It is another reminder that indexes participate in a planner search space rather than acting as a hard on or off switch.

Covering Indexes Reduce Base Table Access

A covering index contains all the columns needed by a query, either as key columns or included payload columns. If the planner can satisfy the query from the index alone, it may skip many base table reads.

Example:

CREATE INDEX idx_invoice_customer_created
ON invoices (customer_id, created_at)
INCLUDE (status, total_cents);

Then:

SELECT created_at, status, total_cents
FROM invoices
WHERE customer_id = 48291
ORDER BY created_at;

may be satisfied directly from the index.

This is powerful because index pages are narrower than full table rows and sorted exactly for the access pattern. The trade off is index size. Every included column makes the index wider, reducing fan out and increasing write cost.

In PostgreSQL, index only scans also depend on visibility map state. Even if all required columns are in the index, the executor may still check the heap unless it knows the heap page is all visible to current transactions.

Fillfactor And Page Density Are Deliberate Tuning Choices

If every page were packed to absolute maximum density, random inserts would trigger page splits constantly. Many engines therefore leave some free space intentionally, either through default behaviour or explicit settings such as fillfactor.

The trade off is direct:

  • denser pages mean smaller indexes and better cache efficiency right now
  • looser pages mean more headroom for future inserts and updates

On a mostly append only table, very dense pages may be fine. On a table with frequent mid key inserts, leaving free space can reduce structural churn significantly. Storage engines are therefore constantly balancing current compactness against future write cost.

Page density also interacts with key width. A wide composite index with long text columns fits far fewer entries per page than a narrow integer key. Fewer entries per page mean lower fan out, more pages, and potentially greater tree depth. That is one reason expression and covering indexes must justify themselves carefully.

Inserts Cause Page Splits

B trees stay balanced by splitting pages when they fill. If a leaf page has no room for a new key, the engine:

  1. allocates a new page
  2. moves roughly half the entries to the new page
  3. inserts a separator key into the parent
  4. possibly splits the parent if it also fills

This preserves search correctness, but it is expensive compared with inserting into free space on an existing page. Random insert patterns into non monotonically increasing keys create many scattered splits and more fragmented storage.

Sequential keys behave differently. If the inserted key is always growing, such as an increasing surrogate ID, most inserts land near the rightmost leaf. That can reduce random splits, though it may create hot page contention in some workloads.

This is one reason UUID version choice and primary key design affect database behaviour beyond semantics. A fully random key spreads writes. A monotonic key concentrates them. Each pattern has consequences for splits, cache locality, and concurrency hotspots.

Deletes And Updates Leave Structural Slack

Indexes do not automatically become compact after deletes. Deleting entries frees logical space, but pages may remain partially empty. Repeated churn can therefore create index bloat.

Updates can be even trickier. In PostgreSQL, if an indexed column changes, new index entries are created for the new tuple version. Old entries remain until vacuum can clean them. In InnoDB, updates may modify clustered layout or secondary entries depending on which columns changed.

Over time this means:

  • pages with low useful density
  • larger on disk index size
  • more pages to read for the same logical key range
  • worse cache efficiency

This is why maintenance operations exist:

  • VACUUM and REINDEX in PostgreSQL
  • purge and rebuild operations in InnoDB and other engines

The index is not just a static lookup map. It is a living structure constantly adjusted under transactional rules.

Different engines expose this through different maintenance language. PostgreSQL users see vacuum, visibility maps, and reindex operations. InnoDB users see purge threads, page merge behaviour, and online rebuild features. The tooling differs, but the underlying reality is shared: long term churn leaves partially used pages behind, and partially used pages waste memory and IO.

MVCC Makes Visibility Part Of The Access Cost

On multi version systems, reading a row is not only about finding its physical location. The executor may also need to decide whether the row version is visible to the current transaction snapshot.

This is especially visible in PostgreSQL. An index entry may point to a heap tuple, but the executor still checks whether that tuple version is visible. If the corresponding heap page is marked all visible in the visibility map, an index only scan can often avoid heap access. If not, the heap still has to be consulted even when all requested columns are present in the index.

That means MVCC state directly affects whether a theoretically covering plan is truly heap free. Readers often assume index only scan means "the heap is never touched". In practice the answer depends on visibility metadata and recent write churn.

Partial And Expression Indexes Materialise Specific Access Patterns

Not every useful index covers every row in raw column form. Some engines support partial indexes, which store only rows matching a predicate, and expression indexes, which store the result of a computed expression.

Examples:

CREATE INDEX idx_paid_invoices
ON invoices (created_at)
WHERE status = 'paid';
 
CREATE INDEX idx_lower_email
ON customers ((lower(email)));

The first index is smaller because it omits rows outside the paid subset. The second supports case insensitive lookup without recomputing the expression on every scan. Both are powerful because they materialise a specific access pattern rather than blindly indexing raw base columns.

The cost is complexity. The planner must prove the query predicate matches the partial index condition, or that the search expression matches the indexed expression exactly enough to use the structure safely. If that proof fails, the index may exist and still be unusable for the query.

The Planner May Ignore The Index On Purpose

Users often ask why the database "is not using the index". The answer is frequently that an index is not actually cheaper for that query.

If a predicate matches a large fraction of the table, using the index may mean:

  1. scanning many index pages
  2. following many pointers into table pages
  3. paying random access cost repeatedly

At some point a sequential table scan becomes cheaper, especially on modern storage where sequential reads are efficient and the data may already be cached.

The planner estimates selectivity using statistics such as:

  • value histograms
  • null fractions
  • distinct count estimates
  • correlation between physical order and key order

When those estimates are wrong, the planner may choose badly. But the principle is sound. An index is not always faster. It is faster when it lets the engine avoid enough work to compensate for its own traversal and lookup cost.

Correlation statistics matter here as well. If heap rows are stored in an order close to the index key, following index entries back to the base table may still touch nearby pages and remain efficient. If the physical table order has drifted badly, the same logical plan can degrade into expensive scattered reads.

Write Cost Is The Price Of Every Extra Index

Indexes speed reads by adding write work. On insert, update, or delete, every affected index must also be updated. That means:

  • extra page reads and writes
  • more WAL volume
  • more page splits
  • more memory pressure in the buffer cache
  • longer commit paths under heavy write workloads

If a table has six indexes and an insert touches all of them, one logical row write becomes many physical structure updates. This is why "just index everything" is bad advice. Every index must justify its read side benefit against its write side cost.

This also shows up outside the primary database host. More index maintenance means more WAL or redo volume, more bytes streamed to replicas, more archive storage, and longer recovery replay. Read acceleration on the primary frequently becomes write amplification everywhere else in the system.

B Trees Are Dominant, But They Are Not The Only Index Type

B trees win the default role because they handle equality, range scans, and ordered traversal well. But database engines often offer other structures for specialised workloads:

  • hash indexes for direct equality lookup
  • GIN style inverted indexes for arrays, documents, or full text terms
  • GiST and R tree families for geometric or spatial searches
  • BRIN style summarising indexes for very large naturally ordered tables

The existence of these alternatives is useful context because it explains why B tree rules are powerful but not universal. If the workload is "find rows whose JSON document contains this key" or "find shapes intersecting this bounding box", a standard B tree may not match the predicate well at all. Good database design therefore starts with access pattern analysis, not with the assumption that every query should map to one ordinary B tree.

Concurrency Control Reaches Into The Index Structure Too

Indexes are updated under concurrent transactions, so the storage engine has to coordinate access carefully. During insertion, one session may be splitting a page while another is scanning the same branch. During uniqueness checks, two sessions may race to insert the same key. During deletion and vacuum, pages may be reclaimable but still visible to some transactions.

Different engines solve this with different locking and visibility schemes, but the common requirement is the same: readers must navigate a structurally changing tree without seeing corruption, and writers must preserve key ordering and transactional correctness.

That is another reason B tree implementation code is far more complicated than the classroom data structure. A textbook tree rarely has to deal with crash recovery records, concurrent page splits, version visibility, or long running snapshots. A production database index does all of those at once.

Unique Indexes Enforce Constraints Through Structure

A unique constraint is usually implemented with a unique index. During insert or update, the engine probes the index to ensure no conflicting key already exists. Because the structure is ordered and narrow, this check is efficient.

Concurrency complicates it. Two sessions might try to insert the same key at the same time. The engine must therefore combine the index lookup with appropriate locking or conflict detection so only one wins. Uniqueness is not just a planner feature. It is part of the transactional correctness story.

A Lookup Walk Through Shows Why Selective Indexes Feel Fast

Suppose the invoices table has 200 million rows and a B tree index on (customer_id, created_at). A query asks for the ten newest invoices for one customer:

SELECT id, created_at, total_cents
FROM invoices
WHERE customer_id = 48291
ORDER BY created_at DESC
LIMIT 10;

Without the index, the engine may need to scan huge parts of the table, filter rows, sort the matches, and then take ten. With the composite index, the work changes dramatically.

The executor can:

  1. descend from root to the leaf region for customer_id = 48291
  2. walk that leaf range in descending created_at order
  3. stop after ten qualifying entries
  4. fetch the base rows if needed

That is what makes the query feel fast. It is not because comparing integers is magically accelerated. It is because the database has already materialised the ordering that the query wants. The executor navigates directly to the right slice of the key space and stops early.

The LIMIT is crucial here. Ordered indexes become especially valuable when they let the engine avoid touching the rest of the matching range. If the query asked for every invoice that customer ever had, the index would still help, but the difference would be smaller because much more of the leaf range would still need to be traversed.

An Insert Walk Through Shows Where The Write Cost Comes From

Now take an insert:

INSERT INTO invoices (id, customer_id, created_at, status, total_cents)
VALUES (900000001, 48291, NOW(), 'pending', 12900);

If the table has:

  • a primary key index on id
  • a secondary index on customer_id
  • a composite index on (customer_id, created_at)
  • a partial index on pending invoices

then one logical row insert fans out into several physical actions.

The engine has to:

  1. locate the target leaf page in each relevant index
  2. confirm uniqueness where required
  3. insert the new key entry or split the page if it is full
  4. write WAL or redo records for each structural change
  5. dirty the affected buffer pages so they will later flush to storage

That is the hidden cost behind "the query got faster after I added indexes". Reads became cheaper because writes became more expensive and because more storage structures now have to stay in sync with the table. On a write heavy workload this matters more than many teams expect.

The effect is even clearer on replicas. Every extra index update becomes extra redo or WAL volume shipped and replayed elsewhere. One decision on the primary data model propagates into storage, network, and recovery behaviour across the whole database fleet.

Maintenance Work Exists Because Indexes Age

An index that was ideal on the day it was created may become less ideal months later. Workload shape changes. Data distribution shifts. Deletes and updates create bloat. Correlation with the heap weakens. Statistics go stale. Put differently, indexes age.

Production databases therefore need maintenance tasks and review loops rather than one time schema tuning. Typical maintenance work includes:

  • refreshing planner statistics
  • vacuuming dead tuples and obsolete index entries
  • rebuilding bloated indexes
  • dropping indexes that are no longer used
  • adding narrower or more selective replacements when query patterns change

This is not cosmetic housekeeping. A bloated index consumes more cache, more disk, and more replay bandwidth on replicas. A stale index that no important query uses still taxes every write. Good index management is therefore partly a storage hygiene practice and partly a workload review practice.

Teams often discover this late. They add indexes aggressively during incidents, keep them forever, and then wonder why writes are slower and storage has grown. The answer is often not one broken query. It is years of accumulated structures that were each locally reasonable and globally expensive.

Hot And Cold Data Behave Very Differently Under Indexing

Not every table has one stable access pattern. Some tables are append heavy and mostly queried for recent rows. Others are updated frequently across their full key space. Others are archival and read in large batches. The right indexing strategy depends heavily on which of those regimes the table actually lives in.

For recent hot data:

  • ordered indexes on time or surrogate identifiers often work well
  • BRIN or partition aware strategies may become attractive at very large scale
  • page locality can be strong because writes cluster naturally

For churn heavy data:

  • secondary indexes pay a larger ongoing maintenance cost
  • HOT style optimisations matter if the engine supports them
  • bloat and fragmentation accumulate faster

For cold archival data:

  • large range scans may dominate
  • sequential scan economics improve
  • some highly selective secondary indexes may no longer justify their write side cost if the table is mostly immutable

This is why blanket advice such as "always index foreign keys" or "always add a covering index for every dashboard query" eventually fails. A good index is one that matches the actual thermal profile and access pattern of the data, not one that sounds generally useful in isolation.

Partitioning Changes How Index Design Scales

Once tables grow very large, many systems partition them by time range, customer, region, or some other key. Partitioning changes the indexing problem because the planner may be able to prune whole partitions before it even consults local indexes.

That can be powerful. If invoices are partitioned by month and the query only touches April 2026, the engine may ignore the other partitions entirely. At that point the local index on each remaining partition has much less work to do.

But partitioning also multiplies structure count:

  • each partition may need its own local indexes
  • maintenance now happens per partition
  • statistics quality matters at both partition and global planning levels

A partitioned design can therefore improve scale while increasing operational complexity. You are no longer maintaining one B tree. You are maintaining many B trees plus the planning logic that decides which of them matter for each query.

This again reinforces the main theme of the article. An index is not an isolated hint. It is part of a larger storage strategy that includes physical layout, maintenance policy, statistics, and workload shape.

The Buffer Cache Decides Whether An Index Feels Fast Or Theoretical

An index is a storage structure, but user visible performance depends heavily on whether its hot pages stay in memory. If the root page, upper internal pages, and frequently touched leaf regions live in the buffer cache, lookups can feel extremely cheap. If those pages are constantly evicted under memory pressure, the same logical design may degrade sharply.

This is one reason narrow, selective indexes often perform so well. They fit more useful entries per page and therefore keep more of the active search surface resident in memory. Wide covering indexes can still be worthwhile, but they consume more cache for the same number of logical keys.

The buffer cache perspective also explains why one database can report excellent indexed performance in staging and disappointing performance in production. In staging, the active working set may fit comfortably in memory. In production, competing tables, larger scans, reporting jobs, and write churn may constantly evict index pages that the hot path expected to find resident.

An index is therefore not only a disk structure. It is also a claim on memory. When teams design many large secondary indexes, they are deciding not just what exists on disk but what they hope the buffer manager will keep warm under real traffic.

Index Order Influences Sort Avoidance And Merge Strategy

People often think about indexes only in terms of filtering, but ordered indexes are just as important for avoiding explicit sort work.

If a query requests:

SELECT id, created_at
FROM invoices
WHERE customer_id = 48291
ORDER BY created_at DESC
LIMIT 20;

then an index ordered by (customer_id, created_at) can satisfy both the filter and the order. The executor may avoid a separate sort node entirely. That matters because sorting large result sets is expensive in CPU, memory, and sometimes temporary disk spill.

This also feeds into join planning. If two relations can be produced in compatible sorted order, merge joins become more attractive. The value of an index is therefore not limited to point lookup speed. It can change the entire operator tree the planner considers viable.

Index design often has to start from actual query shapes, including ORDER BY, GROUP BY, and join predicates, not only from the WHERE clause. The database is trying to minimise total work across filtering, ordering, joining, and row fetching. Good indexes help the whole path, not just the first predicate.

Foreign Keys, Joins, And Referential Workloads Change The Cost Balance

Indexes are especially important on join keys and foreign keys because relational workloads spend so much time connecting tables rather than reading one table in isolation.

Suppose invoice_items.invoice_id references invoices.id. Without a supporting index on the referencing side or the frequently joined side, common operations become painful:

  • joining parent and child tables
  • deleting or updating parent rows under foreign key checks
  • aggregating child rows by parent

This is one reason schema design and index design cannot be separated cleanly. Referential structure creates predictable access paths. If the application constantly asks for one invoice and all of its items, the supporting child side index is not an optional micro optimisation. It is part of the relational design.

At the same time, blindly indexing every foreign key without regard to workload can still overshoot. Some relations are tiny. Some tables are archival. Some join patterns never occur on the hot path. The same principle still holds: the index must justify its maintenance cost under the real workload.

Storage Engine Differences Mean The Same SQL Advice Can Mislead

A lot of indexing advice is stated as if every relational database behaved the same. In practice the storage engine changes important details:

  • PostgreSQL heap plus visibility model behaves differently from InnoDB clustered storage
  • SQLite page and journalling behaviour differs again
  • SQL Server and Oracle have their own page management and planner habits

So a rule that is sensible in one engine can be incomplete in another. "Make it a covering index" means one thing when included columns are cheap and visibility rules still require heap checks, and another thing in an engine where the clustered structure already carries row data differently. "Cluster by primary key" also means different things depending on whether the table is inherently heap organised or not.

This does not make indexing arbitrary. It means the mental model must stay grounded in the actual engine implementation. SQL syntax can look portable while storage behaviour is not. Serious database work always returns from the query text back down to page layout, MVCC, cache behaviour, and planner rules for the specific engine in use.

A Bad Index Can Be Correct And Still Be Harmful

One of the most important practical lessons in database work is that an index can be logically valid and still make the system worse. If a query runs slightly faster after adding an index, that does not automatically mean the index is worth keeping.

A harmful index may:

  • overlap heavily with a better existing index
  • serve only a rare report query while taxing every insert
  • be so low selectivity that the planner almost never uses it
  • duplicate ordering already provided by a clustered structure
  • consume cache and storage that more valuable indexes need

This is why mature teams review index usage over time rather than treating schema additions as permanent victories. The relevant question is not "did this index ever help one query". The relevant question is "does this structure earn its write, memory, and maintenance cost across the workload that actually exists now".

Seen this way, index design starts to resemble budgeting. Every new index spends:

  • disk space
  • buffer cache share
  • WAL or redo bandwidth
  • write latency headroom

If the return on that spend is weak, the index is not neutral. It is actively stealing budget from more valuable structures.

Query Shape Often Matters More Than Table Size

Developers often ask whether a table is "big enough to need an index". That is the wrong first question. A modest table with an unfortunate hot path query can benefit greatly from an index. A very large table can still justify a sequential scan for broad analytical reads.

What matters first is query shape:

  • is the predicate highly selective
  • does the query need ordered output
  • is there a small LIMIT
  • are joins forcing repeated lookup by one key
  • are only a few columns needed

A small but latency sensitive lookup table serving a web request on every page load may need excellent indexing because the query sits on the critical path. A huge archival table read once nightly for a batch report may not need additional indexes at all if the access pattern is naturally sequential.

This is another reason the README style of deep explanation matters here. "Indexes are for big tables" sounds convenient, but it is a poor engineering rule. Indexes are for access patterns that benefit enough from ordered navigation to justify the extra structure.

Partition Locality And Time Based Data Often Change The Best Choice

Operational databases frequently collect time series like records: invoices, events, payments, jobs, logs, measurements. Those workloads often have strong temporal locality. Most reads concern recent ranges. Most writes append near the newest range. Older partitions become colder and more stable.

That pattern changes index economics. A recent partition may justify several carefully chosen indexes because the hot operational queries need them. Older partitions may need fewer indexes, denser storage, or even different maintenance policy because they are read rarely and updated almost never.

This is one reason time partitioning and retention policy can be more effective than endlessly tuning one giant monolithic B tree. Instead of forcing one structure to serve hot writes, hot reads, and cold history all at once, the database can keep those thermal zones separate.

The result is not that indexes become less important. It is that index design becomes inseparable from lifecycle design. How long does data stay hot. When does it become read mostly. When does it become archival. Good indexing answers all of those questions indirectly because every page structure is really a bet on future access patterns.

Execution Plans Are The Only Honest Way To Judge Index Use

Because planners make cost based choices, guessing whether an index helps is rarely enough. The reliable way to understand what happened is to inspect the execution plan and compare estimated versus actual behaviour.

An execution plan shows things such as:

  • whether the planner chose a sequential scan, index scan, bitmap scan, or index only scan
  • where sorting occurred
  • how many rows were expected versus actually produced
  • whether joins used nested loop, hash, or merge strategies
  • whether temporary memory or disk spill appeared

This matters because index tuning often fails when engineers optimise by intuition alone. A query may look like it should benefit from one new index, but the plan can show that the true bottleneck is elsewhere: an underestimated join fan out, a sort on another relation, or an expression that prevents the predicate from matching the index as written.

Good index work therefore always loops through the plan. The structure exists to change the operator tree the planner chooses. If the operator tree did not improve, the index did not accomplish what was intended no matter how sensible it sounded during design.

Real Index Tuning Usually Ends With Fewer Structures, Not More

When teams first start tuning a slow database, the instinct is usually additive: create another index for this query, another for that report, another for one more dashboard. After enough iterations the schema fills with overlapping structures that each looked justified in isolation.

Mature tuning often goes the other way. After measuring workload and reading plans carefully, engineers consolidate toward fewer, better chosen indexes that serve several important paths at once. A well ordered composite index may replace several narrow single column indexes. A partial index may replace a much larger general one. An obsolete reporting index may be dropped entirely once the report moves to a warehouse.

That pattern makes sense if you remember the economics:

  • every index consumes storage
  • every index competes for cache
  • every index slows writes
  • every index adds maintenance work

So the goal is not to maximise the count of potentially useful indexes. The goal is to materialise the smallest set of orderings and access paths that most improves the actual workload. Good database design often looks sparse because each surviving structure is carrying a clear and justified share of the work.

Monitoring Index Health Is Part Of Normal Operations

Because indexes age and workloads shift, production teams usually need some routine observability around them. Useful signals include index size growth, scan counts, cache hit ratios, bloat indicators, write amplification, and whether the planner has effectively stopped using a structure that still costs every write.

Without that feedback loop, schemas drift toward historical accidents. With it, index work becomes an operational discipline instead of a one time migration habit.

Index Design Is Ultimately About Choosing What To Materialise

Every index is a decision to materialise one useful ordering or lookup path ahead of time. The database pays the storage and maintenance cost now so future queries can avoid rebuilding that path from scratch with scans, sorts, and repeated comparisons.

Seen that way, indexing is less mysterious. You are deciding which parts of future query work should be precomputed into durable page structures and which should be left for the executor to do on demand.

Most Index Mistakes Come From Treating Them As Abstract Hints

Many bad indexing decisions start with a subtle category error. Engineers treat an index as a lightweight planner hint instead of as a real storage structure with pages, cache footprint, write cost, maintenance overhead, and lifecycle consequences.

Once you correct that mental model, better decisions follow naturally. You stop asking only "will this query use the index" and start asking "is this ordering important enough that I want to materialise and maintain it continuously".

Good Indexes Usually Reflect Repeated Business Questions

In production systems, the most durable indexes are often the ones that correspond to stable business questions. Which invoices belong to this customer. Which unpaid rows are due today. Which events occurred in this time range. Which child rows belong to this parent.

Those questions recur day after day, so materialising the corresponding access path keeps paying back its maintenance cost. By contrast, one off reporting curiosities rarely deserve permanent operational structures in the primary database.

That framing is useful because it ties storage design back to product reality. Indexes are not only technical artefacts. They are durable answers to repeated questions the system expects to ask quickly.

That is also why index reviews work best when query logs, execution plans, and product behaviour are read together rather than in isolation.

Without that linkage, teams tend to keep historical indexes long after the underlying question stopped mattering to the product.

At that point the structure has become technical debt in page form: still physically present, still taxing writes, and no longer earning its place.

That is a useful way to think about old indexes because it makes the cleanup decision easier. Dropping one is not undoing good engineering. It is removing a storage obligation that no longer pays for itself.

Index Review Is Really A Form Of Workload Review

When a team audits indexes properly, it is rarely only auditing storage structures. It is usually rediscovering what the application actually asks the database to do, which paths are hot, which reports no longer matter, and which old assumptions have drifted away from reality.

The best index reviews often improve both schema design and query design at the same time for that reason.

It is about understanding demand.

The Right Mental Model

The right mental model is that a database index is a separate ordered page structure optimised to route lookups quickly. The B tree stays shallow because each page holds many keys. Lookups walk root to leaf. Range scans work because leaves are ordered. Composite indexes follow their left to right key order. Clustered designs store rows at the leaves, while secondary designs often store pointers or primary keys. Covering designs trade larger index size for fewer table reads.

Once you think in pages rather than metaphors, a lot of familiar behaviour becomes obvious:

  • page splits explain write amplification
  • leaf order explains range scan performance
  • included columns explain covering scan size trade offs
  • secondary lookups explain extra base table reads
  • bloat explains why a once fast index may slowly become less efficient

An index is not free metadata. It is a full storage structure with its own lifecycle, maintenance burden, concurrency rules, and planner cost model.

Good index design is not just "pick the filtered column". It is a storage decision. You are choosing which key order to materialise on disk, which queries deserve faster navigation, how much write cost you are willing to accept, and how much extra space you are willing to carry so the planner has a better route than scanning the whole table.