You add a bio column to your users table. A few months later, some users have written
essays in there: 50KB, 100KB. Queries that used to return in single-digit milliseconds
are now taking hundreds. The table hasn't grown that much. The indexes are fine. Nothing
obvious in EXPLAIN QUERY PLAN. What's happening is that SQLite has been quietly
allocating overflow pages every time one of those large rows was written, and reading
them back is costing you far more than you'd expect.
The Problem
SQLite stores everything (rows, index entries, internal metadata) in fixed-size pages. The default page size is 4096 bytes. Every row in a table is a "cell" within one of those pages. The rule is simple: a cell must live within a page.
But what happens when a row's data is larger than a page? SQLite doesn't reject the write. It doesn't split the row across two B-tree pages either. Instead, it takes the portion of the row that doesn't fit and stores it in a completely separate chain of pages called overflow pages. These are then stitched back together at read time by following a linked list.
This mechanism is necessary; SQLite wouldn't be useful without it. But overflow pages come with a performance cost that compounds quietly in the background until, one day, a query that scans your table takes 25 times longer than it should.
Prerequisites
- Basic familiarity with how SQLite stores data (pages, B-trees)
- Understanding of what a table scan and an index seek are
- Awareness of I/O as a performance concern in databases
How SQLite Organizes Data
Before getting into overflow, it helps to understand what SQLite is working with normally.
A SQLite database file is a flat array of fixed-size pages. Every page is the same size, typically 4096 bytes, configurable from 512 to 65536 bytes at database creation time. Pages are numbered starting from 1. The entire database (table rows, indexes, internal tree nodes) is expressed through these pages.
The primary data structure is a B-tree. For each table, SQLite maintains a B-tree where leaf nodes hold the actual row data. Interior nodes hold keys and child pointers to navigate the tree. For indexes, there's a separate B-tree where leaves hold the indexed value alongside the rowid of the corresponding table row.
Within each page, individual rows (or index entries) are called cells. A page has a small header, an array of pointers to cells sorted by key, and the cell data itself packed from the bottom of the page upward. The layout looks roughly like this:
┌─────────────────────────────────────────────────┐
│ Page Header (8-12 bytes) │
│ Cell Pointer Array [ ptr1, ptr2, ptr3, ... ] │
│ │
│ (free space in the middle) │
│ │
│ [ Cell 3 ] [ Cell 2 ] [ Cell 1 ] │ ← packed from bottom
└─────────────────────────────────────────────────┘
Each cell holds the complete payload for one row: all column values concatenated together, preceded by a small header describing the type and length of each column. This is the normal, fast case: one page read gets you the entire row.
The Overflow Threshold
SQLite doesn't let a cell grow arbitrarily. When a row's payload would make the cell too large to fit within a single page, SQLite stores only the first portion of the payload in the cell itself and puts the rest into overflow pages.
The amount stored locally depends on the page size and whether it's a table or index page. For a 4096-byte page:
- Table leaf pages: roughly 4057 bytes stored locally per cell
- Index pages: roughly 1007 bytes stored locally per cell
Index pages have a much tighter limit because they need to fit many entries for the B-tree to remain shallow and fast. If you push a large value through an index (say, indexing a long JSON column), you'll hit overflow at around 1KB rather than 4KB.
The cell itself stores the total payload size, the local portion of the data, and a 4-byte pointer to the first overflow page. From the outside, the row looks complete: SQLite reconstructs it transparently. From a performance perspective, that transparency comes at a cost.
The Overflow Chain
When a row overflows, the excess data is stored as a linked list of overflow pages. These pages are completely separate from the B-tree. They have no page type flag, no cell pointers, no B-tree semantics at all. Each overflow page has an extremely simple structure:
Bytes 0-3: pointer to the next overflow page (0 if this is the last)
Bytes 4-end: raw payload data
For a 4096-byte page, that's 4092 bytes of payload per overflow page. A 10KB blob stored in a table with the default page size produces roughly:
- 4057 bytes stored in the cell (in the B-tree leaf page)
- 5943 bytes remaining, covered by 2 overflow pages (the second only partly full)
A 100KB blob produces:
- 4057 bytes in the cell
- ~95943 bytes remaining, spread across 24 overflow pages chained together
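These chain lengths can be derived from the spill formulas in the SQLite file-format documentation. One detail the simplified picture above glosses over: when a cell spills, SQLite actually keeps less than the maximum locally, sizing the local portion so the last overflow page comes out full; the overflow page counts still match. A sketch (the function name is mine; `reserved` is the per-page reserved-byte count, zero by default):

```python
import math

def overflow_layout(payload, page_size=4096, reserved=0):
    """Return (local_bytes, overflow_pages) for a table-leaf cell,
    using the spill formulas from the SQLite file-format documentation."""
    U = page_size - reserved          # usable bytes per page
    X = U - 35                        # spill threshold for table leaf cells
    if payload <= X:
        return payload, 0             # the whole row fits in the cell
    # Minimum local payload; K is chosen so the last overflow page is full.
    M = ((U - 12) * 32 // 255) - 23
    K = M + (payload - M) % (U - 4)   # each overflow page holds U-4 payload bytes
    local = K if K <= X else M
    return local, math.ceil((payload - local) / (U - 4))

print(overflow_layout(100_000))   # (1792, 24): a 100KB payload needs 24 overflow pages
print(overflow_layout(10_000))    # (1816, 2):  a 10KB payload needs 2
```

Note that the exact local byte counts differ from the rough "fill the cell first" figures above, but the number of overflow pages, which is what drives the I/O cost, comes out the same.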
The full picture for a row that overflows looks like this:
B-tree Leaf Page
┌──────────────────────────────────────┐
│ Cell │
│ total_payload_size: 100000 │
│ local_data: [first 4057 bytes] │
│ overflow_ptr: ──────────────────── ┼──┐
└──────────────────────────────────────┘ │
│
┌───────────────┘
▼
Overflow Page 1
┌────────────────────────────┐
│ next: ──────────────────── ┼──┐
│ data: [4092 bytes] │ │
└────────────────────────────┘ │
│
┌───────────────────┘
▼
Overflow Page 2
┌────────────────────────────┐
│ next: ──────────────────── ┼──┐
│ data: [4092 bytes] │ │
└────────────────────────────┘ │
... ...
▼
Overflow Page 24
┌────────────────────────────┐
│ next: 0 (end of chain) │
│ data: [remaining bytes] │
└────────────────────────────┘
One important property: overflow pages are never shared between cells. Each row that overflows gets its own private chain. When a row is deleted, the entire chain is freed back to a freelist inside the database file and can be reused by future writes.
When You'll Run Into Overflow
Overflow isn't limited to obvious cases like storing files in a database. Here are the situations that produce overflow pages in practice.
Large BLOBs or TEXT columns
The most direct case. Any column storing images, PDFs, long documents, or large serialised payloads will overflow the moment its value exceeds ~4KB. This is the scenario most developers recognise, though many don't know the internal mechanism they've triggered.
JSON columns
Storing JSON has become a common pattern in SQLite: semi-structured data without a rigid schema. A small JSON document is fine. But JSON documents that embed arrays of objects, nested structures, or any significant amount of text will routinely exceed 4KB. A user preferences blob, an API response cached to disk, a config document: all of these can quietly tip into overflow territory.
Long free-text fields
Bios, descriptions, notes, comments. Fields where the application enforces no length
limit and users can write as much as they want. An uncapped TEXT column in a user-
generated content context is an overflow waiting to happen.
Index overflow
This one surprises people. Even a moderate-length text value (around 1KB) will overflow when indexed. Index B-tree pages have a max local payload of roughly 1007 bytes (on a 4096-byte page), which is a quarter of the table leaf limit. If you index a URL, a hashed value, a short description, or any string that can be a few hundred characters long, you may be creating overflow on the index side. The query will still be correct; it will just require more I/O than you'd expect when the optimizer uses that index.
Rows with many small columns
This is the least obvious case. Overflow is triggered by the total cell size, not by any individual column. A row with 40 columns each containing a modest amount of data can collectively exceed 4057 bytes and trigger overflow even though no single column is especially large. Wide tables with lots of columns are worth watching.
Performance Implications
Every overflow adds I/O
Reading a row that doesn't overflow costs one I/O (assuming the page isn't already cached). Reading a row with a 100KB value costs 25 I/Os: one for the leaf page, 24 for the overflow chain. At that ratio, a query that would touch 10,000 rows with no overflow reads 10,000 pages. With overflow, it reads 250,000 pages. The query is doing 25 times more work.
The relationship is linear: the more overflow pages a row has, the more I/Os are required to materialise it.
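You can watch the allocation happen with PRAGMA page_count. A minimal sketch against an in-memory database (the 4096-byte page size is set explicitly in case the build's default differs):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA page_size = 4096")   # must be set before any content exists
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, data BLOB)")
con.commit()
before = con.execute("PRAGMA page_count").fetchone()[0]

# zeroblob(n) allocates an n-byte blob of zeros without building it in SQL text
con.execute("INSERT INTO t VALUES (1, zeroblob(100000))")
con.commit()
after = con.execute("PRAGMA page_count").fetchone()[0]

print(after - before)   # 24: one insert allocated an entire overflow chain
```

One row, twenty-four new pages. Every read of that row will walk all of them.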
Overflow pages are scattered on disk
Overflow pages are allocated from the freelist (pages reclaimed from previous deletes) or appended to the end of the file. They are not adjacent to the B-tree page they belong to, and they are not guaranteed to be adjacent to each other in the chain.
A freshly created database with no deletes will have overflow pages that are roughly sequential on disk, which is a tolerable access pattern. But after any significant write/delete churn, overflow pages fragment throughout the file:
Disk layout after churn:
Page 5: B-tree leaf (overflow_ptr → 1023)
Page 47: Overflow #3 (next → 0)
Page 891: Overflow #2 (next → 47)
Page 1023: Overflow #1 (next → 891)
Reading this row requires: seek to 5, seek to 1023, seek to 891, seek to 47
Four random seeks instead of one sequential read.
On a spinning disk this is a genuine disaster. On SSD the cost is lower but not zero: random reads still consume more bandwidth and IOPS than sequential reads, and every hop through the chain is a separate read request that can't be merged or prefetched by the OS.
You can't skip a large column to reach the next one
SQLite's record format stores column values end-to-end, preceded by a header that describes the type and length of each value. Offsets are computable from that header, but the bytes of column N physically sit after the bytes of column N-1 in the payload. If column N-1 spills into overflow, the bytes of column N live somewhere down the overflow chain, and reaching them means walking the chain page by page.
This means that even a query that only projects small columns pays the overflow cost for any large column that appears earlier in the row:
-- Table: users(id INT, name TEXT, bio TEXT, email TEXT)
-- bio is 100KB
SELECT email FROM users WHERE name = 'Alice';
To read email, SQLite must parse the record header and locate email's starting
offset. To know where email starts, it must know where bio ends. To know where
bio ends, it must follow bio's entire overflow chain. Even though the query never
asked for bio, 24 extra I/Os happen per matching row.
Column ordering matters. A 100KB column early in the schema taxes every query that reads any column appearing after it, regardless of whether those queries need the large column.
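The effect is straightforward to reproduce. A sketch comparing two layouts of the same data (the table names bio_first and bio_last are mine; absolute timings vary by machine, and on an in-memory database the gap reflects page traversal rather than disk seeks):

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bio_first (id INTEGER PRIMARY KEY, bio BLOB, email TEXT)")
con.execute("CREATE TABLE bio_last  (id INTEGER PRIMARY KEY, email TEXT, bio BLOB)")

big = b"x" * 100_000   # a 100KB value that forces a long overflow chain
for i in range(200):
    con.execute("INSERT INTO bio_first VALUES (?, ?, ?)", (i, big, f"u{i}@example.com"))
    con.execute("INSERT INTO bio_last  VALUES (?, ?, ?)", (i, f"u{i}@example.com", big))
con.commit()

def scan(table):
    t0 = time.perf_counter()
    rows = con.execute(f"SELECT email FROM {table} ORDER BY id").fetchall()
    return time.perf_counter() - t0, rows

t1, r1 = scan("bio_first")   # must walk each row's overflow chain to reach email
t2, r2 = scan("bio_last")    # record parsing stops before touching bio
print(f"bio first: {t1:.4f}s   bio last: {t2:.4f}s")
```

Both queries return identical results; only the amount of data traversed to produce them differs.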
Table scans with overflow are compounding
A table scan visits every leaf page in the B-tree. If rows have overflow, the scan also visits every overflow page for every row. There's no way to scan just the B-tree portion and skip the overflow chains; the record format requires following them.
| Scenario                       | I/O per row | 1M row table scan |
| ------------------------------ | ----------- | ----------------- |
| Small rows (< 4KB)             | 1           | ~1M reads         |
| 10KB rows (2 overflow pages)   | 3           | ~3M reads         |
| 100KB rows (24 overflow pages) | 25          | ~25M reads        |
Full table scans are already a last resort. Overflow makes them dramatically worse.
What You Can Do
Move large columns to a separate table
The most effective schema-level fix. If you isolate overflow-prone columns into their own table, queries that don't need those columns never touch overflow pages.
-- Before
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT, email TEXT);
-- After
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE user_bios (user_id INTEGER PRIMARY KEY, bio TEXT);
SELECT email FROM users WHERE name = 'Alice' now reads nothing but compact rows.
SELECT bio FROM user_bios WHERE user_id = 42 still pays the overflow cost, but only
when you explicitly asked for the bio. Overflow is now a cost you opt into per query
rather than one you pay on every access.
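For an existing database, the split can be done with the classic table-rebuild pattern, which works on any SQLite version (ALTER TABLE ... DROP COLUMN would be shorter but requires SQLite 3.35+). A sketch using the example schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Starting point: the "before" schema with a row in it
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT, email TEXT)")
con.execute("INSERT INTO users VALUES (42, 'Alice', 'a very long bio...', 'alice@example.com')")

con.executescript("""
BEGIN;
CREATE TABLE user_bios (user_id INTEGER PRIMARY KEY, bio TEXT);
INSERT INTO user_bios SELECT id, bio FROM users WHERE bio IS NOT NULL;
CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
INSERT INTO users_new SELECT id, name, email FROM users;
DROP TABLE users;
ALTER TABLE users_new RENAME TO users;
COMMIT;
""")

print(con.execute("SELECT bio FROM user_bios WHERE user_id = 42").fetchone()[0])
```

Running the whole script inside one transaction means a failure partway through leaves the original table intact.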
Use covering indexes for hot query paths
A covering index includes all the columns a query needs. SQLite can satisfy the query entirely from the index without touching the table rows, and therefore without following any overflow chains in the table.
CREATE INDEX idx_users_name_email ON users(name, email);
SELECT email FROM users WHERE name = 'Alice';
The index entry for this query contains both name and email. SQLite traverses the
index B-tree, finds the matching entry, and returns the result without ever reading the
table row. If bio is in the table and causing overflow, this query is completely
unaffected.
The catch: index entries can themselves overflow, as noted earlier. Don't index large columns. Keep covering indexes to small, fixed-width columns.
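Whether a query is actually covered can be confirmed with EXPLAIN QUERY PLAN, which labels the access path. A sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, bio TEXT, email TEXT)")
con.execute("CREATE INDEX idx_users_name_email ON users(name, email)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT email FROM users WHERE name = 'Alice'"
).fetchall()
for row in plan:
    # The detail column names the access path, e.g.
    # SEARCH users USING COVERING INDEX idx_users_name_email (name=?)
    print(row[-1])
```

If the plan says COVERING INDEX, the table rows, and any overflow chains hanging off them, are never touched.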
Don't store large BLOBs in the database
The cleanest fix when you're storing actual binary content: images, documents, audio. Store the object in an object store (S3, GCS, a local filesystem) and keep only the reference in SQLite:
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, storage_key TEXT);
storage_key is a few dozen bytes at most. No overflow, ever. The large content is
fetched separately when actually needed. This also moves retrieval of large objects
off the database hot path entirely.
Increase the page size
A larger page size raises the overflow threshold. With a 16384-byte page, the local payload limit for a table leaf is roughly 16357 bytes. Rows up to ~16KB will now fit without overflow.
PRAGMA page_size = 16384;
VACUUM;
On its own, PRAGMA page_size only takes effect on a fresh database, before any
data has been written. On an existing database, run VACUUM after setting the
pragma: the file is rebuilt with the new page layout. Larger pages mean more
data is read even when only part of a page is needed, so this is a trade-off:
you reduce overflow but increase the cost of random row lookups that a smaller
page would have satisfied.
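The order of operations can be sketched like this (temporary file database; the docs table is illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "big_pages.db")
con = sqlite3.connect(path)
con.execute("PRAGMA page_size = 16384")   # must come before any content is written
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
con.commit()

# A ~10KB row that would overflow on 4096-byte pages now fits locally.
con.execute("INSERT INTO docs VALUES (1, ?)", ("x" * 10_000,))
con.commit()
print(con.execute("PRAGMA page_size").fetchone()[0])   # 16384
```

Once the first page is written, the page size is locked in; changing it later requires the pragma plus VACUUM.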
Run VACUUM to defragment overflow chains
VACUUM rebuilds the database file from scratch. Pages are written sequentially in
B-tree order, and overflow pages are written immediately after the cell they belong to.
The result is a file where overflow chains are as contiguous as they can be:
VACUUM;
After a VACUUM, a row and its overflow pages are adjacent on disk. Sequential reads
can now be served by the OS read-ahead buffer rather than individual random seeks. On
both SSDs and spinning disks, this is meaningfully faster.
VACUUM is not a permanent fix; churn gradually re-fragments the file. On write-heavy
workloads with frequent deletes, the benefit decays over time and you'd need to run it
periodically.
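The reclaim is observable from file size alone: pages freed by deletes sit on the freelist until VACUUM rebuilds the file. A sketch (the table name and churn pattern are illustrative):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "churn.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, data BLOB)")
for i in range(50):
    con.execute("INSERT INTO t VALUES (?, zeroblob(100000))", (i,))
con.commit()

con.execute("DELETE FROM t WHERE id % 2 = 0")   # free half the overflow chains
con.commit()
before = os.path.getsize(path)                  # freed pages are still in the file

con.execute("VACUUM")                           # rebuild: drop free pages, restore locality
after = os.path.getsize(path)
print(before, after)                            # the file shrinks roughly by half
```

After the rebuild, surviving rows and their overflow chains are laid out sequentially again.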
Put large columns last in the schema
Since column offsets are computed by parsing preceding columns in order, placing large overflow columns at the end of the row means that any query touching only early columns never needs to follow the overflow chain. The record parser stops as soon as it has the columns it needs.
This is a low-cost mitigation when schema changes are difficult. It won't eliminate overflow, but it limits the blast radius to queries that actually need those columns.
How It All Fits Together
A query that reads a row with overflow:
B-tree traversal
│
▼
Leaf page (1 I/O)
│
│ cell → local_data + overflow_ptr
▼
Overflow page 1 (1 I/O, likely random seek)
│
▼
Overflow page 2 (1 I/O, likely random seek)
│
...
▼
Overflow page N (1 I/O, likely random seek)
│
▼
Reconstruct full row from all chunks
│
▼
Return to query engine
Overflow pages are not part of the B-tree. They carry no keys, no pointers relevant to tree traversal, no page type metadata. They are pure sequential storage: a linked list stitched onto the B-tree at the cell level. The B-tree gets you to the right leaf. The overflow chain gets you the rest of the row. Both are necessary. Only the overflow chain is under your control via schema design.
Lessons Learned
The overflow threshold for index pages is a quarter of the table threshold: 1007 bytes vs. 4057 bytes on a 4096-byte page. This surprises most people who think of overflow as a "large BLOB" problem. Indexing a moderately long text column (a URL, an address, a product name with Unicode characters) can push an index entry into overflow territory.
Column order is a real performance variable. A large column early in the schema forces every query reading any later column to pay the overflow I/O cost, even if the query never touches the large column directly. This is easy to miss because the query result is correct regardless.
VACUUM helps more than people expect. After heavy write churn, overflow chains that were once sequential become scattered randomly across the file. A VACUUM restores locality. On databases that are read-heavy but written in batches, scheduling periodic VACUUMs can recover significant read performance without any schema changes.
Increasing page size is not free. A 16KB page means SQLite reads at least 16KB when accessing any row: useful if rows are large, wasteful if they're small and many. The right page size depends on your typical row size. For a workload with mostly compact rows and occasional large ones, normalising the large columns out is usually better than inflating the page size for everyone.
Overflow is not a bug. It is SQLite's correct and necessary mechanism for handling payloads larger than a page. The goal isn't to eliminate overflow entirely. The goal is to understand which queries are paying the cost and whether schema or query design can avoid paying it unnecessarily.
References
- SQLite File Format — B-tree Pages
- SQLite File Format — Overflow Pages
- SQLite PRAGMA page_size
- SQLite VACUUM
- SQLite EXPLAIN QUERY PLAN
- Use The Index, Luke — Clustering and Index-Only Scans
Conclusion
If you have ideas or improvements, please reach out.