You write a query against a table with 500,000 rows and an index on the column you're filtering by. The query takes 800ms. You check the index is there. It is. You run EXPLAIN QUERY PLAN and see "SCAN table" where you expected "SEARCH table USING INDEX". The index exists but the planner chose not to use it. Why? The answer is almost never "SQLite is broken". It is almost always something you did that made the index inaccessible, or that made a full scan look cheaper than an index seek to the planner.
This post covers how SQLite's query planner works, how it makes decisions, and the specific patterns that cause it to make the wrong ones.
The Problem
SQLite's query optimiser is simpler than PostgreSQL's or MySQL's. It doesn't have a full cost-based planner with table statistics informing every decision. It uses a rule-based approach for a lot of choices, supplemented by lightweight statistics when ANALYZE has been run. This means the planner is predictable, but it also means the responsibility for giving it the right conditions falls more squarely on you.
Understanding the planner well enough to avoid its blind spots is the difference between queries that run in single-digit milliseconds and queries that silently scan millions of rows every time they execute.
Prerequisites
- Familiarity with the previous posts in this series: SQLite overflow pages and the general concept of B-tree indexes
- Basic understanding of what an index is and why it's faster than a full scan
- Comfort reading SQL queries in prose form
The Schema Used Throughout This Post
Rather than switching examples with every section, every concept in this post is grounded in one schema: a simple e-commerce database with four tables.
users has 1,000,000 rows. Each row has an id, email, name, status, and created_at. Status values are 'active' (900,000 rows), 'suspended' (80,000), and 'deleted' (20,000).
orders has 5,000,000 rows. Each row has an id, user_id, status, created_at, and total_amount. Status values are 'delivered' (3,500,000 rows), 'shipped' (1,000,000), 'pending' (400,000), and 'cancelled' (100,000).
order_items has 20,000,000 rows. Each row has an id, order_id, product_id, status, quantity, and unit_price. Status values are 'active' (18,000,000 rows), 'returned' (1,500,000), and 'refunded' (500,000).
products has 50,000 rows. Each row has an id, name, category, and price.
users (1,000,000 rows) id | email | name | status | created_at
orders (5,000,000 rows) id | user_id | status | created_at | total_amount
order_items (20,000,000 rows) id | order_id | product_id | status | quantity | unit_price
products (50,000 rows) id | name | category | price
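For readers who want to follow along, the schema can be sketched as DDL via Python's sqlite3 module, which is used for all examples in this post. The column types and foreign-key clauses below are assumptions for illustration; the post only specifies column names and row counts.

```python
import sqlite3

# A minimal sketch of the post's schema. Types and REFERENCES clauses
# are assumptions -- the text only names the columns.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    email      TEXT,
    name       TEXT,
    status     TEXT,        -- 'active' | 'suspended' | 'deleted'
    created_at TEXT
);
CREATE TABLE orders (
    id           INTEGER PRIMARY KEY,
    user_id      INTEGER REFERENCES users(id),
    status       TEXT,      -- 'delivered' | 'shipped' | 'pending' | 'cancelled'
    created_at   TEXT,
    total_amount REAL
);
CREATE TABLE products (
    id       INTEGER PRIMARY KEY,
    name     TEXT,
    category TEXT,
    price    REAL
);
CREATE TABLE order_items (
    id         INTEGER PRIMARY KEY,
    order_id   INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    status     TEXT,        -- 'active' | 'returned' | 'refunded'
    quantity   INTEGER,
    unit_price REAL
);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```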
This schema is intentionally skewed. The orders.status distribution is lopsided: most orders are delivered and a small fraction are pending or cancelled. That skew will matter when we get to statistics.
How SQLite Executes a Query
Before getting into optimisation, it helps to understand what the planner is actually doing.
SQLite compiles each SQL statement into a program for a register-based virtual machine called the VDBE (Virtual Database Engine). The compilation step is where the planner operates: it takes the parsed query and decides how to satisfy it: which indexes to use, what order to join tables, whether to sort or use an indexed order.
The compiled program is a sequence of opcodes. Each opcode manipulates a small set of registers and drives the cursor(s) that walk through the B-tree. There is no vectorised execution, no parallel workers, no pre-fetching pipeline. SQLite processes one row at a time, in order, using a nested-loop model. The efficiency of a query comes almost entirely from whether the planner can avoid reading rows that don't contribute to the result.
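You can inspect the compiled program directly: plain EXPLAIN (without QUERY PLAN) dumps the VDBE opcode listing. A quick sketch, assuming nothing beyond the sqlite3 stdlib module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")

# Plain EXPLAIN returns one row per VDBE opcode:
# (addr, opcode, p1, p2, p3, p4, p5, comment)
program = con.execute("EXPLAIN SELECT * FROM orders").fetchall()
opcodes = [row[1] for row in program]
# A full-scan SELECT compiles to a small loop: Init to start, OpenRead
# to attach a cursor to the table B-tree, Rewind/Next to walk the rows,
# ResultRow to emit each one, Halt to finish.
```

The exact opcode sequence varies by SQLite version, but the loop shape (one cursor, one row at a time) is visible in every build.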
Query lifecycle:
SQL text
│
▼
Parser (builds AST)
│
▼
Query planner (chooses access paths and join order)
│
▼
Code generator (emits VDBE opcodes)
│
▼
VDBE (executes opcodes, drives B-tree cursors)
│
▼
Result rows
The planner's job is to answer one question for each table in the query: "which rows do I need, and what is the cheapest way to find them?"
Indexes in SQLite
SQLite indexes are B-trees. Each entry in an index B-tree stores the indexed column values followed by the rowid of the corresponding table row. The index entries are sorted by the indexed values.
When a query filters on an indexed column, the planner can do a binary search through the index to find the starting point, then walk forward through the leaf pages to collect matching entries. Each entry gives a rowid; the planner then uses that rowid to do a point lookup in the table B-tree to fetch the full row.
Compare that to a full table scan, which visits every leaf page of the table B-tree in order. For orders with 5,000,000 rows spread across tens of thousands of pages, a full scan reads every page. An index-driven lookup reads only the index pages needed to find the matching rowids, then one table page per matching row.
The relative cost depends on selectivity: how large a fraction of the table matches the filter condition. An index is useful when only a small fraction matches. When most rows match, the index lookup costs more than a scan, because each rowid match requires a random read into the table B-tree rather than the sequential reads a scan produces.
Reading EXPLAIN QUERY PLAN
Before debugging a slow query, the first tool to reach for is EXPLAIN QUERY PLAN. It shows the access strategy the planner chose, without executing the query.
The output is a short table with one row per table access in the plan. The important column is detail, which describes the access method. The three things you'll see most:
SCAN table: A full table scan. Every row is visited. This is not always wrong (for small tables, or for conditions that match most of the rows, it may be the right choice), but it is the first thing to look for when a query is slow.
SEARCH table USING INDEX index_name: An index seek. The planner found a usable index, performed a binary search to the starting point, and walked forward to collect matching rows.
SEARCH table USING COVERING INDEX index_name: The planner found not just a usable index but one that contains all the columns the query needs. The table B-tree is never touched. This is the fastest possible access path for a filtered read.
When you see SCAN where you expected SEARCH, that is the signal to investigate why the index was not used.
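A minimal way to watch the SCAN-to-SEARCH transition, using an in-memory database. The index name is invented, and the exact detail wording varies slightly across SQLite versions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail);
    # the detail column describes the access method.
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan("SELECT * FROM orders WHERE user_id = 42")  # SCAN orders

con.execute("CREATE INDEX idx_orders_user ON orders(user_id)")
after = plan("SELECT * FROM orders WHERE user_id = 42")   # SEARCH ... USING INDEX
```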
When the Planner Uses an Index
SQLite can use an index when the filter condition on the indexed column is one the planner knows how to translate into a range search on the index B-tree.
Equality conditions
An equality filter on an indexed column is the clearest case. Filtering orders where user_id = 42 on a table with an index on user_id allows the planner to binary-search the index to the first entry where user_id = 42, walk forward collecting all matching entries, and stop when the value changes. For a user with 200 orders in a 5,000,000-row table, that is roughly 200 rows read instead of 5,000,000.
Range conditions
Range filters on an indexed column also work. Filtering orders where created_at > '2026-01-01' on an indexed created_at column allows the planner to binary-search to the first entry at or after that timestamp and walk forward to the end of the index. The planner can use the index for any of the standard comparison operators: <, >, <=, >=, and BETWEEN.
Multi-column indexes and the left-prefix rule
A multi-column index on (status, user_id, created_at) on orders sorts entries first by status, then by user_id within each status value, then by created_at within each user. The planner can use this index for any filter that references a prefix of the indexed columns from left to right.
A filter on status = 'pending' uses the index, narrowing from 5,000,000 rows to roughly 400,000. A filter on status = 'pending' AND user_id = 42 uses the index more selectively, narrowing to the handful of pending orders for that user. A filter on status = 'pending' AND user_id = 42 AND created_at > '2026-01-01' uses it most selectively of all.
A filter on only user_id = 42 or only created_at > '2026-01-01' cannot use this index. Without constraining status first, the matching entries are scattered throughout the entire index: there is no contiguous range to search.
This is the left-prefix rule. The index is useful for any prefix of its column list, from left to right, as long as each column in the prefix has an equality or range constraint. Once you have a range constraint on a column, the planner can use subsequent columns in the index for filtering but not for narrowing the initial search range.
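The left-prefix rule is easy to verify with EXPLAIN QUERY PLAN. A sketch with an invented index name (note that without ANALYZE statistics, SQLite will not attempt a skip-scan, so the unprefixed filter falls back to a scan):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT,
    created_at TEXT, total_amount REAL)""")
con.execute("CREATE INDEX idx ON orders(status, user_id, created_at)")

def plan(sql):
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Left prefix constrained: a contiguous index range exists.
p1 = plan("SELECT * FROM orders WHERE status = 'pending' AND user_id = 42")

# First column unconstrained: matches are scattered, so the index is unusable.
p2 = plan("SELECT * FROM orders WHERE user_id = 42")
```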
What Prevents Index Use
Functions applied to the indexed column
The most common mistake. Applying a function to the column value in a filter condition makes the index inaccessible.
The index on users.email stores raw email values in sorted order, with whatever casing was written, say 'Alice@example.com'. A filter like lower(email) = 'alice@example.com' wraps the column in a function, and the lowercased value the filter needs never appears in the index. The planner cannot binary-search an index for values it does not contain, so it falls back to a full scan of 1,000,000 users.
Similarly, filtering orders with strftime('%Y', created_at) = '2026' wraps created_at in a function. The index on created_at stores raw timestamps; it has no entries for computed year values. A full scan of 5,000,000 orders follows.
The fix in each case is to move the transformation to the other side of the comparison. Instead of lower(email) = 'alice@example.com', enforce lowercase at write time and filter where email = 'alice@example.com'. Instead of strftime('%Y', created_at) = '2026', filter where created_at >= '2026-01-01' AND created_at < '2027-01-01'. Both rewrites operate on raw column values, which the index does store.
When the function cannot be moved to the other side, there are two escape hatches. SQLite supports indexes on expressions: an index built on lower(email) itself makes the lower(email) = ... filter indexable. Alternatively, store the computed value in its own column and index that column instead.
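All three cases (the hidden index, the raw-column fix, and the expression-index escape hatch) can be demonstrated in a few lines. Index names are invented; expression indexes require SQLite 3.9 or later:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
con.execute("CREATE INDEX idx_email ON users(email)")

def plan(sql):
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column hides the index: every row must be visited.
wrapped = plan("SELECT id FROM users WHERE lower(email) = 'alice@example.com'")

# Comparing the raw column value can binary-search the index.
raw = plan("SELECT id FROM users WHERE email = 'alice@example.com'")

# Escape hatch: index the expression itself, then the wrapped filter is a seek.
con.execute("CREATE INDEX idx_email_lower ON users(lower(email))")
expr = plan("SELECT id FROM users WHERE lower(email) = 'alice@example.com'")
```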
LIKE with a leading wildcard
A LIKE filter with a leading wildcard (name LIKE '%alice%' on users) cannot use a B-tree index for the same reason as a function: there is no contiguous range of matching entries in the sorted index. The planner must scan all 1,000,000 users and apply the pattern to each name.
A LIKE filter with a trailing wildcard (name LIKE 'alice%') can use a B-tree index. The matching entries form a contiguous range in the sorted index: everything from 'alice' (inclusive) up to 'alicf' (the next string after all strings starting with 'alice'). The planner can binary-search to the start of that range and walk forward. There is one piece of fine print: because LIKE is case-insensitive for ASCII by default, the planner only applies this rewrite when the index uses NOCASE collation or case_sensitive_like has been enabled.
This is why "starts with" searches are fast and "contains" searches are not, in SQLite's standard indexes. Full-text search (FTS5) is the correct tool for contains searches on users.name or products.name.
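The prefix/contains asymmetry shows up directly in the query plan. This sketch enables case_sensitive_like so the range rewrite applies to the standard BINARY-collated index (the alternative is to declare the index with COLLATE NOCASE); the index name is invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# LIKE is case-insensitive by default, which blocks the range rewrite on a
# BINARY-collated index; this pragma re-enables it for the demo.
con.execute("PRAGMA case_sensitive_like = ON")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
con.execute("CREATE INDEX idx_name ON users(name)")

def plan(sql):
    return " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

prefix   = plan("SELECT * FROM users WHERE name LIKE 'alice%'")   # range seek
contains = plan("SELECT * FROM users WHERE name LIKE '%alice%'")  # full scan
```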
Type affinity mismatches
SQLite has a type affinity system rather than strict types. Each column has an affinity (NUMERIC, INTEGER, TEXT, REAL, BLOB), and SQLite applies affinity rules when comparing values. A mismatch between the stored affinity and the type of the comparison value can prevent index use.
The most common case on this schema: orders.user_id is defined with INTEGER affinity. Filtering where user_id = '42' (a text literal) usually still works, because SQLite applies the column's INTEGER affinity to the literal before comparing. The trap is columns with weak or no affinity: a column declared without a type gets BLOB affinity, no conversion is applied, and the integer 42 and the text '42' are simply different values that never compare equal, index or no index.
The fix is to always compare with values of the correct type: integer literals for user_id, order_id, and product_id; quoted strings for status, email, and name.
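Both behaviours are easy to observe. The table and column names here are invented; the point is the affinity declaration, not the data:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# INTEGER affinity: the text literal '42' is converted before comparing.
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER)")
con.execute("INSERT INTO orders VALUES (1, 42)")
rescued = con.execute(
    "SELECT count(*) FROM orders WHERE user_id = '42'").fetchone()[0]

# No declared type => BLOB affinity: no conversion is applied, so the
# stored integer 42 and the text '42' never compare equal.
con.execute("CREATE TABLE t (x)")
con.execute("INSERT INTO t VALUES (42)")
unrescued = con.execute("SELECT count(*) FROM t WHERE x = '42'").fetchone()[0]
```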
OR conditions
A filter with OR between conditions on different columns, for example filtering orders where status = 'pending' OR user_id = 42, is a pattern the planner handles inconsistently. For an OR condition across two indexed columns, the planner needs to perform two separate index lookups and merge the results, deduplicating rowids.
SQLite does support this in some cases through the OR optimisation: if both conditions are on indexed columns in the same table, the planner can use both indexes, collect the rowids from each, sort and merge them, and then fetch the matching rows. EXPLAIN QUERY PLAN reports this as a MULTI-INDEX OR step with one SEARCH sub-step per branch.
But this only works when each branch of the OR is independently satisfiable by an index. If status is indexed but user_id is not, the condition status = 'pending' OR user_id = 42 requires a scan of all 5,000,000 orders regardless of the index on status, because there is no efficient way to find all rows where user_id = 42.
The reliable fix is to rewrite as a UNION: one query filtered by status = 'pending', another filtered by user_id = 42, combined with UNION to deduplicate. This gives the planner two simple, independently indexable queries.
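The rewrite is mechanical, and UNION's built-in deduplication reproduces the OR semantics exactly. A sketch against a tiny invented dataset (row 1 matches both branches, so deduplication matters):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 42, 'pending'), (2, 7, 'pending'),
                 (3, 42, 'delivered'), (4, 9, 'shipped')])
con.execute("CREATE INDEX idx_status ON orders(status)")
con.execute("CREATE INDEX idx_user ON orders(user_id)")

with_or = con.execute(
    "SELECT id FROM orders WHERE status = 'pending' OR user_id = 42 ORDER BY id"
).fetchall()

# Two simple, independently indexable queries; UNION deduplicates rowids
# that satisfy both branches (id 1 here).
with_union = con.execute("""
    SELECT id FROM orders WHERE status = 'pending'
    UNION
    SELECT id FROM orders WHERE user_id = 42
    ORDER BY id
""").fetchall()
```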
NOT and inequality filters on low-selectivity conditions
A B-tree index gives the planner no contiguous range to seek into for inequality filters (!=, NOT IN, NOT LIKE); at best the condition is applied as a row-by-row filter. Selectivity usually rules out index help anyway: a condition like orders.status != 'cancelled' eliminates only 100,000 rows from a 5,000,000-row table. The filter matches 98% of rows, and a scan is almost certainly cheaper than any plan that produces 4,900,000 rowids each requiring a random table lookup.
Without ANALYZE data, the planner uses hard-coded heuristics for selectivity. It tends to overestimate the selectivity of inequality filters, sometimes choosing an index-driven path when a scan would be faster. Running ANALYZE gives the planner actual row count estimates to work with.
The Role of ANALYZE
ANALYZE scans each index and builds summary statistics stored in the sqlite_stat1 table. For each index, it records the approximate number of rows per unique combination of indexed values.
These statistics change the planner's cost estimates significantly. Without them, the planner uses hard-coded heuristics. With them, it can compare the estimated number of rows returned by each index and choose the most selective one.
The sqlite_stat1 table has one row per index. The stat column is a space-separated list of integers: the total row count in the table, followed by the average number of rows per unique value of the first column in the index, then per unique pair of the first two columns, and so on.
For an index on (status, user_id) on orders, after ANALYZE, the stat might read 5000000 1250000 5. This tells the planner: there are 5,000,000 rows total; on average, 1,250,000 rows share each value of status (roughly right for four distinct values); on average, 5 rows share each (status, user_id) pair (roughly right for a million users each with a few orders per status). A filter on status = 'pending' AND user_id = 42 is estimated at 5 rows, highly selective, and the planner will strongly prefer the index.
Run ANALYZE after bulk loads, significant inserts, or schema changes. It is not run automatically. The statistics become stale as data changes; if the ratio of pending to delivered orders shifts significantly, the planner's estimates will be wrong until ANALYZE is run again.
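Running ANALYZE and reading sqlite_stat1 back takes a few lines. The dataset and its skew here are invented and tiny; the exact averages SQLite records depend on internal rounding, so treat the numbers as approximate:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, user_id INTEGER)")
con.execute("CREATE INDEX idx ON orders(status, user_id)")

# A small skewed load: 70% delivered, 20% shipped, 10% pending.
rows = [(i,
         'delivered' if i % 10 < 7 else ('shipped' if i % 10 < 9 else 'pending'),
         i % 50)
        for i in range(1000)]
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

con.execute("ANALYZE")  # never runs automatically

# stat is a space-separated list: total rows, avg rows per status value,
# avg rows per (status, user_id) pair.
stat = con.execute(
    "SELECT stat FROM sqlite_stat1 WHERE idx = 'idx'").fetchone()[0]
total, per_status, per_pair = (int(n) for n in stat.split()[:3])
```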
The limit of sqlite_stat1: averages hide skew
sqlite_stat1 stores averages. For a column with a uniform distribution, an average is a reasonable proxy for any specific value's selectivity. But the orders schema is explicitly skewed, and that skew exposes the limitation directly.
With four distinct status values and 5,000,000 rows, the average is 1,250,000 rows per status. But the actual distribution is: 'delivered' (3,500,000 rows), 'shipped' (1,000,000), 'pending' (400,000), and 'cancelled' (100,000). The planner uses 1,250,000 as its estimate for every equality filter on status, regardless of which value is being queried.
A filter on status = 'cancelled' matches 100,000 rows. The planner thinks it matches 1,250,000. It may conclude the index is not selective enough to be worth using and fall back to a full scan of 5,000,000 rows. A filter on status = 'delivered' matches 3,500,000 rows. The planner still thinks it matches 1,250,000, so it may use the index when a scan would actually be cheaper.
The average is wrong in both directions simultaneously. This is the fundamental limitation of sqlite_stat1 for skewed data.
sqlite_stat4: sample-based estimates
sqlite_stat4 addresses this by storing actual sample rows from each index rather than just averages. When ANALYZE runs with stat4 enabled, it samples up to 24 representative rows from each index, recording their key values and the number of rows that fall between consecutive samples. This gives the planner a histogram it can interpolate against.
For the status = 'cancelled' filter, the planner can find 'cancelled' among the samples, read the associated row count (100,000), and produce an accurate estimate rather than the global average. For a range query on created_at (say, all orders in January 2026), sqlite_stat1 has no way to estimate how many rows fall within that window. stat4's samples let the planner interpolate between known sample points and scale by the inter-sample density.
The compile-time caveat
sqlite_stat4 is not enabled in all SQLite builds. It requires the SQLITE_ENABLE_STAT4 flag at compile time. Most prebuilt SQLite distributions (including the one embedded in Python, Android, iOS, and the majority of Linux packages) do not enable it. You are almost certainly running with stat1 only unless you compiled SQLite yourself or are using a distribution that explicitly enables stat4.
You can check by querying the stat4 table directly. If it raises an error, stat4 is not compiled in. If it returns rows after ANALYZE, it is.
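A small probe function makes the check concrete. Note the ANALYZE first: even on a stat4-enabled build, the sqlite_stat4 table only exists after ANALYZE has run, so querying it cold gives a false negative:

```python
import sqlite3

def stat4_enabled(con):
    """True if this build was compiled with SQLITE_ENABLE_STAT4.

    Assumes ANALYZE has already run on this connection; the stat
    tables are only created by ANALYZE.
    """
    try:
        con.execute("SELECT * FROM sqlite_stat4 LIMIT 1")
        return True
    except sqlite3.OperationalError:  # "no such table: sqlite_stat4"
        return False

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.execute("CREATE INDEX idx ON t(x)")
con.execute("ANALYZE")
enabled = stat4_enabled(con)  # False on most prebuilt distributions
```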
The practical implication: on standard SQLite builds, the planner's estimates for range queries and skewed equality filters are always based on averages. For tables like orders with a heavily skewed status distribution, ANALYZE alone may not be enough to prevent a bad plan. Schema changes (partial indexes on the sparse values, separating hot rows into a smaller table) often matter more than statistics in these cases.
Join Optimisation
SQLite uses a nested-loop join strategy. There are no hash joins, no merge joins, no parallel join workers. For every pair of tables in a query, SQLite picks one as the outer table and one as the inner table, then for each row produced by the outer side, it performs a lookup into the inner table for matching rows. If there are three tables, the result of the first two becomes the outer input for the third, and so on.
This model is simple and predictable, but its cost is entirely determined by two things: which table is outer and whether the inner table has an index on the join column. Get both of those right and the join is fast. Get either one wrong and the cost compounds badly.
The cost model
Consider fetching all order items for a specific user. The query joins orders to order_items on orders.id = order_items.order_id, with a filter on orders.user_id = 42. User 42 has 200 orders.
If orders is outer (filtered to 200 rows via an index on user_id) and order_items has an index on order_id, the planner does one binary search into order_items per order: roughly 24 comparisons to reach the matching entries (log₂ of 20,000,000), then fetches the matching rows. For 200 orders that is about 4,800 index operations plus the actual row reads.
If the join order is reversed, with the unfiltered order_items (20,000,000 rows) as the outer table, each of the 20,000,000 item rows triggers a lookup into orders by its primary key, after which the user_id = 42 filter discards almost all of them. Each individual lookup is cheap (a rowid probe into the orders B-tree), but 20,000,000 of them dwarfs the 200-order path, because the highly selective filter is applied last instead of first.
The difference is not subtle. The variables are: which table is outer, how many rows it produces after filtering, and whether the inner join column is indexed.
How the planner chooses join order
SQLite's planner enumerates all possible join orderings for queries with up to about 7 tables and estimates the cost of each. The cheapest ordering wins.
The cost estimate for a join ordering is based on:
- the estimated number of rows produced by the outer side (after applying any WHERE filters on that table)
- the cost of each inner lookup (index seek vs. full scan)
- the product of the two at each nesting level, summed across all tables in the ordering
Without ANALYZE data, the planner estimates table sizes from page counts and uses fixed selectivity estimates for filter conditions. With ANALYZE, it uses actual row distributions from sqlite_stat1 to estimate how many rows survive each filter.
The implication is direct: a query that joins orders filtered to status = 'pending' (400,000 real rows) with users (1,000,000 rows) should put orders as the outer table. Without statistics, the planner may estimate the status = 'pending' filter at the global average of 1,250,000 rows, making orders appear larger than users and potentially reversing the join order. Running ANALYZE gives it the real number of 400,000.
Join order with a WHERE filter on the inner table
A WHERE filter on the inner table changes the dynamics. Consider joining orders to order_items on orders.id = order_items.order_id, with a filter on order_items.status = 'returned'.
If order_items is inner and has an index on (order_id, status), the planner can binary-search to the exact range of entries where order_id = ? and then apply the status = 'returned' filter while walking that range. The combined index means both the join condition and the WHERE filter are applied in a single index traversal.
If the index only covers order_id, the planner can use the index for the join but must then apply status = 'returned' as a row-by-row filter after fetching each row from the table. For an order with 10 items where only 1 is returned, that means 10 table lookups to find 1 matching row. The (order_id, status) index produces 1 lookup directly.
The general rule: for inner tables that have both a join condition and a WHERE filter, an index that covers the join column first and the filter column second will almost always outperform an index on either column alone.
The wrong join order is silent
When the planner chooses the wrong join order, no error is raised, no warning is emitted. The query returns the correct result. It just takes longer than it should. The only way to detect it is to run EXPLAIN QUERY PLAN and check which table appears as the outer table in each join step.
EXPLAIN QUERY PLAN reports join steps as nested SCAN or SEARCH entries. The first table listed at each nesting level is the outer table. If you see the large, unfiltered table as the outer input feeding into the small, filtered table as the inner lookup, the order is wrong.
The fix is either to run ANALYZE so the planner has accurate row counts, or to force the join order by rewriting the query. SQLite respects the order of tables in a CROSS JOIN as a hint: FROM orders CROSS JOIN order_items forces orders to be outer even if the planner would have chosen otherwise.
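The CROSS JOIN hint can be verified in the plan output: the first SCAN/SEARCH row of EXPLAIN QUERY PLAN names the outer table. A sketch with invented index names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER);
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_items_order ON order_items(order_id);
""")

def outer_table(sql):
    # The first EXPLAIN QUERY PLAN row describes the outer table; its
    # detail reads "SCAN <table>" or "SEARCH <table> USING ...".
    details = [r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql)]
    return details[0].split()[1]

# CROSS JOIN pins the left-hand table as outer, whatever the planner
# would otherwise have chosen.
forced = outer_table("""
    SELECT oi.id
    FROM orders CROSS JOIN order_items AS oi
    ON orders.id = oi.order_id
    WHERE orders.user_id = 42
""")
```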
Three-table join example
Consider a query that fetches the names of products in orders placed by user 42. This joins orders, order_items, and products.
User 42 has 200 orders. Each order has on average 4 items. Those 800 items reference various products.
The optimal join order is:
- Start with orders filtered by user_id = 42 using an index on orders.user_id: 200 rows
- For each order, look up its items in order_items via an index on order_items.order_id: ~800 total rows across all orders
- For each item, look up the product via order_items.product_id referencing the primary key of products: 1 lookup per item
Total inner lookups: 200 (order_items) + 800 (products) = 1,000 indexed lookups.
The worst join order would be to start with the unfiltered products table (50,000 rows outer), then look up matching order_items for each product, then check whether those items belong to orders from user 42. Even with indexes, this visits far more rows than the optimal order because the highly selective user_id = 42 filter is applied last rather than first.
The planner will find the good ordering if indexes exist on orders.user_id, order_items.order_id, and order_items.product_id, and if ANALYZE has been run so it knows that user_id = 42 produces 200 rows rather than an estimated fraction of 5,000,000.
Subqueries vs. joins
In SQLite, a correlated subquery in the WHERE clause is executed as a nested loop: once per row in the outer query. A query that finds all users who have at least one 'pending' order, written as a correlated subquery against orders, has the same cost profile as a nested-loop join: once per user, the subquery scans or seeks into orders.
Whether a correlated subquery or an explicit JOIN performs better depends on the planner's ability to push the filter down. SQLite's planner does flatten some subqueries into joins, but not all. If EXPLAIN QUERY PLAN shows a correlated subquery being evaluated as a full scan of orders for every user row, rewriting it as an explicit JOIN on users.id = orders.user_id with a filter on orders.status = 'pending' often gives the planner more flexibility to choose an efficient order and use indexes correctly.
Self-joins
A self-join (joining a table to itself) follows the same rules. Suppose users had a referred_by_user_id column tracking which user referred each new signup. Walking one level of the referral tree (finding all users referred by user 42) joins users to itself on users.referred_by_user_id = users.id with a filter on the outer alias's id = 42. An index on referred_by_user_id makes the inner lookup fast; without it, every outer row triggers a full scan of all 1,000,000 users.
Covering Indexes
A covering index is an index that contains all the columns a query needs: filter columns, projected columns, and any columns used in ORDER BY or GROUP BY. When the planner can satisfy the entire query from the index alone, it never touches the table B-tree.
Consider a query that fetches the email of all 'active' users. With an index only on users.status, the planner binary-searches the index to find rows where status = 'active' (900,000 of them), then does a table lookup for each rowid to fetch email. That is 900,000 random reads into the users table B-tree.
If the index is extended to (status, email), the index entry already contains email. The planner can return it directly without touching the table. Those 900,000 random reads become sequential reads through the index. For a hot query running thousands of times a day, this is a significant difference.
The trade-off is index size and write overhead. An index on (status, email) is significantly larger than one on status alone. Every insert and update to users must maintain the additional index entry. Covering indexes are worth building for the queries that run most frequently on the largest tables; they are not worth it for every query.
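The COVERING INDEX designation appears directly in the plan output once the index contains every column the query touches. A sketch with invented index names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT, status TEXT)")
con.execute("CREATE INDEX idx_status ON users(status)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Index on status alone: seek the index, then one table lookup per row
# to fetch email.
non_covering = plan("SELECT email FROM users WHERE status = 'active'")

# (status, email) carries everything the query needs: no table access.
con.execute("CREATE INDEX idx_status_email ON users(status, email)")
covering = plan("SELECT email FROM users WHERE status = 'active'")
```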
ORDER BY and Avoiding Sorts
When a query has an ORDER BY clause, SQLite must either produce rows in the required order directly or collect all rows and sort them at the end. The sort is an additional pass over the result set and is expensive when the result set is large.
An index whose column order matches the ORDER BY column order eliminates the sort. The planner can walk the index in order and produce rows already sorted, without materialising and sorting the entire result set.
Consider fetching a user's recent orders sorted by created_at DESC. An index on (user_id, created_at) on orders allows the planner to binary-search to user 42's entries, then walk backward through created_at values in descending order. No sort step is needed.
Without this index, the planner fetches all of user 42's orders using whatever index it has, collects them, and sorts. When EXPLAIN QUERY PLAN shows USE TEMP B-TREE FOR ORDER BY, a sort is happening. That is the signal that the index structure does not match the ORDER BY requirement.
The same principle applies when filtering and ordering together. An index on (status, created_at) on orders with a filter on status = 'pending' and ORDER BY created_at allows the planner to binary-search to the pending entries and walk forward in created_at order without a sort: all 400,000 pending orders delivered in chronological order directly from the index.
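The USE TEMP B-TREE FOR ORDER BY marker appears and disappears as the index shape changes. A sketch of the recent-orders case, with invented index names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, user_id INTEGER,
    created_at TEXT, total_amount REAL)""")
con.execute("CREATE INDEX idx_user ON orders(user_id)")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Index lacks created_at: rows are collected, then sorted in a temp b-tree.
sort_needed = plan("""SELECT * FROM orders WHERE user_id = 42
                      ORDER BY created_at DESC""")

# (user_id, created_at) delivers rows already ordered; walking the index
# backward satisfies DESC with no sort step.
con.execute("CREATE INDEX idx_user_created ON orders(user_id, created_at)")
sort_avoided = plan("""SELECT * FROM orders WHERE user_id = 42
                       ORDER BY created_at DESC""")
```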
Partial Indexes
SQLite supports partial indexes: indexes that cover only a subset of rows, defined by a WHERE clause on the index. A partial index on (user_id, created_at) WHERE status = 'pending' on orders contains entries only for the 400,000 pending orders, not the full 5,000,000. It is smaller than a full index, faster to maintain, and available to the planner for any query whose WHERE clause is compatible with the index condition.
This is particularly useful given the skewed status distribution. Most operational queries care about pending and shipped orders, the small, active portion of the table. An index covering only those rows is a fraction of the size of a full index and avoids the planner's confusion about how selective status filters are: a partial index on WHERE status = 'pending' is definitionally selective, because it only contains 400,000 rows.
The planner will use it automatically when the query's WHERE clause implies the index's WHERE clause. A query that filters status = 'pending' AND user_id = 42 can use the partial index on (user_id, created_at) WHERE status = 'pending' because the query's condition implies the index's condition.
The catch is specificity: the index is only usable for queries whose filter is compatible with the index condition. The partial index on WHERE status = 'pending' cannot help a query filtering for status = 'shipped'.
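Both the hit and the miss are visible in the plan output. A sketch with an invented partial-index name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, created_at TEXT)""")
con.execute("""CREATE INDEX idx_pending ON orders(user_id, created_at)
               WHERE status = 'pending'""")

def plan(sql):
    return " ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# The query's WHERE implies the index's WHERE: the partial index is usable.
hit = plan("SELECT * FROM orders WHERE status = 'pending' AND user_id = 42")

# 'shipped' does not imply 'pending': the partial index is ignored.
miss = plan("SELECT * FROM orders WHERE status = 'shipped' AND user_id = 42")
```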
How It All Fits Together
A query arrives:
Parse and bind parameters
│
▼
Enumerate candidate indexes for each table
│
├── Check: is the index column used in a WHERE clause?
├── Check: is the comparison operator compatible (=, <, >, LIKE prefix)?
├── Check: is the column free of wrapping functions?
└── Check: does the index satisfy the left-prefix rule for multi-column indexes?
│
▼
Estimate cost of each candidate (rows read, table lookups required)
│ - Uses sqlite_stat1 if ANALYZE has been run
│ - Falls back to hard-coded heuristics otherwise
│
▼
Choose lowest-cost access path per table
│
▼
Choose join order (all permutations for <= 7 tables, heuristics beyond)
│
▼
Emit VDBE opcodes
│
▼
Execute: walk B-trees, filter rows, project columns, sort if needed
Every step in this pipeline has a way to go wrong. Wrapping orders.created_at in strftime() removes the index from consideration in the first pass. Statistics that no longer reflect the real pending/cancelled distribution lead to a wrong cost estimate in the second. A missing index on order_items.order_id makes the join to order_items scan 20,000,000 rows per order. A missing created_at suffix on the user's orders index forces a sort that could have been avoided.
Lessons Learned
The planner only knows what you tell it. If ANALYZE has not been run, the planner is guessing selectivity with heuristics. On the orders table with its skewed status distribution, those heuristics will be wrong for 'cancelled' queries (100K rows, estimated at 1.25M) and wrong in the other direction for 'delivered' queries (3.5M rows, also estimated at 1.25M). Run ANALYZE after any significant data change.
Functions on indexed columns are silent index killers. A filter on lower(email) scans all 1,000,000 users even if email has an index. The query returns the right result; the planner simply does far more work than necessary. Enforce normalisation at write time and filter on the raw column value.
OR across columns is a signal to reconsider. The condition status = 'pending' OR user_id = 42 on orders is harder for the planner than two independent queries combined with UNION. When an OR condition causes a scan where you expected index use, the UNION rewrite is the most reliable fix.
A covering index is the ceiling, not the floor. The goal of index design is not just "don't scan": it is "don't even touch the table". For the query that fetches active users' emails thousands of times a day, the 900,000 table lookups after the index seek cost more than the index traversal itself. An index on (status, email) eliminates them entirely.
Join column indexes on the inner table are not optional. A missing index on order_items.order_id turns a join from orders into 5,000,000 full scans of a 20,000,000-row table. The cost compounds in a way that is easy to underestimate until you see the query plan.
EXPLAIN QUERY PLAN is cheap to run and expensive to ignore. Add it to your development workflow for any query that touches more than a few thousand rows. The output is terse and readable in seconds. The queries it would flag can silently degrade for months in production before someone notices.
References
- SQLite Overflow Pages - When Your Rows Don't Fit - previous post in this series on SQLite storage internals
- SQLite Query Planner Overview
- SQLite EXPLAIN QUERY PLAN
- SQLite ANALYZE
- SQLite Query Optimiser Tracing
- SQLite Partial Indexes
- SQLite Expression Indexes
- Use The Index, Luke - Index Selectivity
Conclusion
Please reach out to me here for more ideas or improvements.