Back in 2016 or 2017, I first ran into a problem where our system just wasn't handling the load we were throwing at it. That was my first real "deep dive" into Elasticsearch (ES). I was looking for a way to distribute load across clusters, and while ES handled that load beautifully, it also opened my eyes to the complexity of how data is actually laid out under the hood. If you've ever wondered why your write patterns affect performance, or how ES manages to be so fast, you have to look at Lucene segments. Let's break down the internals.
The Anatomy: Shards, Indices, and Segments
To understand ES, you first have to understand its relationship with Lucene. In the ES world, an index is roughly what a table is in Postgres. Physically, though, an index is made up of shards. A shard is the functional, scalable unit of your data: a physical container that is, internally, a complete Lucene index. Because it is a full Lucene index, a shard can run searches on its own, without needing extra metadata from anywhere else. Going one level deeper, a Lucene index is made up of multiple segments. For Lucene, the segment is the most atomic, granular unit of the data store.
┌───────────────────────────────────────────────────────────────────────────────
│                        ELASTICSEARCH SHARD ARCHITECTURE
├───────────────────────────────────────────────────────────────────────────────
│ 1. The Shard Unit (Physical Container)
│    [Storage]          ───►  A directory on disk containing Lucene files
│    [Immutability]     ───►  Composed of multiple immutable Segments (.seg)
│    [Scaling]          ───►  Smallest unit moved during Cluster Rebalancing
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. Primary Shard (The Writer)
│    [Write Path]       ───►  Validates request ──► Buffers ──► Syncs Replicas
│    [Sequencing]       ───►  Assigns sequence numbers for consistency
│    [Status]           ───►  One per shard group; must be active for writes
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. Replica Shards (The Readers)
│    [Redundancy]       ───►  Exact copies of the Primary on different nodes
│    [Read Throughput]  ───►  Parallelizes search queries across multiple nodes
│    [Failover]         ───►  Promoted to Primary if the original node fails
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. Internal Shard Components
│    [Inverted Index]   ───►  Text search engine (Terms ──► Docs)
│    [BKD Tree]         ───►  Numeric / geospatial index
│    [Global Checkpt]   ───►  Tracks synchronization state between replicas
│    [Translog]         ───►  Local WAL for recovering uncommitted segments
└───────────────────────────────────────────────────────────────────────────────
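If you want to see this hierarchy on a real cluster, the _cat/segments API lists every Lucene segment sitting behind each shard. Here is a minimal sketch using Python and the requests library, assuming a local, unsecured dev cluster at http://localhost:9200 and a made-up products index:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Create an index with 3 primary shards, each with 1 replica.
    requests.put(f"{ES}/products", json={
        "settings": {"number_of_shards": 3, "number_of_replicas": 1}
    }).raise_for_status()

    # Index one document; refresh=true forces a segment to be created now.
    requests.put(f"{ES}/products/_doc/1?refresh=true",
                 json={"name": "ssd", "price": 99}).raise_for_status()

    # Each row of the output is one immutable Lucene segment inside one shard.
    print(requests.get(f"{ES}/_cat/segments/products?v").text)

Every row you get back lives inside a shard directory on some data node: the shard is what ES moves around the cluster, and the segments are what Lucene actually reads.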
Multiple Representations of Data
One of the coolest things about Lucene is that it doesn't just store your data once. It stores it in multiple formats within a segment to support different query types:
- Inverted Index: the classic search-engine structure used to find terms across documents.
- Doc Values: a columnar store used for aggregations (like calculating totals or bucketing data).
- BKD Trees: k-dimensional trees used for numeric, geospatial, and other multidimensional searches.
┌───────────────────────────────────────────────────────────────────────────────
│                           LUCENE SEGMENT (Immutable)
├───────────────────────────────────────────────────────────────────────────────
│ 1. Inverted Index (Search Core)
│    [Term Dictionary]     ───►  [Term Index (FST)]
│    [Postings Lists]      ───►  {DocID, TermFreq, Positions, Offsets, Payloads}
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. Stored Fields (Document Storage)
│    [Field Index (.fdx)]  ───►  Pointer to Document Row
│    [Field Data (.fdt)]   ───►  {Field1, Field2, ...} (Row-based storage)
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. DocValues (Columnar Storage)
│    [Field A]             ───►  [Val 1, Val 2, Val 3, ...] (Optimized for Sorting/Aggr)
│    [Field B]             ───►  [Val 1, Val 2, Val 3, ...]
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. Metadata & Auxiliary Structures
│    [Term Vectors]        ───►  Per-document Inverted Index
│    [Norms]               ───►  Normalization factors for Scoring
│    [Live Documents]      ───►  Bitset for Deletions (.del file)
│    [Points/Dimensions]   ───►  BKD Tree for Numeric/Geo spatial data
└───────────────────────────────────────────────────────────────────────────────
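The mapping decides which of these structures each field lands in, and the query decides which structure gets read. Here is a hedged sketch, reusing the same local-cluster assumption and a hypothetical shops index:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Each field type below is served by a different per-segment structure.
    requests.put(f"{ES}/shops", json={
        "mappings": {
            "properties": {
                "description": {"type": "text"},      # inverted index
                "category":    {"type": "keyword"},   # doc values (columnar)
                "location":    {"type": "geo_point"}  # BKD tree (points)
            }
        }
    }).raise_for_status()

    # The match query walks the inverted index; the terms aggregation
    # reads the matching documents back out of the columnar doc values.
    resp = requests.post(f"{ES}/shops/_search", json={
        "query": {"match": {"description": "coffee"}},
        "aggs":  {"by_category": {"terms": {"field": "category"}}},
        "size": 0
    })
    print(resp.json())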
The Power of Immutability
In most databases, if you update a row, the engine modifies it in place. Lucene does things differently: segments are immutable. When you update a document, Lucene doesn't change the old one. Instead, it performs an append-only operation: it marks the old document as deleted in a bitset and indexes the updated version into a new segment. Because new segments are constantly being created, Lucene merges segments in the background. This keeps the number of segments from exploding; since every search has to fan out across all live segments, an unbounded pile of small segments would make the parallelism required for searching too resource-intensive.
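You can watch this from the outside. The sketch below (same local-cluster assumption, an illustrative orders index) overwrites a document and then reads the docs.deleted column from _cat/segments, which counts copies that are only tombstoned:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # First version of the document lands in a new segment.
    requests.put(f"{ES}/orders/_doc/1?refresh=true", json={"status": "pending"})

    # The "update" writes a fresh copy into another segment and merely
    # flips the old copy's bit in the live-documents bitset.
    requests.put(f"{ES}/orders/_doc/1?refresh=true", json={"status": "shipped"})

    # docs.deleted > 0 shows the tombstoned copy still parked in the old segment.
    print(requests.get(
        f"{ES}/_cat/segments/orders?v&h=segment,docs.count,docs.deleted").text)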
You might wonder: if Lucene is reshuffling data in the background during a merge, how do searches stay consistent? Lucene uses reference counting. When a query starts, it pins exactly the set of segments it needs to touch. If a merge completes mid-query, the old segment files stay on disk as a kind of "shadow" copy: the active query finishes against those old segments, while new queries are directed to the freshly merged segment. Once the reference count on an old segment drops to zero, its files are finally deleted.
┌───────────────────────────────────────────────────────────────────────────────
│                       LUCENE SEGMENT MERGING (Maintenance)
├───────────────────────────────────────────────────────────────────────────────
│ 1. The Trigger (Merge Policy)
│    [Tiered Policy]   ───►  Monitors segment count, sizes, and delete %
│    [Threshold]       ───►  Triggered when too many small segments accumulate
│    [Goal]            ───►  Maintain a logarithmic number of segments
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. The Selection Phase
│    [Candidates]      ───►  Picks N small segments (often similar in size)
│    [Exclusions]      ───►  Extremely large segments are often left alone
│    [Deletions]       ───►  Prioritizes segments with many "marked" deletes
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. The Execution (Compact & Purge)
│    [New Segment]     ───►  A fresh, larger segment is built from candidates
│    [Data Transfer]   ───►  Re-indexes Inverted Index, BKD Trees, & DocValues
│    [Purge]           ───►  Documents marked in .del files are NOT copied
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. The Switchover (Atomic Commit)
│    [Warm-up]         ───►  New large segment is fsync'd and opened
│    [Atomic Swap]     ───►  Shard metadata updates to point to the new segment
│    [Cleanup]         ───►  Old small segment files are deleted from disk
└───────────────────────────────────────────────────────────────────────────────
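Normally you leave all of this to the tiered merge policy, but the _forcemerge API triggers the same compaction by hand, which can be handy for indices that no longer receive writes. A sketch, continuing with the hypothetical orders index from above:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Collapse each shard down to (at most) one segment and physically
    # drop every document that was only marked as deleted.
    requests.post(f"{ES}/orders/_forcemerge?max_num_segments=1").raise_for_status()

    # After the merge, docs.deleted should read 0 for the surviving segment.
    print(requests.get(
        f"{ES}/_cat/segments/orders?v&h=segment,docs.count,docs.deleted").text)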
The Write Path: Translog vs. Memory Buffer
When you write to ES, the document takes two parallel paths to ensure both speed and durability:
- Translog (Write Ahead Log): an append-only record written directly to disk. Even if the node crashes before the data has been committed into a Lucene segment, the writes can be replayed from the translog.
- Internal Memory Buffer: simultaneously, the document is added to an in-memory indexing buffer. On the next refresh this buffer becomes a searchable segment, so the data is queryable almost immediately, well before it is durably committed to a disk-based Lucene segment.
┌───────────────────────────────────────────────────────────────────────────────
│                      ELASTICSEARCH WRITE PATH (Logical Flow)
├───────────────────────────────────────────────────────────────────────────────
│ 1. Ingestion Point (Primary Shard)
│    [Document Input] ──┬──►  [In-Memory Indexing Buffer]
│                       └──►  [Translog (Transaction Log / WAL)]
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. The Refresh (Searchability - Default: 1s)
│    [Indexing Buffer]   ───►  [New Lucene Segment Creation]
│    [Segment Files]     ───►  [OS Filesystem Cache (RAM)]
│    [Status]            ───►  DATA BECOMES SEARCHABLE
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. The Flush (Durability - Default: 30m or 512MB)
│    [FS Cache Segments] ───►  [Lucene Commit (fsync to Physical Disk)]
│    [Translog Status]   ───►  [Purge/Trim Old Log Entries]
│    [Status]            ───►  DATA IS HARD-PERSISTED
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. Background Maintenance
│    [Merge Policy]      ───►  Combine Small Segments (.seg) into Large Ones
│    [Cleanup]           ───►  Reclaim space from Deleted Docs (.del markers)
│    [Structure]         ───►  Build/Update BKD Trees & Inverted Index
└───────────────────────────────────────────────────────────────────────────────
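Both stages are tunable, and both can be invoked by hand. The sketch below assumes the same local cluster and an illustrative logs index; it relaxes the refresh interval for a write-heavy workload, then shows the manual _refresh and _flush calls that correspond to steps 2 and 3 above:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Refresh every 30s instead of every 1s: fewer, larger in-memory
    # segments, at the cost of data taking up to 30s to become searchable.
    requests.put(f"{ES}/logs", json={
        "settings": {"index": {"refresh_interval": "30s"}}
    }).raise_for_status()

    # refresh=wait_for blocks until the next refresh makes this document visible.
    requests.post(f"{ES}/logs/_doc?refresh=wait_for",
                  json={"msg": "payment accepted"}).raise_for_status()

    # Manual equivalents of the two stages: _refresh cuts a new searchable
    # segment; _flush performs the Lucene commit and trims the translog.
    requests.post(f"{ES}/logs/_refresh").raise_for_status()
    requests.post(f"{ES}/logs/_flush").raise_for_status()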
Practical Takeaways: Denormalisation vs. Normalisation
In a recent project, we were storing data in a denormalised format, with everything in one document. This is fantastic for read performance because the entire block is fetched at once. However, if you have large documents (e.g., 1 MB) that update frequently, you'll put massive pressure on JVM memory and disk I/O, because every small update re-indexes the whole 1 MB document into a new segment. In those cases, a normalised format or a parent-child relationship might save your memory, though you'll pay a 5x to 10x cost in query performance, because you now have to fire multiple queries and correlate the results.
┌─────────────────────────────────────────────────────────────────────────────────────
│              ELASTICSEARCH RELATIONSHIP PERFORMANCE BENCHMARK (2026)
├─────────────────────────────────────────────────────────────────────────────────────
│ READ SPEED (Query Throughput)
│ FAST ─────────────────────────────────────────────────────────────────────────► SLOW
│
│    DENORMALIZED              NESTED FIELDS             JOIN RELATION
│    (Flat Documents)          (Hidden Sub-Docs)         (Parent-Child)
│    [1x]                      [2x - 5x]                 [5x - 10x+]
│
├─────────────────────────────────────────────────────────────────────────────────────
│ WRITE SPEED (Indexing Latency)
│ FAST ─────────────────────────────────────────────────────────────────────────► SLOW
│
│    JOIN RELATION             DENORMALIZED              NESTED FIELDS
│    (Independent Docs)        (Full Doc Update)         (Mapping Bloat)
└─────────────────────────────────────────────────────────────────────────────────────
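For completeness, this is roughly what the three modelling options look like as mappings. It is only an illustrative sketch (local cluster assumed, index names made up), not a recommendation for any particular schema:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Option 1: denormalised - the order document carries a full copy
    # of the customer fields it needs.
    requests.put(f"{ES}/orders_flat", json={
        "mappings": {"properties": {
            "customer_name": {"type": "keyword"},
            "customer_tier": {"type": "keyword"},
            "amount":        {"type": "double"}
        }}
    })

    # Option 2: nested - line items become hidden sub-documents of the order.
    requests.put(f"{ES}/orders_nested", json={
        "mappings": {"properties": {
            "items": {"type": "nested", "properties": {
                "sku": {"type": "keyword"},
                "qty": {"type": "integer"}
            }}
        }}
    })

    # Option 3: join - customer and order are independent documents linked
    # by a parent-child relation that is resolved at query time.
    requests.put(f"{ES}/orders_joined", json={
        "mappings": {"properties": {
            "relation": {"type": "join",
                         "relations": {"customer": "order"}}
        }}
    })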
Conclusion
Whether it's the Master node managing cluster state, the Data node holding your segments, or the Coordinating node acting as a router for "scatter and gather" operations, every part of ES is designed for scale. Understanding these internals isn't just academic: it directly impacts how you should design your index structures for your next project. Happy learning!