Back in 2016 or 2017, I first ran into a problem where our system just wasn't handling the load we were throwing at it. That was my first real "deep dive" into Elasticsearch (ES). I was looking for a way to distribute load across clusters, and while ES handled that load beautifully, it also opened my eyes to the complexity of how data is actually laid out under the hood. If you've ever wondered why your write patterns affect performance, or how ES manages to be so fast, you have to look at Lucene segments. Let's break down the internals.
The Anatomy: Shards, Indices, and Segments
To understand ES, you first have to understand its relationship with Lucene. In the ES world, an index is roughly what a table is in Postgres. Physically, though, an index is made up of shards. A shard is the functional, scalable unit of your data: a physical container that is, internally, a complete Lucene index. Because it is a full Lucene index, a shard can run searches on its own, without needing extra metadata from anywhere else. Going one level deeper, a Lucene index is made up of multiple segments. For Lucene, the segment is the most atomic, granular unit of the data store.
┌───────────────────────────────────────────────────────────────────────────────
│                        ELASTICSEARCH SHARD ARCHITECTURE
├───────────────────────────────────────────────────────────────────────────────
│ 1. The Shard Unit (Physical Container)
│    [Storage]          ───►  A directory on disk containing Lucene files
│    [Immutability]     ───►  Composed of multiple immutable Segments (.seg)
│    [Scaling]          ───►  Smallest unit moved during Cluster Rebalancing
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. Primary Shard (The Writer)
│    [Write Path]       ───►  Validates request ──► Buffers ──► Syncs Replicas
│    [Sequencing]       ───►  Assigns sequence numbers for consistency
│    [Status]           ───►  One per shard group; must be active for writes
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. Replica Shards (The Readers)
│    [Redundancy]       ───►  Exact copies of the Primary on different nodes
│    [Read Throughput]  ───►  Parallelizes search queries across multiple nodes
│    [Failover]         ───►  Promoted to Primary if the original node fails
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. Internal Shard Components
│    [Inverted Index]   ───►  Text search engine (Terms ──► Docs)
│    [BKD Tree]         ───►  Numeric / geospatial index
│    [Global Checkpt]   ───►  Tracks synchronization state between replicas
│    [Translog]         ───►  Local WAL for recovering uncommitted segments
└───────────────────────────────────────────────────────────────────────────────
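If you want to see this hierarchy on a real cluster, the _cat/segments API lists every Lucene segment sitting behind each shard. Here is a minimal sketch using Python and the requests library, assuming a local, unsecured dev cluster at http://localhost:9200 and a made-up products index:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Create an index with 3 primary shards, each with 1 replica.
    requests.put(f"{ES}/products", json={
        "settings": {"number_of_shards": 3, "number_of_replicas": 1}
    }).raise_for_status()

    # Index one document; refresh=true forces a segment to be created now.
    requests.put(f"{ES}/products/_doc/1?refresh=true",
                 json={"name": "ssd", "price": 99}).raise_for_status()

    # Each row of the output is one immutable Lucene segment inside one shard.
    print(requests.get(f"{ES}/_cat/segments/products?v").text)

Every row you get back lives inside a shard directory on some data node: the shard is what ES moves around the cluster, and the segments are what Lucene actually reads.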
Multiple Representations of Data
One of the coolest things about Lucene is that it doesn't just store your data once. It stores it in multiple formats within a segment to support different query types:
- Inverted Index: the classic search-engine structure used to find terms across documents.
- Doc Values: a columnar store used for aggregations (like calculating totals or bucketing data).
- BKD Trees: k-dimensional trees used for numeric, geospatial, and other multidimensional searches.
┌───────────────────────────────────────────────────────────────────────────────
│                           LUCENE SEGMENT (Immutable)
├───────────────────────────────────────────────────────────────────────────────
│ 1. Inverted Index (Search Core)
│    [Term Dictionary]     ───►  [Term Index (FST)]
│    [Postings Lists]      ───►  {DocID, TermFreq, Positions, Offsets, Payloads}
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. Stored Fields (Document Storage)
│    [Field Index (.fdx)]  ───►  Pointer to Document Row
│    [Field Data (.fdt)]   ───►  {Field1, Field2, ...} (Row-based storage)
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. DocValues (Columnar Storage)
│    [Field A]             ───►  [Val 1, Val 2, Val 3, ...] (Optimized for Sorting/Aggr)
│    [Field B]             ───►  [Val 1, Val 2, Val 3, ...]
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. Metadata & Auxiliary Structures
│    [Term Vectors]        ───►  Per-document Inverted Index
│    [Norms]               ───►  Normalization factors for Scoring
│    [Live Documents]      ───►  Bitset for Deletions (.del file)
│    [Points/Dimensions]   ───►  BKD Tree for Numeric/Geo spatial data
└───────────────────────────────────────────────────────────────────────────────
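The mapping decides which of these structures each field lands in, and the query decides which structure gets read. Here is a hedged sketch, reusing the same local-cluster assumption and a hypothetical shops index:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Each field type below is served by a different per-segment structure.
    requests.put(f"{ES}/shops", json={
        "mappings": {
            "properties": {
                "description": {"type": "text"},      # inverted index
                "category":    {"type": "keyword"},   # doc values (columnar)
                "location":    {"type": "geo_point"}  # BKD tree (points)
            }
        }
    }).raise_for_status()

    # The match query walks the inverted index; the terms aggregation
    # reads the matching documents back out of the columnar doc values.
    resp = requests.post(f"{ES}/shops/_search", json={
        "query": {"match": {"description": "coffee"}},
        "aggs":  {"by_category": {"terms": {"field": "category"}}},
        "size": 0
    })
    print(resp.json())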
The Power of Immutability
In most databases, if you update a row, the engine modifies it in place. Lucene does things differently: segments are immutable. When you update a document, Lucene doesn't change the old one. Instead, it performs an append-only operation: it marks the old document as deleted in a bitset and indexes the updated version into a new segment. Because new segments are constantly being created, Lucene merges segments in the background. This keeps the number of segments from exploding; since every search has to fan out across all live segments, an unbounded pile of small segments would make the parallelism required for searching too resource-intensive.
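You can watch this from the outside. The sketch below (same local-cluster assumption, an illustrative orders index) overwrites a document and then reads the docs.deleted column from _cat/segments, which counts copies that are only tombstoned:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # First version of the document lands in a new segment.
    requests.put(f"{ES}/orders/_doc/1?refresh=true", json={"status": "pending"})

    # The "update" writes a fresh copy into another segment and merely
    # flips the old copy's bit in the live-documents bitset.
    requests.put(f"{ES}/orders/_doc/1?refresh=true", json={"status": "shipped"})

    # docs.deleted > 0 shows the tombstoned copy still parked in the old segment.
    print(requests.get(
        f"{ES}/_cat/segments/orders?v&h=segment,docs.count,docs.deleted").text)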
You might wonder: if Lucene is reshuffling data in the background during a merge, how do searches stay consistent? Lucene uses reference counting. When a query starts, it pins exactly the set of segments it needs to touch. If a merge completes mid-query, the old segment files stay on disk as a kind of "shadow" copy: the active query finishes against those old segments, while new queries are directed to the freshly merged segment. Once the reference count on an old segment drops to zero, its files are finally deleted.
┌───────────────────────────────────────────────────────────────────────────────
│                       LUCENE SEGMENT MERGING (Maintenance)
├───────────────────────────────────────────────────────────────────────────────
│ 1. The Trigger (Merge Policy)
│    [Tiered Policy]   ───►  Monitors segment count, sizes, and delete %
│    [Threshold]       ───►  Triggered when too many small segments accumulate
│    [Goal]            ───►  Maintain a logarithmic number of segments
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. The Selection Phase
│    [Candidates]      ───►  Picks N small segments (often similar in size)
│    [Exclusions]      ───►  Extremely large segments are often left alone
│    [Deletions]       ───►  Prioritizes segments with many "marked" deletes
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. The Execution (Compact & Purge)
│    [New Segment]     ───►  A fresh, larger segment is built from candidates
│    [Data Transfer]   ───►  Re-indexes Inverted Index, BKD Trees, & DocValues
│    [Purge]           ───►  Documents marked in .del files are NOT copied
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. The Switchover (Atomic Commit)
│    [Warm-up]         ───►  New large segment is fsync'd and opened
│    [Atomic Swap]     ───►  Shard metadata updates to point to the new segment
│    [Cleanup]         ───►  Old small segment files are deleted from disk
└───────────────────────────────────────────────────────────────────────────────
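Normally you leave all of this to the tiered merge policy, but the _forcemerge API triggers the same compaction by hand, which can be handy for indices that no longer receive writes. A sketch, continuing with the hypothetical orders index from above:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Collapse each shard down to (at most) one segment and physically
    # drop every document that was only marked as deleted.
    requests.post(f"{ES}/orders/_forcemerge?max_num_segments=1").raise_for_status()

    # After the merge, docs.deleted should read 0 for the surviving segment.
    print(requests.get(
        f"{ES}/_cat/segments/orders?v&h=segment,docs.count,docs.deleted").text)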
The Write Path: Translog vs. Memory Buffer
When you write to ES, the document takes two parallel paths to ensure both speed and durability:
- Translog (Write Ahead Log): an append-only record written directly to disk. Even if the node crashes before the data has been committed into a Lucene segment, the writes can be replayed from the translog.
- Internal Memory Buffer: simultaneously, the document is added to an in-memory indexing buffer. On the next refresh this buffer becomes a searchable segment, so the data is queryable almost immediately, well before it is durably committed to a disk-based Lucene segment.
┌───────────────────────────────────────────────────────────────────────────────
│                      ELASTICSEARCH WRITE PATH (Logical Flow)
├───────────────────────────────────────────────────────────────────────────────
│ 1. Ingestion Point (Primary Shard)
│    [Document Input] ──┬──►  [In-Memory Indexing Buffer]
│                       └──►  [Translog (Transaction Log / WAL)]
│
├───────────────────────────────────────────────────────────────────────────────
│ 2. The Refresh (Searchability - Default: 1s)
│    [Indexing Buffer]   ───►  [New Lucene Segment Creation]
│    [Segment Files]     ───►  [OS Filesystem Cache (RAM)]
│    [Status]            ───►  DATA BECOMES SEARCHABLE
│
├───────────────────────────────────────────────────────────────────────────────
│ 3. The Flush (Durability - Default: 30m or 512MB)
│    [FS Cache Segments] ───►  [Lucene Commit (fsync to Physical Disk)]
│    [Translog Status]   ───►  [Purge/Trim Old Log Entries]
│    [Status]            ───►  DATA IS HARD-PERSISTED
│
├───────────────────────────────────────────────────────────────────────────────
│ 4. Background Maintenance
│    [Merge Policy]      ───►  Combine Small Segments (.seg) into Large Ones
│    [Cleanup]           ───►  Reclaim space from Deleted Docs (.del markers)
│    [Structure]         ───►  Build/Update BKD Trees & Inverted Index
└───────────────────────────────────────────────────────────────────────────────
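Both stages are tunable, and both can be invoked by hand. The sketch below assumes the same local cluster and an illustrative logs index; it relaxes the refresh interval for a write-heavy workload, then shows the manual _refresh and _flush calls that correspond to steps 2 and 3 above:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Refresh every 30s instead of every 1s: fewer, larger in-memory
    # segments, at the cost of data taking up to 30s to become searchable.
    requests.put(f"{ES}/logs", json={
        "settings": {"index": {"refresh_interval": "30s"}}
    }).raise_for_status()

    # refresh=wait_for blocks until the next refresh makes this document visible.
    requests.post(f"{ES}/logs/_doc?refresh=wait_for",
                  json={"msg": "payment accepted"}).raise_for_status()

    # Manual equivalents of the two stages: _refresh cuts a new searchable
    # segment; _flush performs the Lucene commit and trims the translog.
    requests.post(f"{ES}/logs/_refresh").raise_for_status()
    requests.post(f"{ES}/logs/_flush").raise_for_status()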
Practical Takeaways: Denormalisation vs. Normalisation
In a recent project, we were storing data in a denormalised format, with everything in one document. This is fantastic for read performance because the entire block is fetched at once. However, if you have large documents (e.g., 1 MB) that update frequently, you'll put massive pressure on JVM memory and disk I/O, because every small update re-indexes the whole 1 MB document into a new segment. In those cases, a normalised format or a parent-child relationship might save your memory, though you'll pay a 5x to 10x cost in query performance, because you now have to fire multiple queries and correlate the results.
┌─────────────────────────────────────────────────────────────────────────────────────
│              ELASTICSEARCH RELATIONSHIP PERFORMANCE BENCHMARK (2026)
├─────────────────────────────────────────────────────────────────────────────────────
│ READ SPEED (Query Throughput)
│ FAST ─────────────────────────────────────────────────────────────────────────► SLOW
│
│    DENORMALIZED              NESTED FIELDS             JOIN RELATION
│    (Flat Documents)          (Hidden Sub-Docs)         (Parent-Child)
│    [1x]                      [2x - 5x]                 [5x - 10x+]
│
├─────────────────────────────────────────────────────────────────────────────────────
│ WRITE SPEED (Indexing Latency)
│ FAST ─────────────────────────────────────────────────────────────────────────► SLOW
│
│    JOIN RELATION             DENORMALIZED              NESTED FIELDS
│    (Independent Docs)        (Full Doc Update)         (Mapping Bloat)
└─────────────────────────────────────────────────────────────────────────────────────
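For completeness, this is roughly what the three modelling options look like as mappings. It is only an illustrative sketch (local cluster assumed, index names made up), not a recommendation for any particular schema:

    import requests

    ES = "http://localhost:9200"  # assumed local dev cluster, no auth

    # Option 1: denormalised - the order document carries a full copy
    # of the customer fields it needs.
    requests.put(f"{ES}/orders_flat", json={
        "mappings": {"properties": {
            "customer_name": {"type": "keyword"},
            "customer_tier": {"type": "keyword"},
            "amount":        {"type": "double"}
        }}
    })

    # Option 2: nested - line items become hidden sub-documents of the order.
    requests.put(f"{ES}/orders_nested", json={
        "mappings": {"properties": {
            "items": {"type": "nested", "properties": {
                "sku": {"type": "keyword"},
                "qty": {"type": "integer"}
            }}
        }}
    })

    # Option 3: join - customer and order are independent documents linked
    # by a parent-child relation that is resolved at query time.
    requests.put(f"{ES}/orders_joined", json={
        "mappings": {"properties": {
            "relation": {"type": "join",
                         "relations": {"customer": "order"}}
        }}
    })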
Conclusion
Whether it's the Master node managing cluster state, the Data node holding your segments, or the Coordinating node acting as a router for "scatter and gather" operations, every part of ES is designed for scale. Understanding these internals isn't just academic: it directly impacts how you should design your index structures for your next project. Happy learning!