You have a three-node database cluster running in production. A network switch fails and the nodes can no longer talk to each other. Both sides of the partition think the other side is dead. Both promote themselves to primary. Both start accepting writes. When the network heals, you have two divergent histories of your data and no automatic way to reconcile them. This is split brain, and it has caused real outages at every scale, from startups to GitHub to major services in AWS us-east-1.
The Problem
Distributed databases replicate data across multiple nodes to survive hardware failures. In normal operation, one node (the primary, leader, or master) accepts writes and replicates them to followers. If the primary fails, a follower takes over. This is straightforward when failures are clean: the primary crashes, followers detect it, one gets elected.
The problem is that real failures are not clean. Network partitions do not announce themselves. From node A's perspective, node B might be dead, unreachable, or perfectly healthy but separated by a failed switch. Node A cannot tell the difference. If both sides of a partition independently decide "I am the primary now," you get split brain: two nodes accepting conflicting writes to the same data.
The consequences are severe. Imagine two clients updating the same bank account balance on two different primaries. When the partition heals, which balance is correct? The answer is neither, both, or "it depends on your conflict resolution strategy," none of which inspire confidence in a financial system.
Split brain is not a theoretical concern. It is the failure mode that motivates most of the complexity in distributed consensus protocols.
Prerequisites
- Understanding of primary/replica (leader/follower) replication at a conceptual level
- Basic familiarity with what a network partition is
- Awareness of CAP theorem helps but is not required
- Knowing what a distributed consensus protocol does (not how it works internally)
Technical Decisions
Why Not Just Use Timeouts?
The naive failure detector is a heartbeat with a timeout: if the primary does not respond within N seconds, declare it dead and elect a new one. This is exactly what causes split brain.
The primary might be alive but slow. A garbage collection pause, a saturated network link, or a CPU spike can all cause heartbeat delays without the node actually failing. If the timeout is too aggressive, you get false positives: a new primary is elected while the old one is still running and accepting writes. If the timeout is too conservative, you get long periods of unavailability while the system waits to be sure the primary is gone.
There is no timeout value that eliminates split brain. The fundamental issue is that failure detection in an asynchronous distributed system is inherently uncertain. This intuition underlies the FLP impossibility result: in an asynchronous system, a crashed process cannot be distinguished from a merely slow one, which is why no deterministic consensus protocol can guarantee progress if even a single process may fail.
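The trade-off can be seen in a toy failure detector (a sketch; the gap values and the `detector_verdicts` name are illustrative, not any real library's API):

```python
def detector_verdicts(heartbeat_gaps, timeout):
    """Toy heartbeat failure detector: declares the primary dead whenever
    the gap since the last heartbeat exceeds the timeout. A GC pause looks
    identical to a crash, so some "dead" verdicts are false positives."""
    return ["dead" if gap > timeout else "alive" for gap in heartbeat_gaps]

# Gaps (seconds) between heartbeats: the 12s gap is a GC pause, not a crash.
gaps = [1, 1, 12, 1, 1]

# An aggressive timeout misdiagnoses the pause and triggers a failover:
assert detector_verdicts(gaps, timeout=5) == \
    ["alive", "alive", "dead", "alive", "alive"]

# A conservative timeout avoids the false positive, but a real crash would
# now go undetected for up to 15 seconds of unavailability:
assert detector_verdicts(gaps, timeout=15) == ["alive"] * 5
```

No choice of `timeout` fixes both cases at once, which is the point of the section above.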
Why Quorum-Based Approaches Win
The key insight that prevents split brain is: do not let any single node make unilateral decisions. Instead, require a majority (quorum) of nodes to agree before any state change takes effect.
In a cluster of N nodes, a quorum is floor(N/2) + 1. For a three-node cluster, the quorum is 2. For five nodes, it is 3. The critical property: if the cluster splits into two groups, at most one group can contain a majority. The minority side literally cannot form a quorum, so it cannot elect a leader or commit writes.
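Both the quorum arithmetic and the overlap property it guarantees can be checked directly; a minimal Python sketch:

```python
from itertools import combinations

def quorum(n: int) -> int:
    """Minimum group size that constitutes a majority of n nodes."""
    return n // 2 + 1

assert quorum(3) == 2
assert quorum(5) == 3
assert quorum(2) == 2  # both nodes required: no tolerance for a partition

# The critical property: any two majorities of the same cluster share at
# least one node, so two disjoint groups can never both hold a quorum.
nodes = {"A", "B", "C", "D", "E"}
q = quorum(len(nodes))
for g1 in combinations(nodes, q):
    for g2 in combinations(nodes, q):
        assert set(g1) & set(g2), "two majorities always overlap"
```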
This is why distributed databases run on odd numbers of nodes. A two-node cluster has a quorum of 2, meaning both nodes must agree for anything to happen. A network partition between them brings the entire cluster down, which defeats the purpose. A three-node cluster only needs two nodes, so it survives a single-node failure or partition.
Implementation
Phase 1: How Split Brain Actually Happens
Let us trace through the failure step by step with a three-node PostgreSQL cluster using streaming replication and a failover manager like Patroni.
Normal operation:
Node A (primary) ──replication──> Node B (sync replica)
                 ──replication──> Node C (async replica)
Clients ──writes──> Node A
A network partition isolates Node A from Nodes B and C, but B and C can still talk to each other:
After partition:
[Partition A]            |   [Partition B]
Node A (thinks it's      |   Node B ──── Node C
still primary)           |   (elect new primary?)
                         |
Clients on this side     |   Clients on this side
still writing to A       |   can't reach A
Node B and Node C detect that Node A's heartbeat has stopped. After the configured timeout, they initiate a leader election. Node B wins. It promotes itself to primary and starts accepting writes.
Meanwhile, Node A has no idea this happened. It never received a "you are no longer primary" message because the network is partitioned. It continues accepting writes from clients on its side of the partition.
Now both Node A and Node B are accepting writes. You have split brain.
Split brain state:
Node A: INSERT INTO orders (id, amount) VALUES (1001, 50.00);
Node B: INSERT INTO orders (id, amount) VALUES (1001, 99.00);
Same primary key, different data. Which one is right?
Phase 2: Prevention with Fencing
The first line of defense is fencing: ensuring the old primary cannot accept writes after a new primary is elected. There are several mechanisms:
STONITH (Shoot The Other Node In The Head)
The most aggressive approach. When the new primary is elected, it sends a hardware-level command to power off the old primary. This is common in traditional HA clusters using Pacemaker/Corosync.
# Pacemaker fencing agent example
stonith_admin --fence node-a
# This physically powers off node-a via IPMI/iLO/DRAC
STONITH is effective but brutal. It requires out-of-band management hardware (IPMI, iLO) and introduces its own failure modes: what if the fencing command itself fails to reach the old primary?
Fencing Tokens (Logical Fencing)
A more elegant approach used by systems like ZooKeeper and etcd. Each leader election produces a monotonically increasing token (epoch number, term number, or lease version). Every write request must include the current fencing token. Storage systems reject writes with stale tokens.
Leader election 1: Node A gets token 42
Leader election 2: Node B gets token 43
Node A sends: WRITE(key=balance, value=50, token=42)
Storage sees token 42 < current token 43 → REJECTED
Node B sends: WRITE(key=balance, value=99, token=43)
Storage sees token 43 = current token 43 → ACCEPTED
This works even if Node A is still alive and thinks it is the primary. The storage layer enforces the invariant that only the most recently elected leader's writes are accepted. The old leader's writes are simply rejected.
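A toy storage layer makes the token check concrete (illustrative only; the `FencedStorage` name is invented here, and real systems such as ZooKeeper or etcd persist the token durably):

```python
class FencedStorage:
    """Toy storage layer that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, key, value, token):
        if token < self.highest_token:
            return "REJECTED"       # token from an old election: stale leader
        self.highest_token = token  # remember the newest token seen
        self.data[key] = value
        return "ACCEPTED"

storage = FencedStorage()
# The new leader (token 43) writes first, so storage learns the newer token;
# the old leader's write (token 42) then arrives and is fenced out.
assert storage.write("balance", 99, token=43) == "ACCEPTED"
assert storage.write("balance", 50, token=42) == "REJECTED"
assert storage.data["balance"] == 99
```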
Lease-Based Fencing
The primary holds a time-limited lease. It can only accept writes while the lease is valid. To renew the lease, it must contact a quorum. If it is partitioned from the quorum, its lease expires and it stops accepting writes.
Timeline:
T=0: Node A acquires lease (valid for 10s)
T=5: Network partition happens
T=8: Node A tries to renew lease, cannot reach quorum
T=10: Lease expires, Node A stops accepting writes
T=12: Node B acquires new lease from quorum, becomes primary
The gap between T=10 and T=12 is intentional unavailability. The system chooses to be unavailable rather than risk split brain. This is the CP side of the CAP theorem in practice.
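The lease rule above, serve writes only while a quorum-granted lease is live, might look like this (a toy model with an injectable clock; the `LeaseHolder` name is invented, and the 10-second lease matches the timeline):

```python
import time

class LeaseHolder:
    """Sketch of lease-based fencing: the node may serve writes only while
    its quorum-granted lease is unexpired."""

    LEASE_SECONDS = 10

    def __init__(self, now=time.monotonic):
        self.now = now            # clock function, injectable for testing
        self.lease_expiry = 0.0

    def renew(self, quorum_reachable: bool) -> bool:
        if quorum_reachable:
            self.lease_expiry = self.now() + self.LEASE_SECONDS
            return True
        return False              # partitioned from quorum: lease will lapse

    def can_accept_writes(self) -> bool:
        return self.now() < self.lease_expiry

# Replay the timeline with a fake clock.
clock = [0.0]
node_a = LeaseHolder(now=lambda: clock[0])
node_a.renew(quorum_reachable=True)               # T=0: lease valid until T=10
clock[0] = 8.0
assert not node_a.renew(quorum_reachable=False)   # T=8: partition, renewal fails
assert node_a.can_accept_writes()                 # lease not yet expired
clock[0] = 10.5
assert not node_a.can_accept_writes()             # T=10: lease lapsed, writes stop
```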
Phase 3: Prevention with Consensus Protocols
Modern distributed databases avoid split brain by design using consensus protocols. The two most widely deployed are Raft and Multi-Paxos.
Raft (used by etcd, CockroachDB, TiKV, Consul)
In Raft, every write must be replicated to a majority of nodes before it is considered committed. A leader that is partitioned from the majority cannot commit any writes because it cannot get quorum acknowledgment.
Normal write in Raft (3-node cluster):
Client ──write──> Leader (Node A)
Node A ──AppendEntries──> Node B ✓ (ACK)
Node A ──AppendEntries──> Node C ✓ (ACK)
Quorum reached (2/3 including leader): COMMIT
After partition (A isolated):
Client ──write──> Leader (Node A)
Node A ──AppendEntries──> Node B ✗ (unreachable)
Node A ──AppendEntries──> Node C ✗ (unreachable)
Cannot reach quorum: WRITE BLOCKS / TIMES OUT
Meanwhile, B and C elect a new leader with a higher term:
Node B becomes leader (term 2)
Node B can reach Node C → quorum of 2/3 → writes succeed
When the partition heals, Node A discovers that a new leader with a higher term exists. It steps down, discards any uncommitted entries in its log, and replicates from Node B. No split brain, by construction.
The critical invariant in Raft is: a leader must have been elected by a majority, and every committed entry must be stored on a majority. Since any two majorities overlap in at least one node, a new leader is guaranteed to know about all previously committed entries.
Multi-Paxos (used by Google Spanner, variations in many systems)
Multi-Paxos works on a similar quorum principle but separates the concern differently. A leader is elected via a Paxos round (Phase 1), and then can issue writes without repeating Phase 1 for each operation (Phase 2 only). If the leader is partitioned, its Phase 2 messages will not reach a quorum, and a new leader will be elected via a new Phase 1 round with a higher ballot number.
The math is the same: two quorums always overlap, so you cannot have two leaders that can both commit writes.
Phase 4: Recovery After Split Brain
Despite all prevention mechanisms, split brain can still happen in practice, especially in systems that prioritize availability over consistency (AP systems). When it does, you need a recovery strategy.
Last-Writer-Wins (LWW)
The simplest approach: attach a timestamp to every write, and when conflicts are detected, keep the write with the latest timestamp.
Node A: SET balance = 50 (timestamp: 1711872000001)
Node B: SET balance = 99 (timestamp: 1711872000002)
After merge: balance = 99 (higher timestamp wins)
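The merge rule is a one-liner, which is both its appeal and its danger (a sketch; timestamps and values are illustrative):

```python
def lww_merge(a, b):
    """Last-writer-wins: keep the (value, timestamp) pair with the larger
    timestamp. The losing write is silently discarded."""
    return a if a[1] >= b[1] else b

write_a = (50.00, 1711872000001)  # Node A's balance write
write_b = (99.00, 1711872000002)  # Node B's balance write

value, ts = lww_merge(write_a, write_b)
assert value == 99.00  # Node A's write is gone, with no record it ever existed
```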
This is simple but dangerous. It silently discards writes. If Node A processed a deposit and Node B processed a withdrawal, you just lost the deposit. DynamoDB and Cassandra both support LWW, but the documentation is very clear about the trade-off.
Clock skew makes LWW even worse. If Node A's clock is ahead, its writes always win regardless of when they actually happened. This is why Spanner uses TrueTime (GPS-synchronized clocks with bounded uncertainty) instead of relying on system clocks.
CRDTs (Conflict-free Replicated Data Types)
CRDTs are data structures designed so that concurrent updates can always be merged without conflicts. A G-Counter (grow-only counter), for example, tracks increments per node and sums them on read:
Node A counter: {A: 5, B: 0} (Node A saw 5 increments)
Node B counter: {A: 0, B: 3} (Node B saw 3 increments)
Merged: {A: 5, B: 3} → total = 8
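A G-Counter is small enough to sketch in full (illustrative, not a production CRDT library):

```python
class GCounter:
    """Grow-only counter CRDT: per-node increment counts that merge by
    taking the per-node maximum, so no increment is ever lost and merging
    in any order yields the same result."""

    def __init__(self):
        self.counts = {}

    def increment(self, node_id, n=1):
        self.counts[node_id] = self.counts.get(node_id, 0) + n

    def merge(self, other):
        merged = GCounter()
        for node in self.counts.keys() | other.counts.keys():
            merged.counts[node] = max(self.counts.get(node, 0),
                                      other.counts.get(node, 0))
        return merged

    def value(self):
        return sum(self.counts.values())

a, b = GCounter(), GCounter()
a.increment("A", 5)   # Node A saw 5 increments during the partition
b.increment("B", 3)   # Node B saw 3
assert a.merge(b).value() == 8   # lossless merge
assert b.merge(a).value() == 8   # and order-independent
```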
No data is lost, but CRDTs only work for data structures that have a natural merge operation. A counter merges easily. A bank account balance does not, because you need to enforce constraints (balance >= 0) that require coordination.
Riak was the most prominent database to build around CRDTs. Redis Enterprise also supports CRDT-based conflict resolution in its Active-Active geo-replication.
Application-Level Resolution
Some systems punt the problem to the application. CouchDB stores all conflicting revisions and lets the application decide which one to keep. This is maximally flexible but puts the burden on the developer, and in practice many applications simply pick a winner arbitrarily, which is LWW with extra steps.
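A sketch of the pattern (the revision format loosely mimics CouchDB's `_rev` field; the `pick` callback and the data are hypothetical):

```python
def resolve_conflict(revisions, pick):
    """CouchDB-style application resolution (sketch): the store keeps every
    conflicting revision; the application supplies `pick` to choose a winner.
    Defaulting `pick` to max-by-timestamp is exactly LWW with extra steps."""
    winner = pick(revisions)
    losers = [r for r in revisions if r is not winner]
    return winner, losers  # losers must be explicitly deleted, not forgotten

revisions = [
    {"_rev": "2-a", "balance": 50, "ts": 1},  # hypothetical rev from Node A
    {"_rev": "2-b", "balance": 99, "ts": 2},  # hypothetical rev from Node B
]
winner, losers = resolve_conflict(
    revisions, pick=lambda rs: max(rs, key=lambda r: r["ts"]))
assert winner["_rev"] == "2-b"
assert [r["_rev"] for r in losers] == ["2-a"]
```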
Phase 5: Rollback Mechanics After Split Brain
Conflict resolution picks a winner. Rollback is the harder problem: undoing the loser's writes without corrupting the data that survived. The mechanics differ significantly between systems.
Raft Log Truncation
In Raft-based systems, rollback is baked into the protocol. When a partitioned leader (Node A, term 1) rejoins the cluster and discovers a new leader (Node B, term 2), it compares logs. Any entries in Node A's log that are not present in Node B's log (the authoritative leader) are uncommitted by definition, because they never reached a quorum. Node A truncates its log back to the point where it diverges from Node B's log, then replays Node B's entries forward.
Node A log (stale leader, term 1):
[1:1] [1:2] [1:3] [1:4] [1:5]
                  ↑ diverges here
Node B log (current leader, term 2):
[1:1] [1:2] [1:3] [2:1] [2:2] [2:3]
After rollback on Node A:
[1:1] [1:2] [1:3] [2:1] [2:2] [2:3]
                  ↑ entries [1:4] and [1:5] are discarded
The key safety property: entries [1:4] and [1:5] were never committed (never ACKed to clients as durable), so discarding them does not violate any promise the system made. Clients that sent those writes received timeouts or errors, not success responses. This is why Raft-based systems only acknowledge a write after quorum replication, never before.
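The truncate-and-adopt step can be sketched as a pure function over logs of (term, value) entries (illustrative, not the actual Raft RPC flow):

```python
def reconcile_log(stale_log, leader_log):
    """Raft-style log reconciliation sketch: find the first index where the
    stale leader's log diverges from the current leader's, discard the stale
    suffix (never committed -- it lacked quorum), and adopt the leader's
    entries. Entries are (term, value) pairs."""
    i = 0
    while (i < len(stale_log) and i < len(leader_log)
           and stale_log[i] == leader_log[i]):
        i += 1
    discarded = stale_log[i:]
    return leader_log[:], discarded

node_a = [(1, "x"), (1, "y"), (1, "z"), (1, "p"), (1, "q")]  # stale, term 1
node_b = [(1, "x"), (1, "y"), (1, "z"), (2, "r"), (2, "s"), (2, "t")]

new_log, dropped = reconcile_log(node_a, node_b)
assert new_log == node_b                    # Node A now mirrors the leader
assert dropped == [(1, "p"), (1, "q")]      # the uncommitted term-1 entries
```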
In CockroachDB, this log truncation happens at the Raft layer, but there is an additional concern: those uncommitted writes may have partially applied side effects in the storage engine (RocksDB/Pebble). CockroachDB handles this with its MVCC (multi-version concurrency control) layer. Uncommitted writes exist as intents, which are cleaned up during the rollback process. No committed data is affected.
PostgreSQL: Timeline Divergence and pg_rewind
PostgreSQL does not use a consensus protocol for replication. When split brain happens in a PostgreSQL HA cluster (two nodes both acting as primary), the divergence is at the WAL (write-ahead log) level. Both nodes generated WAL records from the same starting point but with different content.
After the partition heals, the old primary cannot simply reconnect as a replica. Its WAL has diverged: it has data pages on disk that reflect writes the new primary never saw. You have three options:
- Rebuild from scratch: pg_basebackup the entire database from the new primary. Safe but slow, especially for large databases (hours at terabyte scale).
- pg_rewind: A targeted rollback tool. It reads the new primary's WAL to find the exact point of divergence, then copies only the changed data pages from the new primary to the old one. The old primary's divergent WAL is discarded.
# On the old primary (Node A), after it has been stopped:
pg_rewind --target-pgdata=/var/lib/postgresql/data \
--source-server="host=node-b port=5432 user=rewind_user"
# pg_rewind does:
# 1. Finds the timeline divergence point in the WAL
# 2. Reads all WAL records on the new primary since divergence
# 3. Identifies which data pages were modified
# 4. Copies those pages from the new primary to the old one
# 5. Old primary can now start as a replica of Node B
The critical requirement for pg_rewind is that wal_log_hints or data_checksums must be enabled. Without these, pg_rewind cannot reliably identify which pages changed. Patroni enables wal_log_hints by default for exactly this reason.
- Manual WAL inspection: In the worst case, a DBA can use pg_waldump to inspect the divergent WAL records on both sides, identify what writes were lost, and manually reconcile them. This is a last resort, but it is sometimes the only option when the lost writes had real-world side effects (emails sent, payments initiated).
# Inspect divergent WAL on the old primary
pg_waldump /var/lib/postgresql/data/pg_wal/000000020000000000000042 \
--start=0/4200000 --end=0/4300000
# Output shows individual record types:
# rmgr: Heap len: 54 tx: 1234 INSERT off 3 blk 0: rel 1663/16384/16385
# rmgr: Btree len: 64 tx: 1234 INSERT_LEAF off 42 blk 0: rel 1663/16384/16389
MySQL/MariaDB with GTID-Based Rollback
MySQL's Global Transaction Identifiers (GTIDs) make divergence detection straightforward. Each transaction gets a unique ID in the format server_uuid:sequence_number. After split brain, the two primaries have GTID sets that diverged:
Node A GTID set: uuid-a:1-100, uuid-b:1-50
(Node A originated transactions 1-100, replicated B's 1-50 before split)
Node B GTID set: uuid-a:1-80, uuid-b:1-70
(Node B only saw A's first 80, then originated its own 51-70)
Divergent on Node A: uuid-a:81-100 (writes A made during partition)
Divergent on Node B: uuid-b:51-70 (writes B made during partition)
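The set arithmetic is mechanical once the GTID sets are expanded (a sketch; real MySQL keeps the compact interval notation rather than explicit sets):

```python
def gtid_divergence(set_a, set_b):
    """Compute which transactions each node has that the other lacks.
    GTID sets are modeled as {server_uuid: set_of_sequence_numbers}."""
    def diff(x, y):
        return {uuid: x[uuid] - y.get(uuid, set())
                for uuid in x if x[uuid] - y.get(uuid, set())}
    return diff(set_a, set_b), diff(set_b, set_a)

node_a = {"uuid-a": set(range(1, 101)), "uuid-b": set(range(1, 51))}
node_b = {"uuid-a": set(range(1, 81)),  "uuid-b": set(range(1, 71))}

only_a, only_b = gtid_divergence(node_a, node_b)
assert only_a == {"uuid-a": set(range(81, 101))}  # A's writes during partition
assert only_b == {"uuid-b": set(range(51, 71))}   # B's writes during partition
```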
To roll back Node A and rejoin it as a replica of Node B, you need to undo transactions uuid-a:81-100. MySQL does not have a built-in "undo these GTIDs" command. The options are:
- mysqlbinlog with --exclude-gtids: Extract the divergent binlog events, inspect them, and hand-write reverse SQL statements to undo them. This is manual and error-prone.
- Clone plugin: MySQL 8.0+ can clone a fresh copy of the data from the new primary, similar to pg_basebackup. Faster than a full dump/restore but still requires downtime on the rejoining node.
- Group Replication automatic rollback: If you are using MySQL Group Replication (InnoDB Cluster) instead of async replication, the rejoining node automatically rolls back divergent transactions via the group_replication_applier channel. This is the closest MySQL gets to Raft-style automatic rollback.
Cassandra: Rollback by Convergence
Cassandra does not roll back in the traditional sense. As an AP system, it accepts that both sides of a split brain produced valid writes. Instead of picking a winner and discarding the loser, it converges through read repair and anti-entropy repair:
- Read repair: When a client reads a key, the coordinator queries multiple replicas. If they disagree, the most recent value (by timestamp) wins, and stale replicas are updated in the background.
- Anti-entropy repair (nodetool repair): A background process that compares Merkle trees of data ranges across replicas and synchronizes any differences.
The "rollback" in Cassandra is really "eventual overwrite." Old values are not explicitly undone. They are superseded by newer values during the repair process. Tombstones (deletion markers) ensure that deletes on one side of the partition are not undone by stale reads from the other side.
During partition:
Node A: DELETE FROM users WHERE id = 42; (tombstone at T=100)
Node B: SELECT * FROM users WHERE id = 42; → returns row (stale)
After partition heals + read repair:
Tombstone (T=100) > row's last write (T=90)
→ DELETE wins, row is removed from Node B
→ Without tombstones, the delete would be "resurrected"
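The timestamp-plus-tombstone rule can be sketched directly (illustrative; `None` stands in for a tombstone, and the `resolve` helper is invented here):

```python
def resolve(versions):
    """Timestamp-based convergence with tombstones: the newest version wins;
    if it is a tombstone, the row is gone. Versions are (timestamp, value)
    pairs where value None marks a delete."""
    ts, value = max(versions, key=lambda v: v[0])
    return value  # None means the delete won

# Node A deleted the row at T=100; Node B's last write was at T=90.
assert resolve([(100, None), (90, {"id": 42})]) is None   # tombstone wins

# If the tombstone had been garbage-collected, only the stale row would
# remain and the deleted data would "resurrect":
assert resolve([(90, {"id": 42})]) == {"id": 42}
```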
This is why Cassandra has gc_grace_seconds (default 10 days): tombstones must survive long enough for all replicas to see them. If a node is down for longer than gc_grace_seconds, tombstones may be garbage collected before that node sees them, and deleted data can reappear. This is one of the most common operational surprises in Cassandra.
How It All Fits Together
The defenses against split brain form layers:
Layer 1: Consensus Protocol (Raft, Paxos)
→ Prevents split brain by requiring quorum for all commits
→ A partitioned leader cannot commit writes
Layer 2: Fencing (tokens, leases, STONITH)
→ Prevents stale leaders from interacting with storage
→ Even if consensus has a bug, the storage layer rejects stale writes
Layer 3: Conflict Resolution (LWW, CRDTs, app-level merge)
→ Handles the aftermath if split brain occurs despite layers 1 and 2
→ Trade-offs between simplicity, correctness, and data loss
CP systems (etcd, ZooKeeper, CockroachDB, Spanner) invest heavily in layers 1 and 2 and aim to never reach layer 3. They accept temporary unavailability during partitions as the cost of avoiding split brain.
AP systems (Cassandra, DynamoDB, Riak) accept that split brain will happen during partitions and invest in layer 3. They remain available but require careful application design to handle conflicts.
The choice between these is not a technical one. It is a product decision: is it worse for your users to see stale or conflicting data, or to see an error page? For a shopping cart, stale data is fine. For a wire transfer, an error page is the only safe option.
Lessons Learned
Split brain is a spectrum, not a binary. Partial partitions, where some nodes can reach some but not all other nodes, create scenarios that are harder to reason about than a clean two-way split. The "Byzantine" failure modes (nodes lying about their state) are even harder. Most production systems only handle crash-stop failures and clean partitions.
Testing split brain is harder than preventing it. You can reason about Raft's correctness on paper, but you also need to verify that your specific implementation handles edge cases: clock skew, disk full, partial network failures, and leader elections during compaction. Tools like Jepsen have found split-brain bugs in almost every distributed database they have tested, including etcd, CockroachDB, and MongoDB.
Monitoring matters as much as prevention. If split brain does happen, fast detection limits the damage. Track metrics like the number of active leaders (should always be 0 or 1), replication lag across replicas, and fencing token monotonicity. Alert on any of these violating expectations.
Operator error causes more split brain than software bugs. Misconfigured timeouts, manual failovers without proper fencing, and "temporary" firewall rules that partition the cluster are far more common than actual consensus protocol bugs. The most common cause of split brain in PostgreSQL HA setups is someone manually promoting a replica without first shutting down the old primary.
What's Next
If you want to go deeper, Jepsen's analysis reports are the gold standard for understanding how real distributed databases handle (or fail to handle) partitions. The Raft paper by Ongaro and Ousterhout is surprisingly readable and covers the leader election and log replication mechanisms in enough detail to implement them. For a more formal treatment, Lamport's "Paxos Made Simple" is the canonical reference, though "simple" is doing heavy lifting in that title.
References
- In Search of an Understandable Consensus Algorithm (Raft paper)
- Paxos Made Simple, Leslie Lamport
- Jepsen: Distributed Systems Safety Research
- Designing Data-Intensive Applications, Martin Kleppmann, Chapter 8-9
- How to do distributed locking (Fencing tokens), Martin Kleppmann
- CockroachDB Architecture: Replication Layer
- Spanner: Google's Globally-Distributed Database