Distributed Systems: Consensus

Feb 9 · Mindli AI

Consensus is the mechanism that lets multiple machines act like one reliable system. In a distributed environment, individual nodes can crash, messages can be delayed or reordered, and networks can split. Without a disciplined way to agree on what happened and in what order, the system cannot safely replicate data, elect leaders, or coordinate actions. Agreement protocols solve that problem by ensuring that correct nodes converge on the same decision, even when parts of the system misbehave.

At a high level, consensus is about turning uncertainty into a single, durable outcome: one leader, one committed log entry, one configuration, one view of membership. The difficulty is not in computing the decision, but in doing so under partial failure and asynchronous communication.

What “Consensus” Means in Practice

A consensus protocol typically aims to provide three core properties:

  • Safety: nodes do not decide conflicting values. If one node commits entry X at position i, another correct node will not commit a different entry Y at that same position.
  • Liveness: the system eventually makes progress if the environment cooperates (for example, the network eventually delivers messages and a majority of nodes remain up).
  • Fault tolerance: the system continues operating despite failures. Most common protocols assume crash faults (a node stops), while some handle Byzantine faults (a node can lie, equivocate, or act maliciously).

In many real systems, consensus is used to implement state machine replication: each node runs the same deterministic state machine and applies the same ordered sequence of commands. If the sequence is identical, the resulting state is identical. Consensus provides the ordered log of commands.
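
As a concrete illustration, here is a minimal sketch of that idea in Go: a deterministic key-value state machine that applies commands from a replicated log, so any two replicas that apply the same log end in the same state. The types and names are illustrative, not taken from any particular system.

```go
// State machine replication sketch: every replica applies the same ordered
// log of commands to a deterministic key-value state machine, so identical
// logs yield identical states.
package main

import "fmt"

// Command is one entry in the replicated log.
type Command struct {
	Key   string
	Value string
}

// StateMachine is a deterministic key-value store.
type StateMachine struct {
	state map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: make(map[string]string)}
}

// Apply executes a single command; given the same command sequence,
// every replica ends up with the same state.
func (sm *StateMachine) Apply(cmd Command) {
	sm.state[cmd.Key] = cmd.Value
}

func main() {
	log := []Command{
		{"config/leader", "node-1"},
		{"config/leader", "node-2"}, // later entry overwrites earlier one
	}

	// Two replicas applying the same log converge to the same state.
	a, b := NewStateMachine(), NewStateMachine()
	for _, cmd := range log {
		a.Apply(cmd)
		b.Apply(cmd)
	}
	fmt.Println(a.state["config/leader"] == b.state["config/leader"]) // true
}
```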

Why Consensus Is Hard: Failures and Time

Distributed systems lack a shared clock and cannot reliably distinguish “slow” from “dead.” If a node does not respond, it might have crashed, or it might simply be partitioned behind a congested link. Any protocol must make decisions using imperfect information.

This reality shapes the standard assumptions behind consensus algorithms:

  • Messages can be delayed, duplicated, or reordered.
  • Nodes can crash and restart.
  • The system may experience network partitions.
  • Clocks drift and timeouts are heuristics, not truth.

Consensus algorithms accept these constraints and design around them, usually by requiring some form of quorum agreement.

Quorums and Majorities: The Backbone of Crash-Fault Consensus

Most crash-fault consensus protocols rely on majority quorums. In a cluster of N nodes, a majority is typically ⌊N/2⌋ + 1, i.e. more than half. Any two majorities intersect, which ensures that decisions are not made by disjoint groups that could later conflict.

That intersection property is the key to safety. If an entry is committed by a majority, at least one node that knows about that commitment will be present in any future majority, preventing the system from “forgetting” committed history.

This is why many replicated databases and coordination systems run with odd numbers of nodes (3, 5, 7): an even-sized cluster tolerates no more crashes than the odd-sized cluster one node smaller, so odd sizes give the best fault tolerance for the node count. A 3-node cluster can tolerate 1 crash, a 5-node cluster can tolerate 2, and so on.
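
The arithmetic behind these numbers is simple enough to sketch. The Go snippet below (function names are illustrative) computes the majority quorum size and the resulting crash tolerance for a few cluster sizes.

```go
// Majority-quorum arithmetic under crash faults: a majority of n nodes is
// floor(n/2)+1, and the cluster keeps a majority as long as at most
// floor((n-1)/2) nodes have crashed.
package main

import "fmt"

// majority returns the minimum quorum size for a cluster of n nodes.
func majority(n int) int { return n/2 + 1 }

// crashTolerance returns how many nodes can fail while a majority survives.
func crashTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 4, 5, 7} {
		fmt.Printf("n=%d quorum=%d tolerates=%d crashes\n",
			n, majority(n), crashTolerance(n))
	}
	// Note that n=4 tolerates only 1 crash, same as n=3, which is why
	// odd cluster sizes are the common choice.
}
```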

Paxos: The Canonical Consensus Algorithm

Paxos is often described as the “assembly language” of consensus: foundational, powerful, and easy to misunderstand at first glance. It solves the problem of agreeing on a single value (single-decree Paxos), and extensions like Multi-Paxos support agreeing on a sequence of values, which is what replicated logs require.

Paxos revolves around two roles and a disciplined voting process:

  • Proposers attempt to get a value chosen.
  • Acceptors vote on proposals and ensure safety by following strict rules.

The protocol uses numbered proposals and typically runs in two phases:

  1. Prepare/Promise: a proposer asks acceptors to promise not to accept proposals below a certain number.
  2. Accept/Accepted: the proposer asks acceptors to accept a value, constrained by any prior accepted values discovered in phase one.

The design ensures that once a value is chosen, no other conflicting value can later be chosen, even if leadership changes mid-flight. In practice, Multi-Paxos optimizes the common case by establishing a stable leader that can propose successive log entries without repeating the full two-phase handshake every time.
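
To make the two phases more concrete, here is a minimal sketch of the acceptor's rules for single-decree Paxos in Go, assuming unique, totally ordered proposal numbers and ignoring persistence and networking. The types and method names are illustrative, not taken from any real Paxos implementation.

```go
// Single-decree Paxos, acceptor side only: promise in phase 1, accept in
// phase 2, and never accept a proposal below a promise already made.
package main

import "fmt"

// Acceptor tracks the highest proposal it has promised and, if any,
// the highest-numbered proposal it has accepted so far (0 means "none").
type Acceptor struct {
	promised      int    // highest proposal number promised
	acceptedNum   int    // proposal number of the accepted value
	acceptedValue string // the accepted value itself
}

// Prepare handles phase 1: promise not to accept proposals numbered below n,
// and report any value already accepted so the proposer must respect it.
func (a *Acceptor) Prepare(n int) (ok bool, prevNum int, prevVal string) {
	if n <= a.promised {
		return false, 0, "" // already promised an equal or higher proposal
	}
	a.promised = n
	return true, a.acceptedNum, a.acceptedValue
}

// Accept handles phase 2: accept (n, v) only if no higher promise was made.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedNum = n
	a.acceptedValue = v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)
	fmt.Println(ok, acc.Accept(1, "leader=node-1")) // true true

	// A stale proposer with a lower number is rejected in both phases.
	ok, _, _ = acc.Prepare(1)
	fmt.Println(ok, acc.Accept(0, "leader=node-9")) // false false
}
```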

Paxos is widely studied and deployed, but many teams choose alternatives because Paxos can be difficult to implement and reason about during maintenance.

Raft: Consensus Designed for Understandability

Raft targets the same goal as Paxos for crash-fault consensus, but emphasizes clarity in both the protocol and its implementation. Raft is commonly used to build replicated logs and is well-known in systems like etcd and Consul.

Raft’s design centers around a strongly defined structure:

  • Leader election: nodes transition among follower, candidate, and leader roles. Timeouts trigger elections when a leader is suspected dead.
  • Log replication: the leader appends commands to its log and replicates them to followers. An entry is committed once replicated to a majority.
  • Safety restrictions: rules ensure leaders have up-to-date logs and prevent committed entries from being overwritten.

A practical advantage of Raft is the operational story it supports. It provides explicit concepts like terms, heartbeats, and leader leases (in some implementations) that map well to debugging realities: administrators can often explain what happened during a failure by inspecting term changes and commit indexes.
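
One small but concrete piece of that story is how a leader decides what is committed. The sketch below (Go, with illustrative names and example matchIndex values) advances the commit index to the highest log index replicated on a majority; real Raft additionally requires that the entry belongs to the leader's current term before it may be committed.

```go
// Raft-style commit rule sketch: the leader commits up to the highest log
// index that a majority of nodes (including itself) have replicated.
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest index replicated on a majority of the
// cluster, given each node's matchIndex (leader's own log length included).
func commitIndex(matchIndex []int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// The value at position floor(n/2) is held by at least floor(n/2)+1
	// nodes, i.e. a majority.
	return sorted[len(sorted)/2]
}

func main() {
	// 5-node cluster: leader is at index 7, followers lag at various points.
	match := []int{7, 7, 5, 4, 2}
	fmt.Println(commitIndex(match)) // 5: entries up to 5 are on 3 of 5 nodes
}
```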

Byzantine Fault Tolerance: When Nodes Can Lie

Crash faults assume nodes either respond correctly or stop responding. In some environments, that is not enough. If nodes can behave arbitrarily, intentionally or not, consensus requires Byzantine fault tolerance (BFT).

Byzantine fault tolerant protocols must defend against behaviors like:

  • sending conflicting messages to different nodes (equivocation)
  • fabricating votes
  • selectively ignoring requests
  • replaying old messages

BFT systems typically require stronger quorums. A common threshold is that to tolerate f Byzantine nodes, the system needs 3f + 1 replicas. That overhead is one reason BFT is used selectively, often in settings where adversarial behavior is a real concern.
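
The sizing rule is easy to express directly. The sketch below (Go, illustrative function names) computes the minimum replica count for a given f, along with the 2f + 1 quorum size used in PBFT-style protocols so that any two quorums overlap in at least one correct replica.

```go
// BFT sizing sketch: tolerating f Byzantine replicas requires n >= 3f + 1
// replicas, with protocol quorums of 2f + 1 matching responses.
package main

import "fmt"

// bftClusterSize returns the minimum number of replicas needed to
// tolerate f Byzantine faults.
func bftClusterSize(f int) int { return 3*f + 1 }

// bftQuorum returns how many matching replies a replica must collect
// before acting, for a cluster sized to tolerate f Byzantine faults.
func bftQuorum(f int) int { return 2*f + 1 }

func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("f=%d replicas=%d quorum=%d\n", f, bftClusterSize(f), bftQuorum(f))
	}
	// Compare with crash faults, where tolerating f failures needs only
	// 2f + 1 nodes and majority quorums of f + 1.
}
```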

BFT is not just “Paxos with signatures.” The possibility of lying changes the protocol structure, message patterns, and verification requirements. In exchange for extra cost, BFT can provide stronger correctness guarantees in hostile conditions.

CAP Theorem: Consensus Under Partitions

The CAP theorem frames a central tradeoff in distributed systems: in the presence of a network partition, a system must choose between consistency and availability. Partitions are not hypothetical; they happen due to network failures, misconfigurations, and overloaded links.

Consensus protocols generally prioritize consistency during partitions by requiring quorum agreement. If the cluster splits into two groups, only the side with a majority can continue committing new entries. The minority side may remain online, but it cannot safely make progress on the replicated log without risking divergence.

This behavior is often misunderstood as “downtime,” but it is a deliberate safety choice. A system that continues accepting writes on both sides of a partition must either reconcile conflicting histories later or abandon strong consistency.
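
Expressed as a check, the rule is simply whether a partition side still reaches a majority. A toy sketch in Go, with illustrative numbers:

```go
// Partition behavior sketch: only the side of a split that still holds a
// strict majority of the full cluster may continue committing entries.
package main

import "fmt"

// canCommit reports whether a partition side with `reachable` out of
// `clusterSize` nodes still holds a quorum.
func canCommit(clusterSize, reachable int) bool {
	return reachable >= clusterSize/2+1
}

func main() {
	// A 5-node cluster splits 3 / 2: only the 3-node side keeps committing.
	fmt.Println(canCommit(5, 3)) // true
	fmt.Println(canCommit(5, 2)) // false
}
```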

Consistency Models: What the Application Actually Gets

Consensus is one building block in a larger story about consistency models. Even when a system uses consensus internally, the external behavior depends on how reads and writes are served.

Key ideas include:

  • Linearizability (strong consistency): operations appear to occur atomically in a single global order consistent with real time. Many consensus-backed systems aim for linearizable writes and often require leader coordination for linearizable reads.
  • Sequential consistency: operations are seen in a single order, but not necessarily aligned with real time.
  • Eventual consistency: replicas converge over time, but reads can return stale data. Systems optimized for availability under partitions often choose eventual consistency or related models.

In practice, many databases offer tunable consistency, such as requiring reads to consult a quorum or allowing stale reads from followers. The right choice depends on whether your application can tolerate anomalies like reading old values or observing reordering across different keys.
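
As a toy illustration of that choice, the sketch below models a leader holding the latest committed value and a lagging follower, and routes reads based on a made-up consistency mode. Real systems expose this as options such as quorum reads, leader reads, or explicitly stale follower reads; the names here are assumptions for illustration only.

```go
// Read-path tradeoff sketch: linearizable reads go through the leader (or a
// read quorum), while follower reads are cheaper but may return stale data.
package main

import "fmt"

type ReadMode int

const (
	Linearizable ReadMode = iota // route through the leader
	Stale                        // read from any follower, possibly behind
)

type Replica struct{ value string }

// read returns the value according to the requested consistency mode.
func read(mode ReadMode, leader, follower *Replica) string {
	if mode == Linearizable {
		return leader.value
	}
	return follower.value
}

func main() {
	leader := &Replica{value: "v2"}   // latest committed write
	follower := &Replica{value: "v1"} // has not yet applied v2

	fmt.Println(read(Linearizable, leader, follower)) // v2
	fmt.Println(read(Stale, leader, follower))        // v1 (stale but cheaper)
}
```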

Choosing the Right Approach

Consensus is not a feature you add casually. It shapes latency, throughput, failure behavior, and operational complexity. Practical selection tends to follow the fault model and the product requirements:

  • If you need strong consistency for coordination, leader election, and replicated metadata under crash faults, Raft is often chosen for its implementation clarity.
  • If you need maximum theoretical grounding and are willing to deal with complexity, Paxos remains a strong foundation and appears in many mature systems, sometimes in specialized variants.
  • If the threat model includes malicious or arbitrary node behavior, consider Byzantine fault tolerance, accepting the higher replication and communication cost.
  • If you prioritize availability during partitions, you may intentionally avoid strict consensus for all operations and instead adopt weaker consistency models with explicit application-level reconciliation.

Consensus is ultimately about trust: trust in the result, trust in the ordering, and trust that failures will not silently corrupt the system. When designed and applied well, agreement protocols turn unreliable components into a coherent platform that applications can build on with confidence.
