Paper Note: [USENIX ATC'14] In Search of an Understandable Consensus Algorithm (Raft)
This article introduces Raft's leader election and log replication mechanisms, and discusses how it differs from Paxos in understandability and implementation.
Raft is a distributed consensus algorithm for solving the problem of getting multiple distributed nodes to reach agreement in an asynchronous communication network where nodes may fail and local storage is reliable. Given that Paxos had already been proposed to solve this problem, Raft was proposed mainly to address the difficulty of understanding Paxos. Paxos is more oriented toward theoretical research. Although it briefly mentions many implementation details, it does not explain or discuss them in depth, so implementing and optimizing the algorithm still involves many difficulties.
In my personal opinion, the Raft algorithm itself is not easier to understand than Paxos. It is like non-standard analysis in mathematics compared with standard real analysis based on the \(\epsilon-\delta\) definition of limits. Although using non-standard analysis to describe the concepts inside calculus is easier to understand, the foundational proof that guarantees its correctness is very difficult to understand. In standard real analysis, describing the concepts inside calculus is somewhat roundabout, but the difficulty of its foundational proof is not as high. By comparison, the safety and liveness properties of Paxos are easy to understand, while the Raft protocol gives some property constraints, but deriving a proof of linearizability consistency from them is much harder to understand [3]. In addition, this paper also provides some implementation details that are not discussed clearly in Paxos Made Simple [2], so it is friendlier to people implementing the algorithm. Relatively speaking, however, it is also much harder to optimize further.
Basic Idea
In the Single-decree Paxos protocol, if you want to make efficient progress, leader election is also needed. However, the Paxos paper introduces leader election quite late. Only at the final step, when the algorithm may be unable to make progress without a leader, does it propose introducing a primary node and always letting the primary node win so that the algorithm can proceed. In the Multi-decrees Paxos scenario, if there is already a (unique) primary node, the algorithm can also be optimized to a considerable extent. The Raft protocol develops exactly this point: first elect a unique leader, and then have that leader drive the log replication process. Here, the Raft protocol also makes an assumption: client requests are handled only by the leader. Considering the requirements above, the first part required by the Raft protocol is leader election, with the following constraints:
At any time, there can be at most one leader
Client requests are handled only by the leader
The leader never modifies its own log of state-machine commands; it only appends to it
Logs that the leader has already committed will necessarily appear in the log of the new leader after a leadership change
In the paper’s wording, these are:
| Election Safety | at most one leader can be elected in a given term. §5.2 |
| Client interaction | Clients of Raft send all of their requests to the leader. §8 |
| Leader Append-Only | a leader never overwrites or deletes entries in its log; it only appends new entries. §5.3 |
| Leader Completeness | if a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms. §5.4 (Here, term refers to the leader’s term of office.) |
Among these, Leader Append-Only and Leader Completeness constrain leader election so that the new leader must have the latest committed log.
Once there is a leader, the problem becomes simple. The leader only needs to order client requests and replicate them to the other nodes. This leads to the following properties:
| Log Matching | if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. §5.3 |
| State Machine Safety | if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. §5.4.3 |
Intuitively, there should be no problem: the leader always has the latest state, all requests are sent to the leader, and the leader responds with the request result after replicating it to a majority. Liveness is concentrated mainly on leader election, namely whether a leader can be guaranteed to be elected under these constraints. This paper does not prove that conclusion.
Implementation Details
The core of the Raft protocol is leader election. Raft uses a heartbeat mechanism to detect whether the leader is alive, thereby triggering the leader election process. Each node has three states: follower (Follower), candidate (Candidate), and leader (Leader). Each node initially starts in the follower state. If it believes the leader node has failed, it starts the leader election process:
First, it increments the term sequence number, and then changes its own state to candidate.
Then it proposes electing itself as leader, and broadcasts this request to the other nodes.
It repeats the steps above until one of the following situations occurs:
It wins the election
Another node has already become the leader
When a node receives approval from a majority of nodes, it can be considered to have won the election. Each node approves election requests from other nodes on a first-come, first-served basis. (For the reason here, refer to Paxos: if only one node runs for election, it will also win the election.) Considering the Leader Append-Only and Leader Completeness properties, this process should impose reasonable restrictions on leader election to prevent the newly elected leader from lacking the latest committed state, which would violate the State Machine Safety property. This restriction is: include the latest committed index currently known to the requester in the leader election request. When a node responds to another node’s leader election request, if the other party has a lower value than its own current latest committed index, it rejects that leader election request.
Raft chooses randomized waiting before retrying to avoid election ties as much as possible. According to the conclusion in [4], without relying on real physical time (including random quantities), the failure of only one node may make leader election fail. Therefore, we can only try to avoid this situation; there is no deterministic strategy that solves this problem without corner cases. Although there are other possible strategies for avoiding this situation, the authors of the Raft paper believe that a randomized back-off strategy is simple and less prone to corner cases. I personally agree with this view.
Once there is a leader, the other problems are easier to handle. All client requests are sent uniformly to the leader. These requests are ordered on the leader and then processed sequentially. The leader broadcasts the requests it receives to all other nodes. Once a majority of nodes acknowledge receiving a request, that request can be considered committed. At this point, the leader applies the operation to its own state machine and returns the result to the client.
Although the leader may fail before it has had time to send a request to a majority of nodes, two possibilities can occur at this point:
The next leader continues completing this request
The next leader abandons this request and overwrites it with other content
Although these two situations may alternate in any order, it can still be guaranteed that a request is committed if and only if it has been propagated to a majority of nodes, and once committed it will not be rewritten or overwritten. The original paper [1] does not provide a formal proof, but intuitively it feels correct because reasonable restrictions are imposed during leader election. The paper [3] contains a correctness proof (not sure; I have not finished reading it).
Generally speaking, the operation log of a state machine system will not be infinitely long. Therefore, when necessary, we need to compact the state-machine operation log. The Raft paper uses state-machine snapshots for log compaction, but many other approaches can also achieve this purpose, such as LSM (Log-Structured Merge tree). A snapshot can be regarded as the result of sequentially executing all operation logs before that point. After retaining a certain number of operation logs, the older subset of logs is replaced by the state-machine snapshot. One issue with snapshots is that some node may lag too far behind the leader, so that the operation logs it needs have already become part of a snapshot and individual log entries are no longer available. This may be caused by a slow node, a network partition, or a newly added node. In this case, the node can only fetch the latest snapshot, restore the state machine, and then continue fetching and applying the operation logs after that point.
Membership Changes
Membership changes become a problem because, due to different membership compositions, at the moment of switching, the two different groups of members may each elect their own leader, resulting in two leaders and violating the Election Safety property.
Because Paxos does not require that there must be a leader, nor does it require that there be only one leader, it does not need special handling; the member list can simply be treated as a state of the state machine. The Paxos paper [2] designs each proposer to buffer at most \(\alpha\) commands from clients, so it only needs to place the membership change command at the \(\alpha + 1\)th command to be committed. Before this membership change command takes effect, the newly added node can simply act as a learner. In addition, the approach in the paper [5] can also be referenced.
Raft handles this problem using overlapped clusters. During a membership change, it requires the two overlapping clusters to agree on the same leader. Note that nodes removed from the cluster may form a small closed system and spontaneously elect a leader, thereby interfering with the normal cluster. Special handling is needed here: when a follower (Follower) believes the leader is alive (through heartbeat packets), it no longer accepts leader election requests from other nodes.
A Simple Comparison with Paxos
Raft can be considered a simplified special-case version of Paxos. Because it provides more engineering details and omits some proof processes, it becomes easier to understand and implement. However, if further optimization is needed, Paxos should be a better choice. The two currently known optimization points are:
Leaderless parallel commit (Raft’s premise is that there is exactly one leader, so this cannot be optimized at all)
Commutativity of state-machine commands (for example, in a key-value store scenario, operations on two different keys are commutative. Raft does not have parallel commit, so there is also no way to optimize this)
References
[1] Diego Ongaro, and John Ousterhout. 2014. In Search of an Understandable Consensus Algorithm. Atc ’14 22, 2 (2014), 305–320. DOI:https://doi.org/10.1145/1529974.1529978
[2] Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News 32, 4 (2001), 51–58. DOI:https://doi.org/10.1145/568425.568433
[3] Doug Woos, James R Wilcox, and Steve Anton, et al. 2016. Planning for change in a formal verification of the raft consensus protocol. In Proceedings of the 5th ACM SIGPLAN Conference on Certified Programs and Proofs - CPP 2016, 154–165. DOI:https://doi.org/10.1145/2854065.2854081
[4] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985), 374–382. DOI:https://doi.org/10.1145/3149.214121
[5] Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. 2009. Vertical paxos and primary-backup replication. Proc. 28th ACM Symp. Princ. Distrib. Comput. - Pod. ’09 February (2009), 312. DOI:https://doi.org/10.1145/1582716.1582783