Paper Notes: Paxos Made Simple

Paxos is a distributed consensus algorithm used to solve the problem of multiple distributed nodes reaching agreement in an asynchronous communication network when nodes may fail and local storage is reliable. The algorithm in the paper only solves the case of deciding a single decree. If multiple decrees need to be decided, the algorithm can be executed multiple times, once for each decree. It is conceivable that some steps may not need to be repeated across multiple decisions, so the algorithm can be further optimized when deciding multiple decrees, but this article does not discuss those optimizations in detail.

Algorithm Principles

To ensure the correctness of the algorithm, we need the following three basic safety properties:

Only a value that has been proposed may be chosen,
Only a single value is chosen, and
A process never learns that a value has been chosen unless it actually has been.

Among these, 1 rules out choosing an incorrect proposal, 2 rules out inconsistency among nodes, and 3 rules out learning that a proposal has been chosen before it actually has been (non-triviality).

In addition to ensuring the correctness of the algorithm, we also need to ensure that the algorithm can actually reach the state we want. At this point, however, we do not need to define the liveness condition precisely. Instead, we consider how to ensure, while preserving correctness, that some value is eventually chosen and that a process can eventually learn the result.

Based on these requirements, we introduce three roles: proposer, acceptor, and learner. Each process may play one or more of these roles.

The simplest approach is to select a single acceptor. All proposers send proposals to that acceptor, and the acceptor only needs to accept the first proposal it receives to satisfy all of the requirements above. However, once this single acceptor fails, we can make no progress. Therefore, we must set up multiple acceptors and collectively perform majority voting. Majority voting can satisfy all of the basic safety properties mentioned above, especially 2 (which can be proven by contradiction).

Without considering node failures and message loss, we want a proposal to be chosen even if there is only one proposal globally. This requires us to satisfy the following property:

P1	An acceptor must accept the first proposal that it receives.

However, this requirement introduces a problem: each proposer may simultaneously submit a different proposal to a different acceptor, at which point each acceptor accepts a different proposal and no majority can be formed. This problem is not triggered only by extremely rare cases. For example, even if only two proposals are each accepted by exactly half of the acceptors at the same time, the failure of one acceptor node may leave the remaining acceptors supporting two different proposals in equal numbers. In that situation, although one and only one proposal has been chosen, we cannot learn which proposal it is.

Therefore, property P1 and the use of majority voting to choose proposals determine that we must allow acceptors to accept multiple proposals. To do this, we assign a number to each distinct proposal value, so that each distinct proposal value has a different number, and the numbers are totally ordered. In this way, every proposal has the form \((n, v)\). We can therefore allow acceptors to choose multiple proposals, as long as those proposals have the same proposal value \(v\). By applying mathematical induction over proposal numbers, we can guarantee this with the following property:

P2	If a proposal with value \(v\) is chosen, then every higher-numbered proposal that is chosen has value \(v\) .

For a proposal to be chosen, it must first be accepted by acceptors. Therefore, we can consider constraining the acceptors when they accept proposals so that property P2 is satisfied:

P2a

If a proposal with value \(v\) is chosen, then every higher-numbered proposal accepted by any acceptor has value \(v\) .

Because communication is asynchronous, there may be an acceptor that does not know that some other acceptor has already accepted a proposal. At this point, if this acceptor receives a new proposal, property P1 requires it to accept the proposal, which may violate property P2a. Since all proposals are issued by proposers before being accepted by acceptors, we can consider strengthening the condition and having proposers guarantee this property:

P2b

If a proposal with value \(v\) is chosen, then every higher-numbered proposal issued by any proposer has value \(v\).

Applying mathematical induction to proposal number \(n\), we can see that satisfying the following property is sufficient to guarantee property P2b:

P2c

For any \(v\) and \(n\), if a proposal with value \(v\) and number \(n\) is issued, then there is a set \(\mathbb{S}\) consisting of a majority of acceptors such that either (a) no acceptor in \(\mathbb{S}\) has accepted any proposal numbered less than \(n\), or (b) \(v\) is the value of the highest-numbered proposal among all proposals numbered less than \(n\) accepted by the acceptors in \(\mathbb{S}\).

To satisfy property P2c, if a proposer wants to propose a proposal numbered \(n\), it must know whether any proposal with a number less than \(n\) has already been accepted by a majority of acceptors or will be accepted by a majority of acceptors before the proposal numbered \(n\) is accepted by a majority. We can learn which proposals have already been accepted, but it is hard to predict whether some other proposal will be accepted by a majority before the proposal numbered \(n\) is accepted by a majority. To avoid this situation, we can require acceptors in advance to make a promise not to accept any proposal numbered less than \(n\). Since P2c implies P2b, P2b implies P2a, and P2a implies P2, we can use the following algorithm to guarantee P2c and thereby satisfy property P2:

A proposer chooses a new proposal number \(n\) and sends a request to each member of some set of acceptors, asking it to respond with:
1. A promise never again to accept a proposal numbered less than \(n\), and
2. The proposal with the highest number less than \(n\) that it has accepted, if any.
I’ll call such a request a prepare request with number \(n\)
If the proposer receives the requested responses from a majority of the acceptors, then it can issue a proposal with number \(n\) and value \(v\), where \(v\) is the value of the highest-numbered proposal among the responses, or is any value selected by the proposer if the responders reported no proposals.

This method actually violates property P1. Considering that property P1 applies only when accept requests are initiated, while our algorithm actually has two kinds of requests, prepare and accept, receiving a prepare request means that an accept request with the same proposal number will follow. Therefore, when we receive an accept request with a different number, we can always ignore it without violating the property that we can guarantee at least one proposal can still be accepted. From this, we can derive a strengthened form of property P1:

P1a

An acceptor can accept a proposal numbered \(n\) iff it has not responded to a prepare request having a number greater than \(n\).

Since P1a implies P1, we have obtained an algorithm that satisfies all safety properties. With some optimization of the details, we get the following version:

Phase 1

A proposer selects a proposal number \(n\) and sends a prepare request with number \(n\) to a majority of acceptors.
If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Phase 2

If the proposer receives a response to its prepare requests (numbered \(n\)) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered \(n\) with a value \(v\), where \(v\) is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
If an acceptor receives an accept request for a proposal numbered \(n\), it accepts the proposal unless it has already responded to a prepare request having a number greater than \(n\).

As long as a proposer follows the algorithm above, it is allowed to initiate multiple proposals. A proposer may also abandon an already initiated proposal at any time without affecting correctness. When an acceptor ignores a proposal because it has already promised not to accept lower-numbered proposals, correctness is not affected; however, for performance reasons, it is better to tell that proposer the largest proposal number currently known.

This algorithm can correctly choose a proposal, but learners are still needed so that all processes learn the result. Although all acceptors could notify all learners when they choose a proposal, this scheme requires many broadcasts (the size of the Cartesian product of the two sets) and has poor performance. At this point, it is sufficient to notify only a few distinguished learners, which then notify all learners. We need to consider that, because messages may be lost, all distinguished learners may fail to learn the currently chosen proposal. In that case, the proposer only needs to initiate proposals again until one proposal is chosen; then the chosen proposal can be known.

Although this algorithm can guarantee correctness, it cannot guarantee that progress is always made before the algorithm terminates; that is, it cannot guarantee that a consistent decision will eventually be reached. For example, two proposers may alternately propose new proposals, with each proposal number higher than the other proposer’s proposal number. In this case, every prepare request succeeds, but every accept request is rejected, so the algorithm cannot terminate. To ensure that the algorithm can terminate, we need to select a unique distinguished proposer. (The portion from here to the end of this paragraph is uncertain) Only this proposer, when it discovers that its proposal has been ignored by a majority of acceptors, chooses a new, sufficiently large proposal number and resubmits the proposal. Other proposers, when messages are lost, may resend using the original proposal number to ensure that a majority of acceptors can receive the proposal, but they do not resubmit the proposal when it is rejected by acceptors. This means that, in the case of decision conflicts, the distinguished proposer always wins. Alternatively, this problem can be addressed using a collision-avoidance approach similar to computer networks; for example, when a conflict occurs, multiplicatively increase the interval before resubmitting a proposal, and when there is no conflict, additively decrease the interval. This makes it possible to operate without any leader at all, but the potential risk is increased latency.

If enough parts of the system (including proposers, acceptors, the communication network, and other components) work correctly, electing a unique distinguished proposer is sufficient to guarantee liveness. (The original paper does not prove this, but it is said that [4] contains a proof of liveness, [3] appears to contain a less formal proof, and [5] contains a proof for a strengthened form of the algorithm, but I have not read any of them yet.) The paper [2] implies the conclusion that a reliable election algorithm must rely on randomness or real time (for example, a timeout mechanism). Regardless of the election result, the correctness of the algorithm described in this paper can be guaranteed. (Whether or not there is a distinguished proposer, and whether or not there is only one distinguished proposer, the correctness of the algorithm can be guaranteed.)

Additional Details

The algorithm needs to ensure that every distinct proposal has a distinct proposal number. This can be done by assigning each proposer a disjoint set of numbers in advance. For example, if there are three proposers in total, then proposer 1’s proposal numbers can only be selected from the set of nonnegative integers whose remainder is 0 when divided by 3.

The proposal value \(v\) in the algorithm may need to be transmitted multiple times, so its size should also be kept from becoming too large. However, sometimes it is indeed necessary to reach consensus on a large object, such as a block replica. In my opinion, a strategy similar to the separation of control flow and data flow in GFS [6] can be used: stream the large value in advance, mark it, and propagate it to each node in node-chain order. Each node caches this content using a strategy such as LRU, and then a decision is made through the Paxos algorithm. In this way, the proposal value \(v\) only needs to be able to index into the large value.

The previous algorithm can only decide a single decree. In practice, we often need to use such an algorithm to maintain consistency of each node’s state among distributed nodes [7]. To do this, each node can keep the most recent state-machine commands and replicate the command log to guarantee consistency of the state-machine state. In this way, we only need to determine the position of each state-machine command in the log to guarantee state-machine consistency; that is, we run the single-decree Paxos algorithm once for each position in the command log. (The portion from here to the end of this paragraph is my personal understanding) Let every process simultaneously play all three roles: proposer, acceptor, and learner. Since we need at least one distinguished learner and a unique distinguished proposer, we might as well elect a leader among these processes through a leader election algorithm, and have it serve as both the distinguished learner and the distinguished proposer. All state-machine commands are always sent to the leader. If a state-machine command is sent to another node, because the leader always wins in decision conflicts, that command will always fail. Each node keeps a fixed number of single-decree Paxos algorithm instances running, for example \(\alpha\) instances. When no state-machine command has been received, these algorithm instances can complete Phase 1, and the leader always wins. When the leader receives a state-machine command, it only needs to determine the command’s position and have the Paxos algorithm instance corresponding to that position execute Phase 2 (the original author says this should be a lower bound on the complexity of consensus algorithms; it seems that the paper [8] proves this, but I have not read it yet). Because communication is asynchronous, it may happen that an earlier-numbered state-machine command has not completed submission through Phase 2, while a later command has already completed submission through Phase 2. At this point, if the leader fails and a new leader is elected, holes will be left in the state-machine command log. Moreover, because the original leader has failed, no other node can know what state-machine command corresponds to that position. To avoid blocking the entire state machine, the new leader can choose to fill these positions with no-op commands to skip the holes, allowing the whole process to continue. The algorithm mentioned above for replicating state-machine commands applies only to a fixed set of nodes. If the node set needs to be adjusted at runtime, the simplest approach is to treat the command that adjusts the set as a state-machine command as well, with the current node set as the state maintained by the state machine.

Understanding Paxos from the perspective of two-phase commit (2PC): Phase 1 requires a majority of nodes to make a Promise (prepare), and Phase 2 requires those nodes to Commit (accept). Understanding it from the perspective of a quorum: when a Leader has already been elected, Phase 2 is executed directly, and only successful writes to a majority of nodes are required.

Reference

[1] Leslie Lamport. 2001. Paxos Made Simple. ACM SIGACT News 32, 4 (2001), 51–58. DOI:https://doi.org/10.1145/568425.568433
[2] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. 1985. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (April 1985), 374–382. DOI:https://doi.org/10.1145/3149.214121
[3] Leslie Lamport. 1998. The Part-Time Parliament. ACM Trans. Comput. Sys-tems 16, 2 (1998), 133–169. DOI:https://doi.org/10.1145/279227.279229
[4] Roberto De Prisco, Butler Lampson, and Nancy Lynch. 2000. Revisiting the paxos algorithm. Theor. Comput. Sci. 243, 1–2 (2000), 35–91. DOI:https://doi.org/10.1016/S0304-3975(00)00042-6
[5] Leslie Lamport. 2005. Generalized Consensus and Paxos. April (2005), 60. DOI:https://doi.org/MSR-TR-2005-33
[6] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. Proc. Ninet. ACM Symp. Oper. Syst. Princ. - SOSP ’03 (2003), 29. DOI:https://doi.org/10.1145/945449.945450
[7] Fred B. Schneider and Fred B. 1990. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22, 4 (December 1990), 299–319. DOI:https://doi.org/10.1145/98163.98167
[8] Leslie Lamport. 2006. Lower bounds for asynchronous consensus. Distrib. Comput. 19, 2 (2006), 104–125. DOI:https://doi.org/10.1007/s00446-006-0155-x