Paper Note: [SOSP 2007] Dynamo: Amazon's Highly Available Key-value Store
Reviews the consistency, replication, conflict handling, and routing designs in the Amazon Dynamo paper, and summarizes the key engineering trade-offs it made …
As the title suggests, this paper describes an approach to building a highly available key-value store. The high availability mainly targets write requests. The storage environment is trusted, and the size of stored objects generally does not exceed 1 MB. The implementation is similar to Chord + MVRs (multi-valued registers), but it also includes many performance optimizations. Dynamo provides observable causal consistency. According to Paper Note: [PODC 2015] Limitations of Highly-Available Eventually-Consistent Data Stores, this is already the strongest consistency this class of systems can provide. [3]
Dynamo’s main requirements and constraints are as follows:
- need an “always writable” data store where no updates are rejected due to failures or concurrent writes
- all nodes are assumed to be trusted
- do not require support for hierarchical namespaces or complex relational schema
- 99.9% of read and write operations to be performed within a few hundred milliseconds
- zero-hop routing for DHT
In addition, there is an implicit requirement hidden in the paper: Periodical archiving of the dataset is a mandatory requirement for most of Amazon storage services.
To satisfy high availability for write requests, Dynamo does not handle conflicts at write time. Instead, it uses vector versioning to retain multiple results at the same time. During reads, if the ordering relationship can be determined through vector versions, the results are merged and the merged result is written back to the source. Otherwise, the system can only choose simple LWW (last write wins) or let the client handle it.
Unlike ordinary key-value stores, Dynamo’s get/put requests need to carry an additional context, which is opaque to users. Judging from the paper, this context contains at least two pieces of information: the vector version (when a client wishes to update an object, it must specify which version it is updating. This is done by passing the context it obtained from an earlier read operation, which contains the vector clock information) and the response speed of each node in this request (the coordinator for a write is chosen to be the node that replied fastest to the previous read operation which is stored in the context information of the request).
Dynamo’s replication strategy combines proactive and passive strategies [3]. Its proactive strategy is a “sloppy quorum” strategy. The N/R/W values must be configured in advance and satisfy R + W > N (typically N = 3, R = W = 2). First, the system locates the get/put position according to Chord’s method, then counts backward from this position to find N healthy physical nodes. During get or put, requests must be sent to all N nodes, but as long as R or W nodes succeed, the result can be returned to the client. Its passive strategy works as follows: each node builds a Merkle tree for the range it is responsible for and propagates it through the gossip protocol. According to the proactive replica strategy, N nodes are responsible for the same range, and these nodes can use Merkle trees to quickly compare the differences and synchronize them.
When a node fails, the proactive replication strategy above writes values that originally belonged to node A into another node B. At this time, B stores the value (persistently) in a special list and records a hinted handoff marker so that it knows the value should be transferred to node A. After B’s background task detects that A has recovered, this record causes the value to be passed to A.
As you can see, this kind of read/write request is still fairly complex to handle inside the system. For each request, Dynamo creates a state machine on the coordinator node. The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for responses, potentially doing retries, processing the replies and packaging the response to the client.Each state machine instance handles exactly one client request. For instance, a read operation implements the following state machine:
- send read requests to the nodes
- wait for minimum number of required responses
- if too few replies were received within a given time bound, fail the request
- otherwise gather all the data versions and determine the ones to be returned and
- if versioning is enabled, perform syntactic reconciliation and generate an opaque write context that contains the vector clock that subsumes all the remaining versions.
For the sake of brevity the failure handling and retry states are left out.
System Optimizations #
Dynamo makes many optimizations on top of Chord + MVRs:
- zero-hop routing
- virtual node on DHT ring
- coordinator choosing optimization
- client-drive coordinating
- resolve conflicts on read
- async write to disk
- vector versioning size control
- vector versioning merge result write-back (read repair)
- feedback controlled background task
The optimizations listed above are explained separately below.
zero-hop routing #
zero-hop routing is intended to reduce routing latency. The time complexity of Chord routing is $O(\log n)$, which is relatively hard to accept for an online system. Sequentially reading 1 MB of data from disk on a single machine takes only 10 ms, while one RPC query may take 1 ms. Dynamo chooses to use the gossip protocol so that every node has (weakly consistent) knowledge of information about all nodes globally. In this way, zero-hop routing becomes possible. In the usual case, user requests arrive uniformly at some node through an SLB, and that node immediately knows which node it should forward the request to. Another case is the client-drive coordinating mentioned above: the client periodically pulls the node information of the whole system from any node, so that in subsequent communication it directly knows which node the request should be sent to. The size of this global information can be estimated using the following data:
- IP address: 4 bytes
- Port number: 2 bytes
- A series of virtual node ids: Q / S * 16 bytes
- The size of each virtual node id depends on the size of the hash space. The paper uses MD5, so each id is 16 bytes.
- According to strategy 3 mentioned below, virtual node per physical node = total number of ranges / number of nodes in the system.
There are S nodes in total, and each node needs (6 + Q / S * 16) bytes, so the whole system needs (6S + 16Q) bytes, where Q is the number of pre-partitioned segments in the hash space and Q is much larger than S. Assuming S = 1k and Q = 100k, about 1.6 MB of routing information needs to be maintained.
However, considering that the Q ranges are pre-partitioned and are public information known in advance to every node, it is possible to transmit the virtual node id index instead of the virtual node id. In this case, each node’s information requires (6 + Q / S * 4) bytes, and the whole system requires (6S + 4Q) bytes. Assuming S = 1k and Q = 100k, about 406 KB of routing information needs to be maintained. This order of magnitude should be acceptable.
The paper also mentions that the algorithm in [4] can be used for average $O(1)$ routing, although its worst case is still $O(\log n)$. Moreover, when determining the preference list, the correspondence between physical nodes and virtual nodes must be known; otherwise, fewer than N physical nodes may be selected when choosing N nodes (each node only needs to maintain information about the following N physical nodes). I personally suspect that this method cannot be used.
virtual node on DHT ring #
A physical node can have multiple virtual nodes. In a DHT ring, each virtual node is treated as a node on the DHT ring.
Using virtual nodes has the following advantages:
- If a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes.
- When a node becomes available again, or a new node is added to the system, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
- The number of virtual nodes that a node is responsible can decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
In a Chord DHT ring, the range of each interval is not fixed, and node joins or leaves may affect this interval range. For the Dynamo system, this leads to the following problems:
- the nodes handing the key ranges off to the new node have to scan their local persistence store to retrieve the appropriate set of data items.
- the Merkle trees for the new ranges need to be recalculated
- there was no easy way to take a snapshot of the entire key space due to the randomness in key ranges.
These problems may not be significant for systems that use RocksDB as the single-node engine, but such a system did not exist at that time. Dynamo’s choice was Berkeley Database for 10 KB objects and MySQL for larger sizes.
The Dynamo system uses the following strategy to determine virtual nodes and partition the DHT space. In this strategy, the hash space is divided into Q equally sized partitions/ranges and each node is assigned Q/S tokens(virtual node id) where S is the number of nodes in the system, Q is usually set such that Q » N. Each node needs to maintain the information regarding the partitions assigned to each node.
coordinator choosing optimization #
any of the top N nodes in the preference list is allowed to coordinate the writes. In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation which is stored in the context information of the request. This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting “read-your-writes” consistency
client-drive coordinating #
This was briefly discussed above when explaining zero-hop routing. Specifically, the client’s get/put request must first be routed to a suitable coordinator node. This node can be any node in the preference list, or the client can simply take responsibility for it. This coordinator needs to build a state machine to maintain the whole workflow. When the client and Dynamo are in the same data center, this is feasible and can significantly reduce latency. Table 2 in the paper shows that 99.9% latency dropped from 60 ms+ to 30 ms+, and average latency dropped from ~4 ms to 1-2 ms.
A client periodically picks a random Dynamo node and downloads its current view of Dynamo membership state. Using this information the client can determine which set of nodes form the preference list for any given key. (Currently clients poll a random Dynamo node every 10 seconds for membership updates.)
resolve conflicts on read #
This optimization targets high availability for write requests rather than performance. For the concrete implementation details, see Section 4.4 of the paper and Paper Note: Time, clocks, and the ordering of events in a distributed system. Understandably, in an eventually consistent system, it is always possible for two different clients to write to two different nodes at the same time. In this case, if consistency is controlled at write time, either one of the write requests must be rejected, or one write request will be overwritten by the other. If the former strategy is used, high availability is lost. If the latter strategy is used, it is equivalent to LWW (last write wins), but some information is lost. LWW may be inappropriate for some business scenarios; after all, sometimes the business side can perform lossless merging based on content. See Paper Note: [Inria RR-7506] A comprehensive study of Convergent and Commutative Replicated Data Types.
async write to disk #
Dynamo provides the ability to trade-off durability guarantees for performance. In the optimization each storage node maintains an object buffer in its main memory. Each write operation is stored in the buffer and gets periodically written to storage by a writer thread. In this scheme, read operations first check if the requested key is present in the buffer. If so, the object is read from the buffer instead of the storage engine.
This optimization has resulted in lowering the 99.9th percentile latency by a factor of 5 during peak traffic even for a very small buffer of a thousand objects. Also, as seen in the figure, write buffering smooths out higher percentile latencies. Obviously, this scheme trades durability for performance. In this scheme, a server crash can result in missing writes that were queued up in the buffer. To reduce the durability risk, the write operation is refined to have the coordinator choose one out of the N replicas to perform a “durable write”. Since the coordinator waits only for W responses, the performance of the write operation is not affected by the performance of the durable write operation performed by a single replica.
vector versioning size control #
A possible issue with vector clocks is that the size of vector clocks may grow if many servers coordinate the writes to an object. In practice, this is not likely because the writes are usually handled by one of the top N nodes in the preference list. In case of network partitions or multiple server failures, write requests may be handled by nodes that are not in the top N nodes in the preference list causing the size of vector clock to grow. In these scenarios, it is desirable to limit the size of vector clock. To this end, Dynamo employs the following clock truncation scheme: Along with each (node, counter) pair, Dynamo stores a timestamp that indicates the last time the node updated the data item. When the number of (node, counter) pairs in the vector clock reaches a threshold (say 10), the oldest pair is removed from the clock. Clearly, this truncation scheme can lead to inefficiencies in reconciliation as the descendant relationships cannot be derived accurately. However, this problem has not surfaced in production and therefore this issue has not been thoroughly investigated.
vector versioning merge result write-back (read repair) #
After the read response has been returned to the caller the state machine waits for a small period of time to receive any outstanding responses. If stale versions were returned in any of the responses, the coordinator updates those nodes with the latest version. This process is called read repair because it repairs replicas that have missed a recent update at an opportunistic time and relieves the anti-entropy protocol from having to do it.
feedback controlled background task #
Each of the background tasks uses this controller to reserve runtime slices of the resource shared across all background tasks. A feedback mechanism based on the monitored performance of the foreground tasks is employed to change the number of slices that are available to the background tasks. For example, the background controller checks to see how close the 99th percentile database read latency (over the last 60 seconds) is to a preset threshold (say 50ms).
Other Questions #
While reading the paper, I encountered some questions. Some of them were answered in the paper, but for others the paper did not provide concrete solutions.
Section 4.6 of the paper mentions: In essence, the preference list of a key is constructed such that the storage nodes are spread across multiple data centers. This scheme of replicating across multiple datacenters allows us to handle entire data center failures without a data outage. This approach does not feel easy to implement. Intuitively, we need a total of S / N nodes located in other DCs (datacenters), and these nodes should be evenly distributed across the entire DHT space. However, this scheme will inevitably have cases where all N nodes are in the same DC.
The paper mentions that the coordinator needs to create a state machine for each request to handle the entire read/write process. What happens if the coordinator fails at this point? In my opinion, this is not a problem. First, the user has not received a reply, which is equivalent to the request timing out. At this point, the user should consider this request to have two possible states: failed or successful. If no node was written successfully, then the write request is effectively lost, meaning the request failed. If at least one node was written successfully, then later, when synchronization occurs through the Merkle tree, this record will be synchronized to all nodes that need to store it. If the node containing this record fails before synchronization, then the record is gone, which is equivalent to the request failing. If the record is read by a user before the failure, then according to the read repair optimization, this record will be propagated to all nodes that need to store it. If the failed node later returns and a write conflict occurs, it will also be handled by vector versioning.
The following are several questions that the paper did not mention but that came to my mind.
Node joins and departures cause virtual node tokens to be reassigned. How is this process carried out? If a newly joined node actively steals a token from other nodes, then multiple newly joined nodes may choose the same token at the same time. If nodes in the system discover the newly joined node and actively give it a token, that may be a relatively good choice, but they may need to record whom they have given tokens to, or give tokens with some probability. For example, when a node finds that its token count is greater than Q / S, it gives one of its tokens to the node that currently has the fewest tokens, and marks all corresponding records with a hand-off hint. The token can be given out when the record transfer is almost complete. At that time, new records are marked with a hand-off hint, and reads do not depend critically on that copy of the data. When a node leaves the system, how should the tokens it owns be handled? Obviously, that node cannot actively give them to someone else, because when a node fails and exits, it has no chance to execute actions. It seems the only option is to wait until nodes in the system find that their token count is less than Q / S, then check whether there are missing tokens and pick one up. Then the question arises: what if two nodes pick up the same token? How can we ensure that the token I pick, together with the existing tokens, forms a set of tokens uniformly distributed across the whole DHT space? The latter may be hard to do, so for now let us assume equal probability is good enough. As for the former, without using a consensus algorithm, it is probably very hard to get right. The simplest method may still be to arrange a group of coordinators outside the system, use a consensus algorithm for leader election to ensure there is no single point of failure, and then discover and evenly allocate these tokens.
Reference #
[1] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s Highly Available Key-value Store. Proc. Symp. Oper. Syst. Princ. (2007), 205–220. DOI:https://doi.org/10.1145/1323293.1294281
[2] Hagit Attiya, Faith Ellen, and Adam Morrison. 2017. Limitations of Highly-Available Eventually-Consistent Data Stores. IEEE Trans. Parallel Distrib. Syst. 28, 1 (2017), 141–155. DOI:https://doi.org/10.1109/TPDS.2016.2556669
[3] Stephanos Androutsellis-Theotokis and Diomidis Spinellis. 2004. A survey of peer-to-peer content distribution technologies. ACM Comput. Surv. 36, 4 (December 2004), 335–371. DOI:https://doi.org/10.1145/1041680.1041681
[4] Venugopalan Ramasubramanian and Emin Gun Sirer. 2004. Beehive: O(1)lookup performance for power-law query distributions in peer-to-peer overlays. System 1, 1 (2004), 8. Retrieved from http://portal.acm.org/citation.cfm?id=1251175.1251183
[5] Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July 1978), 558–565. DOI:https://doi.org/10.1145/359545.359563
[6] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. 2011. A comprehensive study of Convergent and Commutative Replicated Data Types. (2011).