Distributed Storage System From Scratch: Prologue

A New Beginning

In recent years, as the scale of the Internet has grown, the amount of data we need to process has also increased; with the development of machine learning, our data has also become more valuable. Against this backdrop, large-scale distributed systems have become increasingly important. Unfortunately, because this field emerged relatively late, there are relatively few learning resources, and people’s knowledge and understanding of it are still fairly limited. I know some outstanding engineers who graduated from top universities and work at leading companies. Although they use products from the Hadoop ecosystem in their daily work, their understanding of the underlying principles of large-scale distributed systems is also quite limited.

Therefore, I believe that sharing my limited knowledge and helping everyone develop an initial, intuitive understanding of distributed storage systems is still very meaningful. So I am preparing to start such a series.

The goal of the introductory series is to implement a Redis-like distributed storage system with the following capabilities:

Large scale: it can support a storage cluster composed of thousands of machines, with read and write data distributed relatively evenly across these machines (hereafter, machines are called nodes).
High availability: every node in the system may fail, and automatic failure recovery can be performed after any node fails.
Disk storage: for each node, data is persistently stored on disk, but memory can be used as a cache for hot data.
It uses an interface similar to Redis (limited to string interfaces), and supports some advanced features beyond Get/Put.

System Architecture

The architecture of the target system is shown in System architecture diagram:

Figure 1. System architecture diagram

All user requests first enter the API Server, which then handles the system’s internal logic. The API Server periodically synchronizes data topology information with the MetadataServer, such as which DataServers hold which data, and then forwards user requests to the corresponding DataServers. The DataServer is the actual holder of the data and the actual provider of the service. DataServers periodically synchronize with the MetadataServer, report their status information to the MetadataServer, and receive control instructions from the MetadataServer.

Content Plan

As an introductory series, the main thread is to gradually implement the system mentioned above through learning by doing. However, every aspect of this system will be relatively simple, its functionality will be incomplete, and its performance will still need optimization. When implementing each small aspect of this system, I will first introduce some basic theory related to that aspect, and then choose one of the easier implementation approaches to explain in detail. The specific content and order are roughly as follows:

Introduce the relevant background.
Implement a standalone key-value storage system and support RPC calls.
Add primary-secondary replication to the standalone storage system.
Expand the standalone storage system into a multi-machine storage system through data sharding.
Add an API Service to hide internal implementation details from external users.
Add transaction coordination to support MGET/MSET commands.
Discuss other features, such as TTL, snapshots, backups, and so on.
How to evolve toward a full-featured Redis.
Kubernetes and operations automation.
How to evolve toward NewSQL.

The sample implementations in these articles are described using the latest standard of the C++ language, because C++ is the dominant language for implementing storage systems in the industry.