Introduction to Distributed Systems

This article is written especially for readers who want to learn distributed systems but do not yet know where to start. It broadly and briefly introduces some of my own immature understanding of various aspects of distributed systems, helping readers build a panoramic view of the distributed systems field so they can then find areas of interest for deeper study.

To learn distributed systems, you need to answer the following questions:

(Requirement analysis) What problems do distributed systems mainly solve? What are their main application scenarios?
(Implementation approaches) What are the common problems in building distributed systems? What are the mainstream solutions to these problems?
(Technical challenges) What are the fundamental difficulties in implementing distributed systems? Which problems are affected by these difficulties?
(Industrial applications) What distributed systems are being built in industry? How are they developing?

A special note: this article is written entirely from personal experience, and some views and understanding may be incorrect. If you have any questions, please discuss them in the comments. Thank you.

Basic Problems and Application Scenarios of Distributed Systems

Distributed systems mainly solve the following two problems:

Scalability, specifically Scale Out
High Availability, specifically providing higher availability externally by designing software systems that combine multiple devices with relatively lower reliability

The arrival of the Internet era means that the scale of the problems we need to solve keeps growing, to the point where the growth rate of a single machine’s capacity can no longer meet demand. The capacity here may be compute resources (CPU, memory), storage resources (disks), or even network resources (network I/O). Distributed systems coordinate multiple endpoints at the software layer to work together and provide services, satisfying the demand for capacity. In essence, this is the same role played by technologies such as operating systems and virtual machines: using software to hide lower-level details from upper layers.

As the scale of distributed systems continued to grow, people (Google) discovered that using large amounts of commodity hardware instead of commercial high-availability hardware could save substantial costs. As a result, technologies that simulate high availability in software emerged, and this is also one of the major problems distributed systems focus on solving.

In essence, the development of systems software is the process of using software to hide underlying details. Once this idea is clear, the application scenarios for distributed systems are easy to identify. Start with the lowest-level hardware:

Compute resources (CPU)
Storage (disks, memory)
Communication networks (network cards, switches)
Entire physical machines (virtual machines)

Then move up to the operating system:

Scheduling (processes, threads)
Isolation (time-slice sharing, queues, namespaces)
Entire operating systems (containers)

Finally, move to specific applications: Software as a Service.

One point that particularly deserves attention is that although all of these cases hide underlying details, the degree to which those details should be hidden is still a question worth exploring. In my view, attempts that advocate completely hiding lower-level details—especially hiding the distinction between distributed systems and single-machine systems—have basically failed. Exploration in this area is mainly reflected in the topic of building Distributed Shared Memory. For performance reasons, upper layers need to understand and cooperate with lower-level details to some extent. Therefore, what details should be exposed and what interfaces should be provided remain questions worth considering. However, as hardware and lower-level infrastructure develop, this question also keeps changing.

Common Problems and Solutions in Building Distributed Systems

Node Organization

The most basic problem in distributed systems is how to organize distributed nodes so they can function together. There is a summary of this discussion in Paper notes: Some surveys on P2P. In general, distributed nodes can be organized in the following ways:

Hybrid Decentralized: A central node performs coordination or indexing, while the other parts are distributed, such as HDFS and BitTorrent.
Partially Centralized: There is no central node, but there are super nodes, such as a structure similar to DNS, where the parent layer of each layer is an elected super node.
Purely Decentralized: Fully distributed; nodes are completely peers.
- Unstructured: Nodes do not form a fixed structure. Similar to physical routers, they only know a few neighboring nodes. Therefore, operations such as queries mainly rely on flooding or routing operations similar to physical routers.
- Structured: Nodes form a fixed structure, such as Chord (ring), CAN (multidimensional space), and Tapestry (hypercube).

There is no essential difference between Partially Centralized and Purely Decentralized, but Hybrid Decentralized has long been criticized for the performance and availability problems caused by central nodes. Judging from the practices of Google and current open source systems, Hybrid Decentralized is a practical way to build reliable distributed systems at relatively low cost (in terms of code quality). However, as problem scale grows further, the central node does indeed become a bottleneck (which then requires using Federation to distribute the central node again), and the performance problems caused by the central node remain difficult to solve.

The organization of distributed nodes necessarily requires communication support. For a system with a central node, this problem is relatively easy to solve; you only need to additionally handle issues such as rate limiting and latency for the central node. For a system without a central node, the following problems need to be solved well:

Membership protocol: Which nodes are members of the current overlay network?
Naming service: In essence, an index from names to locations
Message routing: How do you deliver a message from one location to the destination location?
1. unicast
2. anycast
3. groupcast
4. atomic groupcast
5. total ordered groupcast

Node Collaboration

Because one of the problems we need to solve is capacity, there will inevitably be situations where the data we need to access is distributed across multiple nodes. In this case, nodes need to collaborate. At this point, basic components from traditional parallel computing need to be reimplemented in the distributed domain:

Critical sections and locks
Atomicity and transactions

Currently, solving this problem mainly relies on distributed transactions, and industrial implementations are transitioning from 2PC to Paxos [1]. Compare-And-Swap operations across shards do not exist [2]. Read-only or write-only distributed transactions may be optimized. There is a summary of this discussion in CAP, ACID, what can we do?.

Data Replication and Consistency

To support High Availability, distributed systems generally replicate data, but this may introduce consistency problems. Traditional consistency problems already appeared in parallel computing; the cache of a multicore CPU is a typical example. However, consistency problems have more difficulties in the distributed domain. This difficulty comes from the fact that the communication networks between physical machines are much less reliable than the communication inside a physical machine (especially between CPU cores, and between cores and memory or cache). There is a summary of this discussion in CAP, ACID, what can we do?.

Fundamental Challenges of Distributed Systems

In my view, all theoretical challenges in distributed systems originate from these two key points:

There is no global clock in a distributed system.
Communication is unreliable in a distributed system. (The communication latency required to guarantee that messages are neither duplicated nor lost is finite but unbounded.)

Because of these two points, we cannot build a total order of events:

We cannot order events by timestamp.
We cannot order events through communication (within bounded latency).

The total ordered groupcast, distributed locks, and transactions mentioned above, as well as problems not mentioned such as node failure detection, are all affected by this fundamental problem (event ordering).

Industrial Applications of Distributed Systems

This section only discusses relatively low-level systems.

Distributed Storage

Distributed storage systems were among the earliest distributed products to enter the public eye. Under the banner of NoSQL, they seized a large share of the traditional database market.

The essence of all storage systems is data plus indexes on the data.

The simplest distributed storage system is key-value storage. Its difficulty lies in how to build the index for keys and in the consistency problems that all distributed systems must face. Mainstream key-value stores use consistent hashing to build the index for keys. The problems in this area mainly focus on how to perform scans and on secondary indexes that may need to be extended. For related discussion of hash tables, see Hash table summary and discussion of advanced topics. Some distributed systems also build the index for keys using a tree structure (BigTable).

The consistency problem in distributed storage systems depends on the requirements of the business scenario, so I will not expand on it here. However, how to switch smoothly between different consistency levels is also a hot topic shared by academia and industry. In particular, under weak consistency, data conflicts may occur. How to provide business-friendly conflict resolution is also a good direction. For research on conflict resolution, see Paper notes: [Inria RR-7506] A comprehensive study of Convergent and Commutative Replicated Data Types and Paper notes: [ICDE'18] Anna: A KVS for any scale.

There are also some problems that both single-machine and distributed storage systems must face: performance optimization.

Performance optimization for special scenarios such as read-heavy/write-light and read-light/write-heavy workloads
Optimization of throughput and latency in random read/write and sequential read/write scenarios
Trade-offs and optimization for row-oriented storage and column-oriented storage

Although distributed storage initially raised the banner of NoSQL and disrupted traditional databases, it still needs to solve problems that traditional databases have already solved. One very important point is schema. With a schema, the system can understand the meaning of a value rather than just treating it as a blob. Only by understanding the meaning of a value can the system do more specialized things and further improve performance.

Some storage systems have also begun trying to add some compute functionality, but the implementation methods are still relatively primitive. Basically, they push computation tasks down to the nodes where the data resides and then aggregate the results. In my view, complete separation of storage and compute is the mainstream direction for the future. The advantage of the implementation method mentioned above is low latency, but distributed computing combined with systems that schedule based on metadata can fully achieve this effect and also bring additional benefits.

In addition, there are optimizations for special scenarios. One is OLAP, as well as the integration of OLAP and OLTP. Another is scenario-specific optimization for time-series data.

Distributed Computing

The earliest widely known distributed computing framework may have been MapReduce (not counting MPI). The MapReduce model is relatively simple and has difficulty meeting complex business requirements, so systems such as Tez and Spark emerged. In practice, the RDD+DAG model can already express most business logic well.

There is no essential difference between stream computing and batch computing; batch computing can basically be regarded as a special case of stream computing. However, due to historical reasons, current batch computing frameworks perform better than stream computing frameworks when doing batch computation. If stream computing does not care about data order, it can process data using micro-batches. If stream computing does care about data order, the problem becomes slightly more complicated. The difficulty is that, from the data perspective alone, when data is partitioned, it is finite but unbounded. But from the business perspective, if the data latency is large enough, the data can be considered discardable (meaning that such data does not exist). Therefore, a hard boundary can be forcibly drawn on the time limit. From the business perspective, data latency will most likely fall within a soft boundary. Data can be partitioned by this boundary, and data beyond the soft boundary can go through a side path for remediation, thereby keeping latency within the soft boundary with high probability. Batch computing can be viewed as stream computing over data with clear boundaries.

There are still business requirements that demand computation models more complex than Dataflow. The most general model is allowing users to write their own code with arbitrary complexity—that is, support for managed services. Currently, k8s has native support for managed services, while YARN has less support in this area and lacks an outstanding general-purpose framework. For some other widely used computation models, such as graph computing and machine learning, some computing frameworks already provide support.

People’s views on locality have also changed over time. In the earliest days, because distributed storage was lacking, people had no choice but to bind data and computation together. Another idea proposed by MapReduce is moving computation to the data, assigning computation to the nodes where the data is located as much as possible. This idea had actually long been used in HPC as well. However, as network stack speeds improved dramatically, people found that locality is no longer as important as before (of course, it is still important):

The ratio between compute and storage is difficult to determine in advance.
Network bandwidth within a TOR no longer has an overwhelming advantage over network bandwidth within a POD.

On the other hand, as cluster size grows and clusters iterate, it becomes difficult to ensure that all machines in a cluster are homogeneous (which was easy to guarantee in the HPC era). As a result, the demand for heterogeneous scheduling gradually becomes stronger and outweighs the demand for locality. In the end, scheduling is separated from computing frameworks and exists as relatively independent middleware.

At present, the mainstream computing models in distributed computing already have usable computing framework support. Subsequent problems mainly concern how to further improve performance, in terms of both throughput and latency. In my view, performance optimization in recent years has mainly involved bringing back features that already exist in the database field, so we should pay more attention to database journals such as VLDB. Other optimization points lie in problems unique to distributed systems, especially those caused by the lower reliability of system components compared with traditional HPC.

Distributed Scheduling

Because large amounts of commodity hardware are used instead of HPC systems, machines within a cluster start to become heterogeneous, and users' jobs are also heterogeneous. Therefore, the requirements for scheduling have increased significantly. Distributed scheduling mainly has the following development directions:

Fine-grained scheduling
Scheduling for special scenarios
Pluggable and flexible scheduling

Fine-grained scheduling is reflected in the following aspects:

Scheduling granularity becomes finer (whole machine → Slot → Fine Grain).
The types and dimensions of resources that need to be scheduled are growing (elastic/non-elastic, CPU/MEM/GPU/I/O/…).
The conditions of the physical machines being scheduled become more complex (resource heterogeneity, affinity, constraints, …).
Scheduling considers the state of the target machine (opportunistic scheduling, overselling, …).
Scheduling considers global policies.

Scheduling for special scenarios can sometimes be optimized greatly for performance, and sometimes has special constraint requirements, such as Gang Scheduling and scheduling with deadline requirements.

The way distributed clusters are currently used is also very different from HPC. It is no longer the case that one job runs to completion before another job runs; instead, many types of jobs run together. In this case, different jobs have different scheduling requirements, so a pluggable and flexible scheduling mechanism is needed.

Overall, the development trend in scheduling requires more global information. At present, because there is no support from a mature distributed database solution, people generally choose centralized global information solutions, which in turn limits cluster scale. To solve the problem of increasing cluster scale, Sharding and Facade approaches are generally used, but there are still problems that are difficult to solve.

Here is a bit of personal reflection: is scheduling suitable for optimistic concurrency? If scheduling uses optimistic concurrency, the consistency requirement would be lower, and then this problem could be solved without using a distributed database.

Other Middleware

In addition to the larger areas mentioned above, there are also directions that focus on solving specific detailed problems.

The first problems to solve are node organization (including methods such as consistent hashing) and message routing (including Sharding). Well-known middleware in this area includes Consul and Akka. After nodes are organized, they may also fail, so middleware for failure detection (Failure Detector) is needed, but this is a difficult problem in distributed systems. Some people then focus on solving reliable transport. The development trend in this area is to invert the protocol stack and make the transport layer, rather than the application layer, the top layer; Kafka is an example. Another trend is that Message Brokers are increasingly moving toward general-purpose storage, and the boundary between the two is becoming less distinct. For example, k8s uses etcd for "message passing." Locks, distributed transactions, and leader election are the same thing. The currently stable implementations are variants of Raft, and the trend is to use Paxos because there is more room for optimization (Raft is equivalent to a serialized version of Paxos).

Another problem brought by distribution is that DEBUG becomes even more difficult, because a user request is spread across different machines. To solve this problem, we use three different methods (Observability 3 ways: logging metrics and tracing):

Collect logs. Logs include the most detailed information on a machine.
Collect metrics. Because metrics are just numbers, they are convenient for statistics.
Do tracing, connecting the entire process of a user request.

Besides these, there are many smaller areas that I will not introduce one by one here.

References

[1] GRAY J, LAMPORT L. Consensus on Transaction Commit[J]. 2004, 1(April 2004).
[2] HERLIHY M. Wait-free synchronization[J]. ACM Transactions on Programming Languages and Systems, 1991, 13(1): 124–149.