Large-Scale Cluster Management and Operations Automation

When discussing large-scale cluster management, operations automation is unavoidable. One person can operate and maintain a few hundred machines by hand, but as the cluster grows, one person has to operate tens of thousands of machines, and that is only possible with automation. This is also one reason many companies are willing to move to the cloud: after the move, most operations problems are handled by the cloud platform. It is also why cloud platforms sell so well despite being so expensive.

What kind of problem is operations automation, exactly? I think it can be broken down into the following subproblems:

Automatically detect faults
Automatically remediate faults
Safety (in the sense of operational safety, not security)

Each of these problems is complex and difficult, and far less simple than it appears on the surface. What follows is only a brief sketch of a few ideas.

Automatic Fault Detection

First, let us simply divide faults into two categories: hardware faults and software errors.

Hardware Faults

Commercial hardware usually provides interfaces for retrieving fault information generated by hardware self-tests, such as S.M.A.R.T. for hard disks. Fault information reported through these interfaces is relatively accurate. However, hardware can also fail in ways that its self-test does not detect. Such faults still affect the software running on the machine and have to be detected by other means. They are usually attributed to hardware only after the business side observes an anomaly and rules out software errors, or after statistics point to the machine. Identifying the specific hardware problem still requires further offline testing.
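As a rough illustration of deriving the conclusion from statistics, the sketch below flags machines whose task failure rate is far above the fleet median. The function name, the input dictionaries, and the thresholds (factor, min_runs, the 1% floor) are all assumptions made for this example, not something prescribed above; a real system would tune them and would likely use a more careful statistical test.

```python
from statistics import median

def suspect_machines(failures, runs, factor=5.0, min_runs=20):
    """Return hosts whose task failure rate is far above the fleet median.

    failures / runs: dicts mapping host name -> counts (hypothetical inputs).
    """
    # Ignore hosts with too few runs to say anything meaningful.
    rates = {h: failures.get(h, 0) / runs[h] for h in runs if runs[h] >= min_runs}
    if not rates:
        return []
    baseline = median(rates.values())
    # Floor the baseline so a fleet with a zero median still has a usable threshold.
    threshold = max(baseline, 0.01) * factor
    return sorted(h for h, r in rates.items() if r > threshold)

# Made-up sample data: m3 fails far more often than its peers, m5 has too few runs.
runs = {"m1": 100, "m2": 100, "m3": 100, "m4": 100, "m5": 5}
failures = {"m1": 1, "m2": 2, "m3": 30, "m4": 0, "m5": 5}
suspects = suspect_machines(failures, runs)
print(suspects)
```

A machine flagged this way is only a candidate: it would still be pulled for the offline testing mentioned above before being declared faulty.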

There is another special category of hardware fault that does not originate on the local machine but can still affect services: network faults. Network faults can generally be classified by their symptoms as follows:

Incorrect transmitted content, such as bit flips
Massive packet loss, such as a sudden surge of CRC errors reported for packets that are actually intact (possibly a hardware fault, possibly a kernel error)
Reduced bandwidth, such as a switch fault
Lost connections, such as a switch fault

Because network faults do not necessarily occur on the local machine, they are difficult to locate from a single machine. They are usually discovered through PingMesh combined with other signals, such as traffic monitoring, key kernel metrics, and key switch metrics. A more recent direction is to combine SDN and machine learning to discover and resolve network problems.
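The idea of locating a fault from many vantage points rather than one can be sketched as follows. This is not PingMesh itself, only a minimal illustration under assumed inputs: each probe is a hypothetical (src, dst, ok) tuple, and a host that appears in a disproportionate share of failed probes is ranked as the likely culprit.

```python
from collections import defaultdict

def rank_suspect_hosts(probes, min_probes=3):
    """probes: iterable of (src, dst, ok). Returns [(host, failure_rate)], worst first."""
    total = defaultdict(int)
    failed = defaultdict(int)
    for src, dst, ok in probes:
        # A failed probe counts against both endpoints; the faulty one
        # accumulates failures from every peer, the healthy ones only from it.
        for host in (src, dst):
            total[host] += 1
            if not ok:
                failed[host] += 1
    ranked = [(h, failed[h] / total[h]) for h in total if total[h] >= min_probes]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

hosts = ["a", "b", "c", "d"]
# Simulated full mesh: every probe that touches "c" fails, all others succeed.
probes = [(s, d, "c" not in (s, d)) for s in hosts for d in hosts if s != d]
ranking = rank_suspect_hosts(probes)
print(ranking[0])
```

The same aggregation could be keyed by rack or switch instead of host, which is where the traffic and switch metrics mentioned above would help narrow the result.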

Software Errors

Software errors come in all kinds of forms. A general-purpose mechanism for detecting and tolerating every fault is complex and difficult to build (consider, for example, the Byzantine generals problem), so users must be able to extend the fault discovery rules flexibly and conveniently. In general, software errors include the following types:

Operating system errors
Dependency environment errors
Data errors
Configuration errors
Bugs in the software itself
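One way to let users extend fault discovery rules is a small rule registry. The sketch below is only an assumption about what such an extension point might look like: the rule decorator, the RULES table, the facts dictionary, and the two sample rules are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List, Optional

# A rule inspects a snapshot of machine facts and returns a message or None.
Rule = Callable[[dict], Optional[str]]
RULES: Dict[str, Rule] = {}

def rule(name: str):
    """Decorator that registers a user-defined detection rule under a name."""
    def register(fn: Rule) -> Rule:
        RULES[name] = fn
        return fn
    return register

@rule("disk_space")
def check_disk(facts: dict) -> Optional[str]:
    if facts.get("disk_free_pct", 100) < 5:
        return "less than 5% disk space free"
    return None

@rule("config_version")
def check_config(facts: dict) -> Optional[str]:
    if facts.get("config_version") != facts.get("expected_config_version"):
        return "configuration version mismatch"
    return None

def detect(facts: dict) -> List[str]:
    """Run every registered rule; a crashing rule is reported, not fatal."""
    findings = []
    for name, fn in RULES.items():
        try:
            message = fn(facts)
        except Exception as exc:
            message = f"rule crashed: {exc!r}"
        if message:
            findings.append(f"{name}: {message}")
    return findings

findings = detect({"disk_free_pct": 2, "config_version": "v7",
                   "expected_config_version": "v7"})
print(findings)
```

Adding a new kind of check then only requires writing one more decorated function, without touching the detection loop.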

Automatic Fault Remediation

A general-purpose solution to every problem (most likely) does not exist, but we also cannot tailor a remediation method to every special fault. For hardware failures, the only option is to replace the machine. For operating system problems, a reboot usually resolves the issue; if it does not, the system may need to be reinstalled. Reinstallation also covers system upgrades (sometimes the kernel has to be upgraded to resolve the problem). Data errors have to be caught by verification with strong checksums; exactly how strong the checksum must be, and how much it must cover, is a matter on which opinions differ. Configuration errors and bugs in the software itself are really problems for deployment to solve: if the problem is discovered in time, it can be resolved simply by aborting the deployment and rolling back. As for how to deploy, that is another major topic worthy of its own article :D
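The reboot, reinstall, replace progression described above amounts to an escalation ladder. The following is a minimal sketch of that idea with simulated actions; remediate, the action functions, and the state dictionary are hypothetical stand-ins for whatever a real platform would call.

```python
def remediate(host, actions, is_healthy):
    """Try actions in order, mildest first; stop as soon as the host is healthy.

    actions: list of (name, callable) pairs; is_healthy: callable(host) -> bool.
    Returns (names of the actions attempted, whether the host recovered).
    """
    attempted = []
    for name, action in actions:
        action(host)
        attempted.append(name)
        if is_healthy(host):
            return attempted, True
    return attempted, False

# Simulated machine whose fault survives a reboot but not a reinstall.
state = {"m42": "broken"}

def reboot(host):
    pass  # in this simulation a reboot does not fix the fault

def reinstall(host):
    state[host] = "ok"

def replace(host):
    state[host] = "ok"

ladder = [("reboot", reboot), ("reinstall", reinstall), ("replace", replace)]
steps, recovered = remediate("m42", ladder, lambda h: state[h] == "ok")
print(steps, recovered)
```

In practice each step would also have to pass the safety checks discussed in the next section before it is allowed to run.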

Safety

The biggest concern with automation is safety. Failing to guarantee it can lead to terrible consequences, such as reinstalling the systems on tens of thousands of machines in an instant. In large-scale cluster management, safety really means safety rules, and what those rules protect is the SLA. A theoretical upper bound on the expected SLA is usually derived from MTBF (Mean Time Between Failures). Automatic remediation lowers the theoretical MTBF, because the fault discovery mechanism brings more kinds of faults into the model. Restarting services, rebooting machines, reinstalling systems, and replacing machines all interrupt service, so we can assume that the services on a machine are unavailable while it is being remediated. The core of safety is then to sequence remediation so that the number of machines still available always stays above a certain threshold (and satisfies any additional rules).
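A safety rule of this kind can be sketched as a gate that every remediation must pass through. The class below is a minimal, single-process illustration under stated assumptions: RemediationGate, total, and min_available are hypothetical names, only machines under remediation are counted as unavailable, and a real cluster would need a replicated, fault-tolerant version of this counter.

```python
import threading

class RemediationGate:
    """Grants remediation only while enough machines would remain available."""

    def __init__(self, total, min_available):
        self.total = total
        self.min_available = min_available
        self.unavailable = set()
        self._lock = threading.Lock()

    def try_acquire(self, host):
        """Return True if host may be taken out of service now."""
        with self._lock:
            if host in self.unavailable:
                return True  # already counted as out of service
            remaining = self.total - (len(self.unavailable) + 1)
            if remaining < self.min_available:
                return False  # taking this host down would break the rule
            self.unavailable.add(host)
            return True

    def release(self, host):
        """Mark host as back in service once remediation has finished."""
        with self._lock:
            self.unavailable.discard(host)

# Ten machines, of which at least eight must stay available at all times.
gate = RemediationGate(total=10, min_available=8)
granted = [gate.try_acquire(h) for h in ("m1", "m2", "m3")]
gate.release("m1")
after_release = gate.try_acquire("m3")
print(granted, after_release)
```

With such a gate, a faulty detection rule that flags every machine at once can take down at most two machines in this example, instead of the whole cluster.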