Basic Methods for Single-Node Storage Engines

This is a summary and analysis of the basic ideas behind single-node storage engines.

After reading the Facebook Haystack paper [1], I found the design of its single-node object storage engine very concise, which sparked my interest in the designs of other single-node engines.

After reading [2], I gained a very rough understanding of single-node storage engines. First, the content that needs to be handled falls into two categories: data and its indexes. At this point, there are two approaches:

Store data and indexes together (clustered index), such as Berkeley DB, InnoDB, and LevelDB
Store data and indexes separately, such as MyISAM and Facebook Haystack

For indexes, considering only exact-match indexes, they can be categorized as follows:

Flat structures
- Hash tables, such as Bitcask
Tree structures
- The B-tree family, such as Berkeley DB, InnoDB, and MyISAM
- LSM trees, such as LevelDB
- Fractal trees, such as PerconaFT
- R-trees, such as PostGIS
- T-trees (pure in-memory structures), such as EXtremeDB

Approach Comparison (Personal View)

The advantages of storing data and indexes together are:

After the index locates the key, the value can be retrieved directly without another I/O operation
Index and data consistency is high
If the index is ordered, the data is also stored in order, making scans efficient

The advantages of storing data and indexes separately are:

Data and indexes can be organized in different ways
The index size is significantly reduced

If the entire index can fit in memory, a clustered index should not be used. When there are many write requests, it is best not to use a clustered index. If scan requests need to be supported and the data is stored on HDDs, a clustered index should be considered. If startup time matters, a clustered index should be used. See Update 1 for a detailed explanation.

For indexing approaches, we can see that very few storage engines use hash tables as indexes. In my opinion, this is because a hash table must be loaded entirely into memory and is difficult to persist incrementally. This is unacceptable for both traditional database scenarios and file system scenarios because it consumes too much memory. B-trees and LSM trees are popular also because indexes can be partially loaded into memory: hot indexes can be loaded into memory while the rest remain on disk, leaving the remaining memory available for other purposes—or, at worst, for caching hot data.

LSM trees are an indexing structure that has emerged in recent years. Compared with B-trees, their advantage is higher read and write performance. This benefits from the fact that split and merge operations can be performed in the background. The drawback of LSM trees is write amplification, which is improved in the paper [3]. Another issue with LSM trees is that when there are many write requests, background merging may not be fast enough to complete operations such as splitting and merging, causing overall performance to continue degrading.

Almost no storage engines adopt the Facebook Haystack approach of incrementally persisting in-memory index information by asynchronously writing to an Index File. This is because most storage engines are designed for small values. In Facebook Haystack’s scenario, the average value is about 80 KB, so a 100 GB database can store only about 1.3 million objects, and its index size should not exceed 10 MB. Common database usage scenarios should not be like this, although I do not have statistical data. Personally, however, I think that for pure key-value store scenarios (where no computation such as joins needs to be performed on the node), a relatively large index size is acceptable—for example, a 10 GB index. In such cases, Facebook Haystack’s design is still very attractive: a write request can be completed with the equivalent of one synchronous sequential write operation. If index memory usage is a concern, an LSM tree can also be used for indexing, but separating indexes from data still provides benefits; see the paper [4].

Update 1

I strongly agree with the view of @CatKang in the comments: the choice between separating data and indexes or storing them together should consider three dimensions: index characteristics, object size, and request characteristics.

Since two Zhihu users have already raised some questions about my view on how to choose a clustered index, I would like to expand this discussion further and provide the specific reasons behind my recommendation.

If the entire index can fit in memory, a clustered index should not be used: Using a clustered index may very likely make it impossible to load the entire index into memory, which is somewhat not worth the trade-off in this situation. A fully in-memory index has high performance and very few restrictions on the index structure. If possible, a fully in-memory index should be used as much as possible.
When there are many write requests, it is best not to use a clustered index: This does not mean clustered indexes should never be used; rather, it means considering a non-clustered index. Without a clustered index, the storage layout of written data can be organized more freely, thereby achieving higher write performance—for example, sequentially writing a single file as Haystack does, or writing multiple files in parallel to different cells on an SSD. Using a clustered index can avoid some problems introduced by non-clustered indexes, such as possible scan performance degradation, wasted space in deletion scenarios, and crash consistency issues between indexes and data. Inevitably, however, data writes are constrained by the organization of the index. With B-trees, random writes cause performance degradation. With LSM trees as indexes, in my personal view, this merely delays random writes to background threads. The cost of converting random writes into sequential writes is write amplification, and under sustained high-write workloads, compaction and write requests also contend for write bandwidth.
If scan requests need to be supported and the data is stored on HDDs, a clustered index should be considered: For scan requests, it is best for data to be laid out in order, so that one scan accesses exactly one contiguous region, which provides high performance. In this case, if the index itself is ordered, then using a clustered index provides the most efficient support for scan requests.
If startup time matters, a clustered index should be used: Non-clustered indexes have crash consistency issues between indexes and data. In crash recovery scenarios, the latest state of the index must be recovered from the data, increasing startup time.

References

[1] Beaver, D., Kumar, S., Li, H. C., Sobel, J., Vajgel, P., & Facebook, I. (2010). Finding a Needle in Haystack: Facebook’s Photo Storage. In Proc. USENIX Symp. Operating Systems Design and Implementations (OSDI’10) (pp. 1–14).
[2] Martin Kleppmann. 2016. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (1st ed.). O’Reilly Media, Inc.
[3] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. 2017. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the 26th Symposium on Operating Systems Principles - SOSP ’17, 497–514. DOI:https://doi.org/10.1145/3132747.3132765
[4] Thanumalayan Sankaranarayana Pillai Lanyue Lu and Remzi H. Arpaci-Dusseau Andrea C. Arpaci-Dusseau. 2016. WiscKey: Separating Keys from Values in SSD-Conscious Storage. Fast ’16 13, 1 (2016), 1–28. DOI:https://doi.org/10.1145/3033273