Paper Note: [OSDI'10] Finding a Needle in Haystack: Facebook's Photo Storage

Haystack is a hot storage system specially optimized by Facebook for its photo storage scenario. Its system interface is a simple key-value store. The paper mainly describes the construction of its single-machine storage engine, optimized for a write-once, read-many, never-modified workload. This note only covers the single-machine storage engine portion of the paper.

Facebook’s photo storage scenario has the following characteristics:

Newly written photos quickly become very hot, and then gradually cool down over time (the feed scenario)
One written photo produces four different sizes (Thumbnails, Small, Medium, Large), and most read requests are for the Small size
Read requests far outnumber write requests, and once written, photos are not modified and are rarely deleted
Hot requests are already handled outside the system by an SDN, so most cache schemes are not very effective
The system mainly handles long-tail requests or requests that have not yet been pushed to the SDN
The average object size is 64KB ~ 85KB (according to the data given in §1)

Before Haystack, Facebook served photos using NFS + NAS. Several NAS servers provided storage service, with one file per photo. WebServers mounted these NAS servers through the NFS protocol. The problem with this approach was that reading one photo required three random disk accesses:

Read the file system index to find the inode
Read the inode to find the location of the actual data
Read the actual data

Although Facebook made many attempts, because most user requests were long-tail requests, attempts to reduce this process through caching basically failed. In this situation, the best choice was to reduce the size of the metadata and then load all metadata into memory. Note that this approach is feasible because:

Data is written once, read many times, never modified, and rarely deleted
No memory access control is needed
No structure is needed

Haystack’s architecture is relatively simple and consists of three parts: Haystack Store, Haystack Directory, and Haystack Cache. The paper focuses on the construction of the Store component and describes the other two components less. This is understandable, because the problem is mainly about how a single-machine key-value storage engine can be optimized for this special scenario.

The Haystack Store structure, from top to bottom, is as follows:

Hashtable (although the paper does not say this directly, it is mentioned in §3.6.2)
Index File (used to quickly rebuild the Hashtable, §3.4.4)
Physical File (about 100GB in size, storing the actual data, §3.4)
XFS (1GB extents, §3.4.5)
RAID-6 (256KB stripe size, §3.4.5, §4.4.6)

The flow of a write request is roughly as follows (some parts are my own inference, because some details are not written in the original paper):

Look up the in-memory table by Logical Volume ID to directly find the file handle
If the file is not large enough, pre-allocate it
Sequentially write the content to the end of the file (it is unclear here how concurrency is handled)
Update the in-memory Hashtable and respond to the client with success
Asynchronously update the Index File (it is unclear here how concurrency is handled, because in step 3 it is very likely that a smaller object completes first; in that case, how is the order of needles in the Index File guaranteed to be consistent with the order of needles in the Physical File?)

There are two possible guesses about concurrency handling:

The size of the data is already known in the write request, so the size of the needle can be derived. It is only necessary to maintain the offset of the write position, allowing multiple needles to be written concurrently
Serialize concurrent writes before processing them (would this be more friendly to HDD writes?)

A delete request only sets the deletion marker in the data area and adjusts the in-memory data structure, without touching the Index File. During reads, if an object is found to have been deleted, the in-memory mapping is adjusted again. GC is eventually completed through stop and copy. In the paper [2], it is mentioned that delete operations no longer depend on the delete markers in the Physical File and Index File, and instead directly operate on the Journal File and Hashtable.

Reference Implementation

https://github.com/chrislusf/seaweedfs

References

[1] Beaver, D., Kumar, S., Li, H. C., Sobel, J., Vajgel, P., & Facebook, I. (2010). Finding a Needle in Haystack: Facebook’s Photo Storage. In Proc. USENIX Symp. Operating Systems Design and Implementations (OSDI’10) (pp. 1–14).
[2] Subramanian Muralidhar, Wyatt Lloyd, Southern California, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. 2014. f4: Facebook’s Warm BLOB Storage System. Osdi’14 (2014), 383—398. Retrieved from https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muralidhar
[3] Pandian Raju, Rohan Kadekodi, Vijay Chidambaram, and Ittai Abraham. 2017. PebblesDB: Building Key-Value Stores using Fragmented Log-Structured Merge Trees. In Proceedings of the 26th Symposium on Operating Systems Principles - SOSP ’17, 497–514. DOI:https://doi.org/10.1145/3132747.3132765

Other Questions

Reference Implementation

References