Paper Note

Paper Note: [OSDI'14] F4: Facebook's Warm BLOB Storage System

This article summarizes the architectural design of Facebook F4 for warm data object storage, along with its erasure coding and cross-data-center fault-toleran…

F4 is an object storage system developed by Facebook to reduce storage costs. It is used in scenarios where data is read-only, deletable, and not writable.

After introducing F4, Facebook’s overall object storage architecture is shown in the following figure:

Overall BLOB storage architecture

Haystack is responsible for storing Hot objects, while F4 is responsible for storing Warm objects. Above them, the Router Tier routes requests correctly to the appropriate system. Because Facebook’s object storage is mainly used for photos and videos, and the primary use case is the Feed, an object’s temperature is highly correlated with its creation time. In Haystack, objects are written sequentially into Physical Volumes. When a Physical Volume is full (for example, 100 GB), it is marked read-only but remains deletable. After some time, the BLOBs in the entire Physical Volume are no longer Hot. At this point, the entire Physical Volume is moved into the F4 storage system to continue serving requests. In F4, a Physical Volume is split into several pieces (depending on the requirements of the Erasure Encoding in use). Parity is generated through Erasure Encoding, producing multiple Blocks, which are distributed across different nodes. Two different Blocks in different DataCenters are XORed once, and the result is stored in another DataCenter, as shown in the following figure:

Geo-replicated XOR Coding

If Reed-Solomon(10, 4) coding is used, each Physical Volume is first expanded by a factor of 1.4, and then two Physical Volumes are expanded by the size of one additional Physical Volume. Therefore, on average, each BLOB is amplified by \(1.4 + 1.4/2 = 2.1\) times. The architecture of the F4 system is shown in the following figure:

F4 single cell architecture

The processing flow for a read request is as follows:

  1. Find the corresponding Storage Node through the Name Node.

  2. Find the corresponding Block and Offset through the Index API.

  3. Read the corresponding Offset of the corresponding Block through the File API.

  4. If this Block has failed, read from the Backoff Node. The Backoff Node contacts any 10 Blocks + Parity to recover a single Block.

Some other details related to error recovery:

  • Recovery of an entire Physical Volume is performed on the Rebuilder Node.

    • All known Blocks + Parity can be obtained for recovery.

    • If that is not possible, recovery can also be performed through XOR.

  • Under the coordination of the Coordinator Node, Blocks are distributed to different Storage Nodes according to the Placement Policy.

Because F4 only needs to handle read-only cases, its design is relatively simple. New write requests go to Haystack, and the corresponding BLOB in F4 is then deleted (only the index needs to be deleted), because a burst of hot-serving traffic will arrive immediately after the write (when it appears in friends' Feeds).

References

  • [1] Subramanian Muralidhar, Wyatt Lloyd, Southern California, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. 2014. f4: Facebook’s Warm BLOB Storage System. Osdi’14 (2014), 383—​398. Retrieved from https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muralidhar

  • [2] Beaver, D., Kumar, S., Li, H. C., Sobel, J., Vajgel, P., & Facebook, I. (2010). Finding a Needle in Haystack: Facebook’s Photo Storage. In Proc. USENIX Symp. Operating Systems Design and Implementations (OSDI’10) (pp. 1–14).