A Web Scale Case Study: Facebook as a File System
This week at OSDI (the research conference on Operating Systems Design and Implementation), Facebook presented the latest in a series of papers they have written over recent years documenting the evolution of their storage systems.
Given the enormous amount of excitement about "web scale" systems, and especially web scale storage products for the enterprise, I thought it might be interesting to look at these Facebook papers, and try to draw out a few observations about systems that pretty clearly define the web scale term.
f4: Facebook's Warm BLOB Storage System
In their paper at OSDI, the Facebook team motivates the development of a storage system that is specifically oriented toward what they call "warm" storage. The warm storage idea follows from the observation that most storage workloads are highly non-uniform: there is a small amount of hot data, and then a very large amount of data that is infrequently accessed. This large set of infrequently accessed data is sometimes referred to as the "long tail" of the access distribution, and it is a problem precisely because it is warm, and not completely cold.
Here's the working distinction between warm and cold: Cold data is stuff like backups. It's the sort of thing that services like Amazon Glacier have been developed for, where durable retention is important but where applications are willing to wait tens of minutes, or even hours, for access requests to be satisfied.
Warm data, on the other hand, needs to be available for live, production requests. However, the request rate per unit capacity (IOPS/TB) is very, very low. This problem characterization is one of the most interesting aspects of the paper: at web scale, Facebook is disaggregating many facets of a conventional storage system and implementing an appropriate solution for each of them. f4 stores about 65PB of infrequently accessed data, and is designed to do so in a way that achieves a high degree of durability at a very low cost. That Facebook is designing storage specifically for the tail of the curve is interesting, and I am an enormous fan of the use of performance-per-unit-capacity metrics, such as IOPS/TB, as a means of guiding storage system design.
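To make the IOPS/TB idea a bit more concrete, here is a tiny sketch of the metric itself. The numbers below are entirely made up for illustration; none of them come from the paper.

```python
# A rough sketch of the IOPS/TB way of thinking, using made-up numbers
# (none of these figures come from the f4 paper).

def iops_per_tb(peak_requests_per_sec: float, capacity_tb: float) -> float:
    """Peak request rate normalized by stored capacity."""
    return peak_requests_per_sec / capacity_tb

# A hypothetical "hot" tier: small, hammered by requests.
hot = iops_per_tb(peak_requests_per_sec=500_000, capacity_tb=500)      # 1000 IOPS/TB

# A hypothetical "warm" tier: huge, but touched rarely.
warm = iops_per_tb(peak_requests_per_sec=100_000, capacity_tb=50_000)  # 2 IOPS/TB

print(f"hot:  {hot:.0f} IOPS/TB -> wants flash")
print(f"warm: {warm:.0f} IOPS/TB -> dense spinning disks are enough")
```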
Facebook as a File System
So here's a fun way to think about Facebook: the whole application, all of Facebook, is really just a file system. It's graph-structured rather than directory-structured. It is browser-facing rather than presenting a command-prompt interface (although if you really want a command line...). Also, it's not optimized for things like deleting or overwriting data, but otherwise it is really just a very large conventional storage system: a giant closet that you and your contacts put data into and sometimes take data back out of.
Thinking about Facebook as a file system, it's interesting to observe how each of the individual storage-related components within a typical operating system has been broken out and implemented inside the system. It's neat to see how each of these modules (like the namespace, or the buffer cache) is implemented as an enormous single-purpose distributed system in an architecture like Facebook's.
Let's consider some of these components:
File system metadata is implemented by a completely independent service. TAO, Facebook's system for storing the social graph, is a separate service that stores relationships between objects and manages access control, notifications, and so on. The fact that social data is a graph, instead of a traditional file system hierarchy, makes it significantly harder to implement at scale. This is where Facebook's engineers are unafraid of throwing enormous amounts of RAM (through memcached), and presumably also quite a lot of flash, at making a big random-access data structure perform well.
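As a toy illustration of what graph-structured metadata looks like (this is my own simplification for the sake of the analogy, not TAO's actual API or data model), imagine objects plus typed associations between them, with a cache standing in front of a slower backing store:

```python
# A toy, in-memory illustration of graph-structured metadata: objects plus
# typed associations between them, with a cache in front of a slower store.
# This is a simplification for illustration, not TAO's actual interface.

from collections import defaultdict

class GraphMetadata:
    def __init__(self):
        self.objects = {}                    # object id -> attributes
        self.assocs = defaultdict(list)      # (id1, assoc_type) -> [id2, ...]
        self.cache = {}                      # stand-in for a memcached-like tier

    def add_object(self, oid, **attrs):
        self.objects[oid] = attrs

    def add_assoc(self, id1, assoc_type, id2):
        self.assocs[(id1, assoc_type)].append(id2)
        self.cache.pop((id1, assoc_type), None)   # invalidate on write

    def get_assocs(self, id1, assoc_type):
        key = (id1, assoc_type)
        if key not in self.cache:                 # cache miss -> backing store
            self.cache[key] = list(self.assocs[key])
        return self.cache[key]

g = GraphMetadata()
g.add_object("user:42", name="Alice")
g.add_object("photo:7", handle="https://cdn.example.com/photo7.jpg")
g.add_assoc("user:42", "posted", "photo:7")
print(g.get_assocs("user:42", "posted"))          # ['photo:7']
```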
Data, in this case the pictures, videos, and related objects that people post on Facebook, is stored completely separately from the namespace-level metadata. This is not a new idea -- cluster file systems have broken out metadata as an independent service for a long time -- but it is fascinating to see the degree to which the scale of Facebook has driven specialization in each of these two systems. Data is stored in a combination of Haystack (the older, existing bulk storage system) and now f4 for warm data. The optimization goals in designing these systems are reasonably clear: Haystack is designed to process writes sequentially and to serve any read with only a single disk seek. f4 is designed to match peak load (20 IOPS/TB of stored data) to appropriately sized spinning disks.
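To see how that capacity/performance matching works out, here is the back-of-the-envelope arithmetic. The 20 IOPS/TB and 4TB figures come from the paper and this post; the per-drive random IOPS number is just a rough rule of thumb for a 7200 RPM disk, not something the paper specifies.

```python
# Back-of-the-envelope check that warm data can live on dense spinning disks.
# 20 IOPS/TB and 4TB drives come from the f4 paper; ~100 random IOPS per
# 7200 RPM drive is a rough rule of thumb, not a figure from the paper.

peak_iops_per_tb = 20
drive_capacity_tb = 4
drive_random_iops = 100          # rough rule of thumb for a 7200 RPM SATA disk

peak_iops_per_drive = peak_iops_per_tb * drive_capacity_tb   # 80 IOPS
headroom = drive_random_iops / peak_iops_per_drive           # 1.25x

print(f"peak load per 4TB drive: {peak_iops_per_drive} IOPS")
print(f"headroom over a ~{drive_random_iops} IOPS drive: {headroom:.2f}x")
```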
Continuing the file system analogy, it's also fun to observe that in the context of Facebook, content distribution networks (CDNs) such as Akamai are effectively the file system page cache: The URIs that are fetched from metadata lookups in TAO and used as handles to access files from Haystack and f4 are transparently cacheable by CDNs. Just like the use of a page replacement strategy in RAM on a conventional OS, this approach allows the sizing and efficiency of the caching layer to be managed and diagnosed independently of the warm, durable data storage layer.
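Here is a minimal sketch of the read path that the analogy implies: a metadata lookup returns a URI, the CDN-like cache is consulted first, and only a miss falls through to the durable blob store. All of the names and dictionaries here are hypothetical stand-ins, not Facebook's actual interfaces.

```python
# A minimal sketch of the read path implied by the page-cache analogy.
# In-memory dicts stand in for the real metadata, CDN, and blob services.

cdn_cache = {}        # URI -> bytes, playing the role of the OS page cache

def lookup_uri(metadata_service, object_id):
    """Ask the metadata (TAO-like) tier for the blob's URI/handle."""
    return metadata_service[object_id]

def fetch_blob(uri, blob_store):
    """Serve from the CDN cache if possible; otherwise hit Haystack/f4."""
    if uri in cdn_cache:
        return cdn_cache[uri]      # cache hit: no I/O at the blob tier
    data = blob_store[uri]         # cache miss: one read from warm storage
    cdn_cache[uri] = data          # populate the cache for later readers
    return data

# Tiny usage example.
metadata = {"photo:7": "https://cdn.example.com/photo7.jpg"}
blobs = {"https://cdn.example.com/photo7.jpg": b"\x89PNG..."}
print(fetch_blob(lookup_uri(metadata, "photo:7"), blobs))
```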
Web scale doesn't mean hyperconverged
To be a bit clearer on this point, web scale doesn't mean hyperconverged at the scope of individual hosts. f4 is deployed at the granularity of what Facebook calls cells. A single cell is 14 racks, each containing 15 servers and associated network connectivity. With a total of 6.3K 4TB drives, an f4 cell has about 25PB of raw storage capacity.
But within the cell, the role of individual servers is non-uniform. There are additional physical servers that each serve a special purpose; in fact, there are servers whose only responsibility is to rebuild erasure-coded stripes of data in the face of drive, server, or rack failure. Moreover, the f4 cell provides only storage functionality: the application logic (hosted on Facebook's HipHop PHP runtime) is a separate bit of physical infrastructure.
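For a sense of how dense a cell is, the arithmetic from the figures above works out as follows (the drives-per-server number is simply derived from those same figures):

```python
# Raw capacity of an f4 cell, derived from the figures quoted above.
racks_per_cell = 14
servers_per_rack = 15
total_drives = 6300            # "6.3K drives"
drive_tb = 4

servers = racks_per_cell * servers_per_rack          # 210 servers
drives_per_server = total_drives // servers          # 30 drives each
raw_pb = total_drives * drive_tb / 1000              # 25.2 PB raw

print(servers, drives_per_server, raw_pb)            # 210 30 25.2
```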
This observation helps to clarify a claim that I've made in the past about hyperconvergence: it bugs me to see hyperconverged vendors arguing that fixed-ratio products, in which homogeneous hardware bricks each implement all the components of the enterprise stack (distributed storage, network, and VM hosting), are representative of how large-scale infrastructure is built. They aren't.
To me, convergence is the idea that the traditional roles of storage, networking, and compute are blending together and increasingly being implemented in software rather than being a bunch of isolated physical components. Facebook does a wonderful job of demonstrating this by blending a set of simple purpose-built services in an environment that in many cases is co-designed with the hardware that it runs on. Convergence is about building easy-to-manage and scalable component services that can be used by developers and operations people to build infrastructure that meets their needs, and to evolve as those needs change.
So what is web scale?
So at this point I'd like to draw a bridge between these two worlds. I doubt that terms like "web scale" and "hyperconverged" mean much to Facebook's engineering staff. To them these are marketing characterizations that enterprise vendors are applying to systems that are a lot smaller than the ones Facebook's staff is used to thinking about.
However, in the enterprise context I increasingly believe that the term web scale is meaningful, especially because it is aspirational. So here are a few properties of a web scale system like f4 that I think can be very usefully set as goals for enterprise IT products:
- Efficiency is important. As a rough approximation, a server in your datacenter costs as much to power and cool over three years as it does to buy up front. It is important to get every ounce of utility that you can out of it while it is in production. Today's hardware -- modern Xeons, PCIe flash devices, 10/40Gb networks -- is an incredibly dense source of performance and needs to be deployed in a balanced manner. The f4 paper's design discussion of IOPS/TB is a great example of this kind of thinking.
- Services live longer than hardware. Hardware is a necessary but distracting reality. Any time you need, as an operations person, to deal urgently with a hardware-related decision around failure, provisioning, or device end-of-life, something is wrong. Building infrastructure that is defined by software and that virtualizes the underlying hardware doesn't mean that these issues go away, but it does mean that they can either be automated (in the case of things like data migration) or made asynchronous enough that you can deal with them at your own convenience (in the case of hardware failure and provisioning).
- Services improve iteratively and in production. f4 currently hosts 65PB of data and has only been in production use for 19 months. That is an absolutely astounding fact. The authors of the paper imply a continuous pace of evolution and improvement to the system. The benefits of analyzing the production behavior of a deployed system and improving it continuously over a short cycle time are emblematic of how architectures at scales like Facebook's are different, and are a property that web scale enterprise products need to deliver.
- Operations focus should move up the application stack. The result of the two points above is that significantly less time should be spent dealing with the tedious minutiae of yesterday's datacenter. The fact that I can characterize Facebook's system as a set of simple services that look a lot like a file system reflects to me the fact that the operations people at Facebook are thinking at that level too. If web scale systems aspire to anything, it should be to broadening and improving the way that we deal with the applications and data that run on top of our infrastructure rather than staring into the navels of our infrastructure itself.
A final thought
At Coho, we are one of several enterprise companies that have recently been describing ourselves as taking a web scale approach to system design. One thing that I hope is very clear from looking at "real" web scale systems like Facebook's is that there is a relatively large distance between the scale of typical enterprise customers and that of a Facebook or Google. If Coho focused strictly on building systems at the scale of f4, we wouldn't have a very large pool of potential customers today. (Don't get me wrong: I'd be very happy to talk to you if you are interested in buying 14 racks of our product.)
That said, as a web scale enterprise product vendor, we do have one significant challenge that these systems don't face, and that challenge is packaging. In the past, once enterprise hardware has been sold, it has had a pretty hands-off relationship with the vendor that sold it and the development team that built it. The result is that systems evolve slowly, and must be built for the general case, with little understanding of the actual workloads that run on them.
Conversely, Facebook’s engineers have one customer, one deployment environment, and one set of workloads. They have a completely contained environment that lets them move fast, respond quickly, and specialize.
My experience so far is that the biggest challenge, and possibly also the biggest opportunity, in defining a web scale approach to enterprise IT is in bridging these two worlds. It's something that we at Coho are working really hard to do.
One last note: I've presented a very, very superficial characterization of f4, and a possibly incorrect characterization of many aspects of Facebook above. I'd encourage you to read the f4 paper for more details -- it's a fun read. Regarding the other point, I'm sure that people will correct me if I've said anything especially wrong or misleading about Facebook's internal systems.
Thanks for reading!
Figures in this post:
Top: The automatic inlaid machine of Fred Walton. https://commons.wikimedia.org/wiki/File:Automatic_inlaid_machine_of_Fred_Walton.jpg
The annotated diagram is based on Figure 1 of the Facebook f4 paper referenced inline above.
The second mechanical drawing is from "Histoire d'un inventeur: exposé des découvertes et des travaux de M. Gustave Trouvé dans le domaine de l'électricité" (1891). https://books.google.ca/books?id=UUMOAAAAYAAJ The image was taken from Danielle Morgan's blog post about making images from this work available on GitHub.