How Facebook Keeps Millions of Servers Synced
If you’re running a distributed system, it’s incredibly important to keep the system clocks of the machines synchronized. If the machines are off by even a few seconds, this can cause a wide variety of issues.
You can probably imagine why unsynchronized clocks would be a big issue, but just to beat a dead horse…
And many more reasons.
Why do computers get unsynchronized
For timekeeping, the gold standard is an atomic clock. Atomic clocks have an error of roughly 1 second over a span of 100 million years. However, they’re too expensive to put in every machine.
Instead, computers typically contain quartz clocks. These are far less accurate and can drift by a couple of seconds per day.
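To put those two figures on the same scale, here is a rough back-of-the-envelope calculation. It assumes "a couple of seconds per day" means 2 seconds and uses a 365.25-day year; the helper name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def fractional_drift(error_seconds, interval_seconds):
    """Clock error expressed as a fraction of the elapsed time."""
    return error_seconds / interval_seconds


# Assumption: "a couple of seconds per day" is taken as 2 s/day.
quartz = fractional_drift(2, SECONDS_PER_DAY)
# ~1 second in 100 million years, as quoted above.
atomic = fractional_drift(1, 100e6 * SECONDS_PER_YEAR)

print(f"quartz drift: {quartz * 1e6:.1f} ppm")  # 23.1 ppm
print(f"atomic drift: {atomic:.2e}")            # 3.17e-16
```

Under these assumptions the quartz clock drifts about 23 parts per million, roughly ten orders of magnitude worse than the atomic clock.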
To keep computers synced, we rely on networking protocols like Network Time Protocol (NTP) and Precision Time Protocol (PTP).
Facebook published a fantastic series of blog posts delving into their use of NTP, why they switched to PTP and how they currently keep machines synced.
You can read the full blog posts below
We’ll summarize the articles and give some extra context.
Intro to Network Time Protocol
NTP is one of the oldest protocols still in use today. It’s intended to synchronize all participating computers to within a few milliseconds of UTC.
With NTP, you have clients (devices that need to be synchronized) and NTP servers (which keep track of the time).
Here’s a high-level overview of the steps for communication between the two.
These timestamps allow the client to account for the roundtrip delay and work out the difference between its internal time and that provided by the server. It adjusts accordingly and synchronizes itself based on multiple requests to the NTP server.
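As a minimal sketch of that calculation, assume the usual four NTP timestamps: the client's send time (t0), the server's receive time (t1), the server's send time (t2) and the client's receive time (t3). The function name and the example values below are hypothetical.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP on-wire calculation.

    t0: client clock when the request left the client
    t1: server clock when the request arrived
    t2: server clock when the reply left the server
    t3: client clock when the reply arrived
    """
    # Time on the wire: total elapsed time minus the server's processing time.
    delay = (t3 - t0) - (t2 - t1)
    # How far the server clock is ahead of the client clock,
    # assuming the outbound and return paths take equally long.
    offset = ((t1 - t0) + (t2 - t3)) / 2
    return offset, delay


# Made-up example: the client is 0.5 s behind the server, each network
# leg takes 20 ms, and the server spends 1 ms handling the request.
offset, delay = ntp_offset_and_delay(100.000, 100.520, 100.521, 100.041)
print(round(offset, 6), round(delay, 6))  # 0.5 0.04
```

The offset is only exact when the two network legs are symmetric; any asymmetry shows up directly as error, which is one reason clients repeat the exchange several times.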
NTP Strata
Of course, you can’t have millions of computers all trying to stay synced with a single atomic clock. It’s far too many requests for a single NTP server to handle.
Instead, NTP arranges its time sources in a hierarchy, where the machines in the NTP network are divided into strata.
And so on until stratum 15. Stratum 16 is used to indicate that a device is unsynchronized.
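The stratum rule can be sketched in a few lines, assuming the usual convention that a machine sits one stratum below the source it syncs from. The function and constant names are made up for illustration.

```python
UNSYNCHRONIZED = 16  # stratum value that signals "not synchronized"


def stratum_of(upstream_stratum):
    """Stratum of a machine, given the stratum of the source it syncs from.

    Pass None when the machine has no usable upstream source.
    """
    if upstream_stratum is None or upstream_stratum >= 15:
        # No source, or a source already at the last valid stratum.
        return UNSYNCHRONIZED
    return upstream_stratum + 1


print(stratum_of(0))     # 1: synced directly from a reference clock
print(stratum_of(1))     # 2
print(stratum_of(None))  # 16
```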
A computer may query multiple NTP servers, discard any outliers (in case of faults with the servers) and then average the rest.
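A simplified sketch of that idea is below. Real NTP implementations use more elaborate selection and clustering algorithms; this version just drops any offset that sits too far from the median and averages what is left. The function name and the 10 ms tolerance are assumptions chosen for the example.

```python
from statistics import mean, median


def combine_offsets(offsets, tolerance=0.010):
    """Average the server offsets (in seconds) that agree with the median.

    Offsets further than `tolerance` from the median are treated as
    outliers, for example a faulty server, and ignored.
    """
    mid = median(offsets)
    kept = [o for o in offsets if abs(o - mid) <= tolerance]
    return mean(kept)


# Three servers agree on roughly 2 ms; the fourth is off by 1.5 s.
combined = combine_offsets([0.0021, 0.0019, 0.0020, 1.5])
print(f"{combined * 1000:.2f} ms")  # 2.00 ms
```

Using the median as the reference point matters here: a plain average would be dragged to around 377 ms by the one bad server.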
Computers may also query the same NTP server multiple times over the course of a few minutes and then use statistics to reduce random error due to variations in network latency.
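One simple way to use repeated measurements, loosely modeled on NTP's clock filter, is to trust the sample with the smallest round-trip delay, since a short round trip leaves less room for queuing and path asymmetry. The sketch below assumes each sample is an (offset, delay) pair in seconds; the function name and values are hypothetical.

```python
def best_sample(samples):
    """Pick the (offset, delay) pair with the smallest round-trip delay."""
    return min(samples, key=lambda sample: sample[1])


# Made-up samples from one server over a few minutes.
samples = [
    (0.0051, 0.090),  # congested round trip
    (0.0020, 0.021),  # quickest round trip, most trustworthy
    (0.0034, 0.048),
]
offset, delay = best_sample(samples)
print(offset, delay)  # 0.002 0.021
```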
Here’s a fantastic article that delves into NTP
NTP at Facebook
Facebook’s NTP service was designed in four main layers
Credits - Meta’s Engineering Blog
In terms of the process that runs on servers to keep them synchronized, Facebook tested out two time daemons
Facebook ended up migrating their infrastructure to Chrony and you can read the reasoning here.
However, in late 2022, Facebook switched entirely away from NTP to Precision Time Protocol (PTP).
Precision Time Protocol
PTP was introduced in 2002 as a way to sync clocks more precisely than NTP.
While NTP provides millisecond-level synchronization, PTP networks aim to achieve nanosecond or even picosecond-level precision.
There are many things that can throw off your clock synchronization
PTP uses hardware timestamping and transparent clocks to better measure this network delay and adjust for it. One downside is that PTP places more load on network hardware.
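To see why the transparent clock correction matters, here is a rough sketch of PTP's delay request-response calculation. It assumes four hardware timestamps in nanoseconds (t1: server sends Sync, t2: client receives it, t3: client sends Delay_Req, t4: server receives it) plus the residence time that transparent clocks report for each message. The function name, parameter names and numbers are made up for illustration.

```python
def ptp_offset_and_delay(t1, t2, t3, t4, sync_correction=0, delay_req_correction=0):
    """Client clock offset and mean path delay, all in nanoseconds.

    The correction values stand for the time each message spent inside
    switches, as accumulated by transparent clocks along the path.
    """
    server_to_client = (t2 - t1) - sync_correction
    client_to_server = (t4 - t3) - delay_req_correction
    mean_path_delay = (server_to_client + client_to_server) / 2
    # Positive offset means the client clock is ahead of the server.
    offset = server_to_client - mean_path_delay
    return offset, mean_path_delay


# Made-up scenario: client is 150 ns ahead, the wire takes 500 ns each way,
# and switches hold the Sync for 2000 ns and the Delay_Req for 3000 ns.
t1, t2, t3, t4 = 1_000_000, 1_002_650, 1_010_000, 1_013_350

print(ptp_offset_and_delay(t1, t2, t3, t4, 2000, 3000))  # (150.0, 500.0)
print(ptp_offset_and_delay(t1, t2, t3, t4))              # (-350.0, 3000.0)
```

Without the corrections, the uneven time spent in switches is mistaken for clock offset, and the result even has the wrong sign in this example.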
Benefits of PTP at Facebook
Switching to PTP gave Facebook quite a few benefits
For more details on this, you can read the full post here.
Deploying PTP at Facebook
With PTP, Facebook is striving for nanosecond accuracy. The design consists of three main components.
PTP Rack
This houses the hardware and software that serves time to clients.
It consists of
PTP Network
Responsible for transmitting the PTP messages from the PTP rack to clients. It uses unicast transmission, which simplifies network design and improves scalability.
PTP Client
You need a PTP client running on your machines to communicate with the PTP network. Meta uses ptp4l, an open source client.