IIGR: “HUYGENS” - Exploiting a Natural Network Effect for Scalable, Fine-grained Clock Synchronization
Michael Orr
Who : Stanford: (Yilong Geng, Shiyu Liu, Zi Yin, Balaji Prabhakar, Mendel Rosenblum), Google: (Ashish Naik, Amin Vahdat)
Where: 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton); USENIX NSDI 2018
Original article here : Usenix '18/HUYGENS, Presentation/Video here: Usenix 18/HUYGENS Presentation
TL;DR for the genius-in-a-hurry:
High-accuracy clock synchronization between nodes is useful for many applications (e.g. distributed databases). The current state of the art presents a trade-off between accuracy and cost. HUYGENS achieves synchronization to within a few tens of nanoseconds using only the time-stamping capability already present in standard NICs. It does this using the following two main principles:
- Send pairs of probe-packets carrying their transmit time-stamps. The receiver compares the difference in reception times with the difference in transmission times; only pairs where the two closely match are used.
- Use the network effect to get better accuracy.
Consider Figure 2.
(a) Nodes A and B measure their clock difference as 20 units (in black), while in reality (in green) the difference is 10 units – but on their own they have no way to detect or fix the error.
(b) With pair-wise clock synchronization of 3 nodes – A, B and C – the measured differences around the cycle should sum to zero, but here they leave 10 extra time-units; distributing these "extra" units around the cycle gives better accuracy. (Two sample adjustments are shown.)
(c) With more nodes, accuracy gets much better: using 4 nodes, the adjusted clock difference between A and B comes out exactly right.
In HUYGENS, each node synchronizes with 10-20 randomly selected nodes.
The send and receive time-stamps are used to derive upper and lower bounds on the clock difference, the clock drift, and the one-way delay (OWD) for each pair.
Each measurement interval runs over 2 seconds, so clock-drift due to temperature changes is minimal.
HUYGENS achieves synchronization to within a few tens of ns using only existing NICs, even at network loads of 90%. This costs only about 0.05% of bandwidth and 0.5% of CPU.
HUYGENS gives the clock-difference as it was during the last 2-second interval. By extrapolating from the last few intervals, an estimate of the CURRENT, real-time difference can be made with only a minor loss of accuracy. The article calls this variant HUYGENS-R.
Article Summary
Having accurate (nanosecond level) clock synchronization between nodes in a network is very beneficial. The authors cite several examples to illustrate this.
- In finance, clock synchronization is crucial for determining transaction order: a trading platform needs to match bids and offers in the order in which they were placed, even if they entered the trading platform from different gateways.
- In distributed databases, accurate clock synchronization allows a database to enforce external consistency and improves its throughput and latency. The authors cite Google's "Spanner" global database, and run a test using CockroachDB, where a transaction has to be retried if it falls within the "clock uncertainty" period. On a 32-server cluster, reducing clock uncertainty from 1 ms to 10 μs and then to 10 ns reduced retry rates from 99.30% to 4.74% and then to 0.08%.
However, achieving accurate clock synchronization is hard. The current state-of-the-art methods are either cheap but of low resolution/accuracy (e.g. NTP), or need special hardware (e.g. IEEE 1588/PTP, GPS hardware, DTP) at each node, and hence are expensive. The article shows how to achieve good clock synchronization (to within a few tens of nanoseconds) without any special HW, for nodes in a data-center connected by a standard networking fabric.
HUYGENS is designed to work in datacenter networks whose topology is a fat-tree or Clos, where the number of hops between any two servers is similar in both directions (though they do NOT require that the same path be used both ways).
Principles of operation of HUYGENS
- They use packet-pairs, time-stamped by the sending and receiving NICs. The receiver checks whether the difference in reception times is similar to the difference in transmission times. If not, the probes either experienced different delays (e.g. one of the pair was queued along the way due to congestion) or took very different paths, and both packets of such a pair are discarded. This leaves only probes that traveled very similar (perhaps identical) paths and experienced very similar one-way delays; see the sketch after this list.
- They use the network effect to improve accuracy. A lot. They exchange probe packet-pairs between a mesh of randomly selected nodes. Using the data from many nodes at once allows calculating clock differences at much better accuracy than would be possible for each Sender/receiver pair alone.
Consider Figure 2
(a) Nodes A and B measure their clock difference as 20 units (in black), while in reality (in green) the difference is 10 units – but on their own they have no way to detect or fix the error.
(b) With pair-wise clock synchronization of 3 nodes – A, B and C – the measured differences around the cycle should sum to zero, but here they leave 10 extra time-units; distributing these "extra" units around the cycle gives better accuracy. (Two sample adjustments are shown.)
(c) With more nodes, accuracy gets much better: using 4 nodes, the adjusted clock difference between A and B comes out exactly right.
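The following sketch shows the loop-consistency idea in miniature. It is my construction, not the authors' estimator: a plain least-squares solve over noisy pairwise offset measurements, which is one simple way to realize the network effect of Figure 2.

```python
import numpy as np

# A minimal sketch of the network effect (my construction, not the paper's
# exact method): given noisy pairwise measurements m_ij ~ offset_j - offset_i,
# solve for all node offsets at once in the least-squares sense. Any "extra"
# time-units around a cycle (as in Figure 2) get spread across the edges.

def network_effect_offsets(n_nodes, measurements):
    """measurements: list of (i, j, m_ij) with m_ij ~ offset_j - offset_i."""
    rows, rhs = [], []
    for i, j, m_ij in measurements:
        row = np.zeros(n_nodes)
        row[i], row[j] = -1.0, 1.0
        rows.append(row)
        rhs.append(m_ij)
    # Offsets are only defined up to a constant: pin node 0 as the reference.
    anchor = np.zeros(n_nodes)
    anchor[0] = 1.0
    rows.append(anchor)
    rhs.append(0.0)
    offsets, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return offsets

# Figure-2-like toy: the A-B link is mis-measured as 20 (truth: 10). Closing
# cycles through C and D lets least-squares pull the estimate back toward 10.
measurements = [(0, 1, 20.0),                # A-B, mis-measured by +10
                (1, 2, 5.0), (2, 0, -15.0),  # close the A-B-C cycle
                (0, 3, 7.0), (3, 1, 3.0)]    # a fourth node helps further
print(network_effect_offsets(4, measurements))
```

With only the A-B edge, the estimate would stay at 20; with the extra cycles it moves to 15, halving the error. More nodes and edges push it closer still, which is exactly the effect the figure illustrates.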
HUYGENS operation
1. Each node in HUYGENS selects 10-20 nodes at random to synchronize with (regardless of the total number of servers in the network). A node that receives probe packet-pairs responds to the sender.
2. Probe packet-pairs where the difference in reception times is similar to the difference in the recorded transmission times are kept and used; the rest are discarded.
3. The one-way delays in each direction (to the remote server and back) are calculated from the time-stamps. With reference to Figure 3, let T and R be the absolute (true) times at which a probe was sent by A and received by B. Define ΔA = TXA − T and ΔB = RXB − R (i.e. these are the time-stamp inaccuracies of the two clocks). Since R > T, we get RXB − ΔB > TXA − ΔA, and by rearranging we get
(1) ΔB − ΔA < RXB − TXA.
Applying the same argument to a probe sent from B to A gives the matching lower bound, so together:
(2) TXB − RXA < ΔB − ΔA < RXB − TXA.
Thus we get an UPPER and a LOWER bound on the clock-difference between A and B.
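Here is a tiny sketch of how each surviving probe turns into one bound point. Function and variable names are mine, chosen to mirror the notation above; the example numbers are illustrative.

```python
# A tiny sketch of equations (1) and (2): every surviving probe contributes
# one bound point on the clock difference ΔB - ΔA.

def upper_bound_ns(tx_a_ns, rx_b_ns):
    # A -> B probe, equation (1):  ΔB - ΔA < RXB - TXA
    return rx_b_ns - tx_a_ns

def lower_bound_ns(tx_b_ns, rx_a_ns):
    # B -> A probe, equation (2):  TXB - RXA < ΔB - ΔA
    return tx_b_ns - rx_a_ns

# Example with ~1 us one-way delay and B's clock ~93 ns behind A's:
print(upper_bound_ns(tx_a_ns=1_000, rx_b_ns=1_907))  # 907:   ΔB - ΔA < 907 ns
print(lower_bound_ns(tx_b_ns=5_000, rx_a_ns=6_093))  # -1093: ΔB - ΔA > -1093 ns
```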
When many such bounds are measured and plotted, a clear boundary emerges on each side, as in Figure 4.
The slope of the boundary line is the clock-drift between the two systems, and the intercept is the clock difference.
For example, in the measurement shown in Figure 4, when the clock of NIC A shows 0, NIC B's clock shows somewhere between −94 and −92 ns, and for every 1 s measured by the clock at A, the clock at B measures about 1.6 μs less (the slope is −1.6 μs/s).
The distance between the boundary lines is an estimate of the RTT between the systems.
If we measure over a long period of time, the clock-drift is not necessarily linear, since drift is affected by temperature. For example, when measured over 60 seconds, they got the situation shown in Figure 7. Note that the distance between the upper and lower bounds remains about equal, and that over a short period of time (2-4 seconds), representing the boundary by a straight line is a good approximation.
HUYGENS uses a Support Vector Machine (SVM) to take in the incoming points (i.e. the measured upper and lower bounds calculated per the equations above) and calculate the slope and intercept of the boundary lines. This is done at each HUYGENS node. The results are sent to a centralized calculation element, where the network effect (as illustrated above) is used to calculate more accurate values, which are then used to synchronize the clocks. This is repeated at 2-second intervals, and each time HUYGENS calculates the clock difference for the middle of the already-measured interval (i.e. approximately one second in the past).
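Below is a self-contained sketch of the SVM step on synthetic data. It is my construction, not the authors' code: upper-bound points get label +1, lower-bound points get label −1, and the maximum-margin line separating the two clouds estimates the clock-difference line. The constants echo the Figure 4 numbers quoted above and are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic bound points for one 2-second interval (values in ns).
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0, 400)                 # seconds into the interval
true_line_ns = -93.0 - 1600.0 * t              # offset -93 ns, drift -1.6 us/s
owd_ns = 1000.0 + rng.exponential(200.0, 400)  # one-way delays, always > 0

upper = true_line_ns + owd_ns                  # bound points from A->B probes
lower = true_line_ns - owd_ns                  # bound points from B->A probes

X = np.column_stack([np.tile(t, 2), np.concatenate([upper, lower])])
y = np.concatenate([np.ones(400), -np.ones(400)])
svm = SVC(kernel="linear", C=1e6).fit(X, y)    # near-hard-margin linear SVM

# Separating line w0*t + w1*d + b = 0  ->  d = -(w0/w1)*t - b/w1
(w0, w1), b = svm.coef_[0], svm.intercept_[0]
print("estimated drift  :", -w0 / w1, "ns/s")  # ~ -1600
print("estimated offset :", -b / w1, "ns")     # ~ -93
```

Note this sketch recovers the midline between the two boundaries; the gap between them, per the text above, estimates the RTT.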
Note: I am not replicating here the authors' derivation of this calculation. Interested readers are strongly encouraged to read this very worthy article in its original form.
Results
The authors used FPGA boards with attached NIC interfaces, each board emulating 4 servers.
In one evaluation instance, "servers" belonging to a single FPGA board were synchronized using HUYGENS, and since all servers on the same FPGA board run off the same clock, the "real" clock difference is known to be 0.
In a second test, two such FPGA boards were used, and their clocks were synchronized in two different ways: (1) using HUYGENS and (2) by a direct connection over GPIO pins, which eliminates RTT and noise effects.
The results could then be compared to evaluate HUYGENS's accuracy.
This was done over two network fabrics: one with 1Gbps links and switches and one using 40Gbps. (K is the number of servers each HUYGENS node probes)
As can be seen, the results are impressive – an average accuracy of about 13 ns and a 99th percentile of 30 ns.
They evaluated the benefit of the network effect by increasing K from 2 to 12; accuracy improves at first, but the law of diminishing returns kicks in – which is a good thing, since it means K, the number of servers that need to be probed, does not have to scale with the number of servers in the data-center; even for 40Gbps, probing 20 servers is more than enough.
Since HUYGENS relies on probe-packets, and furthermore looks for probe packet-pairs where both packets experienced similar, hopefully minimal, delay, the question of the effect of network load naturally comes up. The authors tested HUYGENS in the face of high (90%) network load, and while accuracy suffers a little, the results are still very good. Apparently, over the 2-second measurement interval enough probe-pairs get through cleanly that, after HUYGENS processing, accurate clock differences can still be calculated.
Since HUYGENS does not rely on dedicated hardware, a comparison with NTP is relevant, and in general HUYGENS is MUCH better – a difference of 4 orders of magnitude.
Cost
HUYGENS takes up some network bandwidth to transfer the probes, and some per-server CPU cycles to calculate the initial (local) slope and intercept of the boundary lines. Each server sends probes to K other randomly chosen servers every T seconds; probing is bidirectional, so probed servers send probes back at the same rate. In testbed T-40 (40Gbps), K = 20 and T = 500 μs; in testbed T-1 (1Gbps), K = 10 and T = 4 ms.
The authors claim a 0.05% bandwidth cost for a 40Gbps network where each server probes 20 others. My own calculation gives 0.1% (in each direction) – still negligible (2000 probe-pairs/s × 2 packets × 64 bytes × 20 targets ≈ 41 Mbps); see the check below.
For the 1Gbps case they use only 10 targets and send far fewer probes, which amounts to about 2.5 Mbps (0.25%) of bandwidth.
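A quick back-of-the-envelope check of both cases, using the K and T values quoted above and assuming 64-byte probe packets sent in pairs (my arithmetic, not the paper's):

```python
# Probe-traffic cost per server, one direction, assuming 64-byte packets.
def probe_bw_bps(k_targets, period_s, pkt_bytes=64, pkts_per_probe=2):
    probes_per_sec = 1.0 / period_s
    return probes_per_sec * pkts_per_probe * pkt_bytes * 8 * k_targets

t40 = probe_bw_bps(k_targets=20, period_s=500e-6)   # 40Gbps testbed
t1  = probe_bw_bps(k_targets=10, period_s=4e-3)     # 1Gbps testbed
print(f"T-40: {t40/1e6:.1f} Mbps = {100*t40/40e9:.2f}% of 40Gbps")  # ~41.0, 0.10%
print(f"T-1 : {t1/1e6:.2f} Mbps = {100*t1/1e9:.2f}% of 1Gbps")      # ~2.56, 0.26%
```

The T-1 figure rounds to the ~2.5 Mbps (~0.25%) quoted above.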
The local calculation takes up 7 ms on a single core of a 2.8 GHz Intel i5 (which, for K = 20, amounts to a 0.44% CPU cost out of the total 32-core capacity).
Real time HUYGENS (HUYGENS-R)
As presented, HUYGENS takes measurements over a 2-second interval and calculates the clock difference for the middle of that interval – which is to say, for approximately 1 second in the past.
However, by extrapolating from successive clock-difference values over the last few measurement intervals, it is possible to estimate the CURRENT, "real-time" clock difference.
In Figure 13, the blue dots represent the actual clock-differences HUYGENS calculates for each time interval. A simple linear extrapolation to the green dots gives an estimate of the clock-difference 6 time periods later. As can be seen, the estimates and the actual measured data quickly converge, to the point that the extrapolated estimate has very good accuracy.
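A minimal sketch of the HUYGENS-R idea (my construction, with illustrative numbers): fit a line through the clock-difference estimates of the last few 2-second intervals and extrapolate it forward to estimate the difference "now".

```python
import numpy as np

def extrapolate_offset_ns(interval_midpoints_s, offsets_ns, now_s):
    """Linear extrapolation from recent (time, offset) estimates to `now_s`."""
    slope, intercept = np.polyfit(interval_midpoints_s, offsets_ns, deg=1)
    return slope * now_s + intercept

# Offsets measured at the midpoints of the last four 2-second intervals,
# drifting slowly; estimate the offset one second past the latest midpoint.
midpoints = np.array([1.0, 3.0, 5.0, 7.0])            # seconds
offsets   = np.array([-93.0, -96.1, -99.4, -102.2])   # ns
print(extrapolate_offset_ns(midpoints, offsets, now_s=8.0))
```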
Final Thoughts/Notes
1. It seems HUYGENS points the way to very good clock synchronization for DC networks without requiring IEEE 1588 hardware. This is impressive, but comes at a time when 1588 hardware is almost a given in modern NICs.
2. I wonder: if 1588 HW is already present (and hence its cost is already factored in), can the HUYGENS methodology be used to get even better accuracy?
3. While the authors focus on NICs, I see no reason why the same could not be done between switches (where each switch port would be the logical equivalent of a NIC) and between switches and servers. Since many switches nowadays allow some user-written program (sometimes packaged as a VM or Linux container) to run on the switch, HUYGENS might be "retrofitted" to switches by end-users, and does not necessarily have to be implemented by the switch vendor to be available.