TidalScale and Reliable Computing
Authored by Dan Siewiorek, Professor of Electrical and Computer Engineering and Computer Science at Carnegie Mellon University
Starting with my Ph.D. thesis, reliability has been a focus of mine for many years. I was recently exposed to a new way of addressing reliable computing by the company TidalScale. What struck me was how this innovative approach, which they call TidalGuard, can use industry-standard servers to dramatically improve the availability and reliability of a computing system. Further, because there is nothing unique about the hardware or software, these benefits can be realized in the cloud or in a customer's datacenter.
Traditional Reliable Computing
The goal of fault-tolerant computing is the correct execution of a computing workload in the presence of hardware and software defects. Correct execution can be expressed as availability (the expected fraction of time the system is available to perform useful work) or reliability (the conditional probability that the system has survived until now). The effect of defects can be overcome using redundancy, which can be either temporal (repeated executions) or physical (replicated hardware or software). A traditional redundant system may go through as many as nine stages in response to the occurrence of a failure. The stages are outlined below.
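To make the availability metric concrete, here is a minimal sketch using the standard steady-state formula, availability = MTTF / (MTTF + MTTR), where MTTF is mean time to failure and MTTR is mean time to repair. The function name and the example figures are illustrative, not taken from the article.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the expected fraction of time
    the system is up, given mean time to failure (MTTF) and
    mean time to repair (MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Illustrative numbers: a server that fails on average every
# 10,000 hours and takes 4 hours to repair.
a = availability(10_000, 4)
print(f"{a:.6f}")  # prints 0.999600
```

Note how heavily availability rewards shrinking repair time: halving MTTR improves availability as much as doubling MTTF, which is one reason the hot-swap approach described later is attractive.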
Designing a redundant system involves the selection of a coordinated failure response that combines some or all of these steps. It is important for the system designer to provide responses for as many of these stages as practical, since the system will do something at each stage; it is better to have the system respond in a planned rather than an unplanned manner. This is the challenge of traditional fault-tolerant design.
[Figure: Basic stages in fault handling[*]]
TidalScale addresses three major goals: simplicity, scalability, and reliability.
TidalScale's approach builds on their core innovation: virtualizing the resources of a server (e.g., CPUs, memory, I/O) and then making them mobile. Specifically, they can move these resources from one physical server to another while the unmodified operating system and application continue to run. This allows TidalScale to create a large, virtual server out of industry-standard x86 servers. They call these systems "software-defined servers." The migration of resources is done by TidalScale's distributed hypervisor, which they call a hyperkernel. The hyperkernel runs on each server node, and the hyperkernels coordinate with one another to create and maintain working sets with the appropriate guest virtual CPUs, guest virtual memory, and guest virtual I/O.
An important implication of this structure is that standard operating systems and applications run without any modification. We thus achieve the first goal, simplicity: no changes to existing software are required, simplifying adoption.
By using a computing structure in which virtual resources can migrate, we can add and subtract physical servers from a distributed virtual machine. Thus, we achieve the second goal: scalability.
TidalScale uses this migration capability to hot-swap a server, either to avoid an impending failure or to proactively take a machine offline for updates, diagnosis, or repairs. Data collected from real operational systems indicate that there is often a substantial window of time between the first observed error and an unrecoverable failure. During this window, TidalScale can carry out the fault-handling steps automatically.
For example, when a problem is detected on physical server n, the hyperkernels on all the other physical servers are told not to send any active guest physical pages or guest processors to n; in other words, n is quarantined. An additional physical server may be added to the cluster to maintain previous performance levels. Physical server n is then directed to evict all of its active guest physical pages and guest virtual processors to other physical servers. When this is complete, physical server n can be removed for repair.
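The quarantine-and-evict sequence above can be sketched as a toy model. This is not TidalScale's actual API or hyperkernel logic; the class names, fields, and round-robin placement policy are all assumptions made purely for illustration of the three steps (quarantine, evict, remove).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Hypothetical stand-in for one physical server in the cluster."""
    name: str
    pages: list = field(default_factory=list)   # active guest physical pages
    vcpus: list = field(default_factory=list)   # guest virtual processors
    quarantined: bool = False

class Cluster:
    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}

    def quarantine(self, name):
        # Step 1: other hyperkernels stop sending guest pages/vCPUs here.
        self.nodes[name].quarantined = True

    def evict(self, name):
        # Step 2: migrate the node's guest resources to healthy peers
        # (round-robin here; a real system would use working-set locality).
        src = self.nodes[name]
        targets = [n for n in self.nodes.values() if not n.quarantined]
        for i, page in enumerate(src.pages):
            targets[i % len(targets)].pages.append(page)
        for i, vcpu in enumerate(src.vcpus):
            targets[i % len(targets)].vcpus.append(vcpu)
        src.pages, src.vcpus = [], []
        # Step 3: the drained node can now be removed for repair.
        return src

cluster = Cluster([Node("a", ["p1", "p2"], ["c1"]), Node("b"), Node("c")])
cluster.quarantine("a")
drained = cluster.evict("a")
assert not drained.pages and not drained.vcpus  # node "a" is fully drained
```

The key property the sketch captures is that the guest's total resources are conserved during the drain, which is why the operating system running on top never notices the repair.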
A similar process can be used for upgrades of hardware or firmware. All of this is done without having to modify or restart the operating system, which is unaware that any of it is taking place. Thus, the third goal is achieved: reliability.
[*] This section is based upon The Theory and Practice of Reliable System Design by Daniel P. Siewiorek and Robert S. Swarz, Digital Press, 1982.