TidalScale and Reliable Computing

Authored by Dan Siewiorek, Professor of Electrical and Computer Engineering and Computer Science at Carnegie Mellon University

Starting with my Ph.D. thesis, reliability has been a focus of mine for many years. I was recently exposed to a new way of addressing reliable computing by the company TidalScale. What struck me was how this innovative approach, which they call TidalGuard, could use industry-standard servers to dramatically improve the availability and reliability of the computing system. Further, because there is nothing unique from a hardware or software standpoint, these benefits can be realized in the cloud or in a customer’s datacenter.

Traditional Reliable Computing

The goal of fault-tolerant computing is the correct execution of a computing workload in the presence of hardware and software defects. How well a system meets that goal can be expressed as availability (the expected fraction of time the system is available to perform useful work) or reliability (the conditional probability that the system has operated correctly throughout an interval, given that it was operational at the start of that interval). The effects of defects can be overcome using redundancy, which can be either temporal (repeated executions) or physical (replicated hardware or software). A traditional redundant system may go through as many as nine stages in response to the occurrence of a failure. The stages are outlined below.
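For reference, the two measures mentioned above have standard quantitative forms. The following is a minimal sketch using the usual conventions of reliability theory, where MTTF is mean time to failure, MTTR is mean time to repair, and the exponential form assumes a constant failure rate λ:

```latex
% Steady-state availability: expected fraction of time the system can do useful work
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}

% Reliability: probability the system operates correctly throughout [0, t],
% given that it was operational at time 0 (constant failure rate \lambda assumed)
R(t) = e^{-\lambda t}
```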

Designing a redundant system involves the selection of a coordinated failure response that combines some or all of these stages. It is important for the system designer to provide responses for as many of these stages as practical, since the system will do something at each stage; it is better to have the system respond in a planned rather than an unplanned manner. This is the challenge of traditional fault-tolerant design.

Basic stages in fault handling[*]

  1. Fault Confinement – contain it before it can spread. This stage limits the spread of fault effects to one area of the system, thereby preventing contamination of other areas. Fault confinement can be achieved through the liberal use of fault-detection circuits, consistency checks before performing a function (“mutual suspicion”), and multiple requests/confirmations before executing a function.
  2. Fault Detection – find out about the error to prevent acting on bad data. This stage recognizes that something unexpected has occurred in the system. Many techniques are available to detect faults, but an arbitrary period, called fault latency, may pass before detection occurs.
  3. Fault Masking – hide the effects of the fault. When masking redundancy is used, correct redundant information outweighs the incorrect information (as in majority voting), so the fault never becomes visible to the rest of the system.
  4. Retry – since most faults are transient, a second attempt at the operation often succeeds, so the system simply tries again.
  5. Diagnosis – figure out what went wrong as a prelude to correction. This stage is necessary if the fault detection technique does not provide information about the failure location and/or properties.
  6. Reconfiguration – work around a defective component. This stage occurs when a fault is detected and a permanent failure is located. The system might be able to reconfigure its components either to replace the failed component or to isolate it from the rest of the system. The component may be replaced by backup spares. Alternatively, the component may be switched off and the system capability reduced in a process called graceful degradation.
  7. Restart – re-initialize (hot, warm, or cold restart). This stage occurs after the recovery of undamaged information. A “hot” restart, which is a resumption of all operations from the point of fault detection, is possible only if no damage has occurred. A “warm” restart implies that only some of the processes can be resumed without loss. A “cold” restart corresponds to a complete reload of the system, with no processes surviving.
  8. Repair – in this stage, a component diagnosed as having failed is replaced. As with detection, repair can be either on-line or off-line. In off-line repair, either the system will continue if the failed component is not necessary for operation, or the system must be brought down to perform the repair. In on-line repair, the component may be replaced immediately by a backup spare in a procedure equivalent to reconfiguration, or operation may continue without the component, as in the case of failure-masking redundancy or graceful degradation. In either case of on-line repair, the failed component may be physically replaced or repaired without interrupting system operation.
  9. Reintegration – after repair, go from degraded to full operation. In this stage, the repaired module must be reintegrated into the system. For on-line repair, reintegration must be accomplished without interrupting system operation. In some cases, upon reintegration, the repaired module must be re-initialized correctly to reflect the state of the rest of the functioning modules with which it must now work.
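To make the flow concrete, here is a minimal, illustrative sketch of how a few of these stages can compose in software: an operation’s result is checked before it is acted on (detection), the operation is re-attempted a bounded number of times on the assumption that the fault is transient (retry), and if the fault persists the work is routed to a backup spare (reconfiguration). The sketch is not taken from the book or from TidalScale; the `FlakyDevice` class and the function names are hypothetical.

```python
"""Illustrative sketch of fault-handling stages: detection, retry, and reconfiguration.

All names here are hypothetical; this is not code from the cited book or from TidalScale.
"""
import random
import time


class TransientFault(Exception):
    """Raised when a consistency check detects an unexpected result (fault detection)."""


class FlakyDevice:
    """Toy stand-in for a component whose consistency check fails some fraction of the time."""

    def __init__(self, name: str, fail_rate: float):
        self.name = name
        self.fail_rate = fail_rate

    def read(self) -> bytes:
        return b"payload"

    def checksum_ok(self, data: bytes) -> bool:
        return random.random() >= self.fail_rate


def checked_read(device: FlakyDevice) -> bytes:
    """Detection: run a consistency check before acting on the data."""
    data = device.read()
    if not device.checksum_ok(data):
        raise TransientFault(f"checksum mismatch on {device.name}")
    return data


def read_with_retry(device: FlakyDevice, attempts: int = 3, backoff_s: float = 0.1) -> bytes:
    """Retry: most faults are transient, so a bounded number of re-executions often succeeds."""
    for attempt in range(attempts):
        try:
            return checked_read(device)
        except TransientFault:
            time.sleep(backoff_s * (attempt + 1))  # brief pause before the next attempt
    raise TransientFault(f"fault on {device.name} persisted across {attempts} attempts")


def read_with_reconfiguration(primary: FlakyDevice, backup: FlakyDevice) -> bytes:
    """Reconfiguration: after retries fail, fall back to a backup spare."""
    try:
        return read_with_retry(primary)
    except TransientFault:
        # Diagnosis and repair of the primary happen off this path; serve from the spare meanwhile.
        return read_with_retry(backup)


if __name__ == "__main__":
    # A primary that usually fails its check, and a reliable backup spare.
    print(read_with_reconfiguration(FlakyDevice("primary", 0.9), FlakyDevice("backup", 0.0)))
```

A production system would add diagnosis and repair paths behind the fallback and record enough information to support later reintegration of the repaired component.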

TidalScale addresses three major goals: Simplicity, Scalability, and Reliability.

TidalScale’s approach builds on their core innovation of virtualizing the resources of a server (e.g., CPUs, memory, I/O) and then mobilizing them. Specifically, they can move these resources from one physical server to another while the unmodified operating system and application continue to run. This allows TidalScale to create a large, virtual server out of industry-standard x86 servers. They call these systems “software-defined servers.” The migration of resources is done by TidalScale’s distributed hypervisor, which they call a hyperkernel. The hyperkernel runs on each server node, and the instances coordinate with each other to create and maintain working sets with the appropriate guest virtual CPUs, guest virtual memory, and guest virtual I/O.

An important implication of this structure is that standard operating systems and applications run without any modification. We thus achieve our first goal of simplicity: no changes to existing software are required, simplifying adoption.

By using a computing structure in which virtual resources can migrate, we can add and subtract physical servers from a distributed virtual machine. Thus, we achieve our second goal: scalability.

TidalScale uses this migration capability to hot swap a server, either to avoid an impending failure or to proactively take a machine offline for updates, diagnosis, or repairs. Data collected from real operational systems indicate that there is often a substantial amount of time between the first observed error and an unrecoverable failure. During this window, TidalScale can automatically carry out the fault-handling stages described above.

For example, when a problem is detected on physical server n, the hyperkernels on all the other physical servers are told not to send any active guest physical pages or guest virtual processors to n; in other words, n is quarantined. An additional physical server may be added to the cluster to maintain previous performance levels. Physical server n is then directed to evict all of its active guest physical pages and guest virtual processors to other physical servers. When this is complete, physical server n can be removed for repair.
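That sequence can be summarized as a short orchestration sketch. To be clear, this is not TidalScale’s hyperkernel API; the `Cluster` and `Node` types and every method name below are hypothetical stand-ins, used only to show the order of operations described above (quarantine, optional capacity replacement, eviction of guest state, removal).

```python
"""Illustrative orchestration of the hot-swap sequence described above.

The Cluster and Node types are hypothetical stand-ins, not TidalScale's hyperkernel API.
"""
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    guest_pages: Set[int] = field(default_factory=set)   # active guest physical pages held here
    guest_vcpus: Set[int] = field(default_factory=set)   # guest virtual processors running here


@dataclass
class Cluster:
    nodes: List[Node]
    quarantined: Set[str] = field(default_factory=set)

    def quarantine(self, failing: Node) -> None:
        """Step 1: tell every other node not to send guest state to the failing node."""
        self.quarantined.add(failing.name)

    def add_spare(self, spare: Node) -> None:
        """Step 2 (optional): add a physical server so capacity stays at previous levels."""
        self.nodes.append(spare)

    def evict_guest_state(self, failing: Node) -> None:
        """Step 3: the failing node pushes all of its guest pages and vCPUs to healthy nodes."""
        targets = [n for n in self.nodes if n is not failing and n.name not in self.quarantined]
        assert targets, "need at least one healthy node to receive guest state"
        for i, page in enumerate(sorted(failing.guest_pages)):
            targets[i % len(targets)].guest_pages.add(page)
        for i, vcpu in enumerate(sorted(failing.guest_vcpus)):
            targets[i % len(targets)].guest_vcpus.add(vcpu)
        failing.guest_pages.clear()
        failing.guest_vcpus.clear()

    def remove(self, failing: Node) -> None:
        """Step 4: with no guest state left on it, the node can be pulled for repair."""
        assert not failing.guest_pages and not failing.guest_vcpus
        self.nodes.remove(failing)


def hot_swap(cluster: Cluster, failing: Node, spare: Node) -> None:
    """Drive the whole sequence; the guest OS keeps running on the remaining nodes."""
    cluster.quarantine(failing)
    cluster.add_spare(spare)
    cluster.evict_guest_state(failing)
    cluster.remove(failing)
```

From the guest operating system’s point of view, the guest pages and virtual processors keep their identity throughout; only their physical placement changes.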

A similar process can be used for upgrades of hardware or firmware. All this is done without having to modify or restart the operating system, which is unaware that any of this is taking place. Thus, the third goal is achieved: reliability.

[*] This section is based upon “The Theory and Practice of Reliable System Design” by Daniel P. Siewiorek and Robert Swarz, Digital Press, 1982.
