Reliability: Non-Stop Operation is the Ancelus Goal.
[Image: Table join test system, billion-record tables]


Uptime is important. In operational systems it can be critical. Eliminating the root causes of downtime requires a passionate focus on the mundane: what happens in production, not in the demo. Most DBMS vendors would rather not talk about it. Ancelus developers fixate on it. The hardest part is containing unplanned downtime.

First, we need to recognize that all systems crash. We’ve made the core elements of Ancelus as bulletproof as possible through years of testing. But that isn’t always enough. Sometimes it’s a hardware failure. Or it might be a power spike that gets past the UPS, or a problem with the application code. Or maybe a cosmic ray flips a bit (the explanation when we can’t find a cause). But with Ancelus it doesn’t need to be a 5-alarm crisis. Once it happens, the immediate priority is to return to service with up-to-the-event data.

Several modes of system failure can be repaired in Ancelus with no downtime at all. If an index is corrupted, the Fix utility detects and repairs the indexes while the database is live – no downtime needed.
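Ancelus doesn’t publish the internals of its Fix utility, but the general idea of live index repair can be sketched: walk the index, verify each entry against the base table, and overwrite or re-insert only the bad entries, so readers keep working throughout. The data layout and function names below are illustrative assumptions, not Ancelus code.

```python
# Hypothetical sketch of online index repair: verify each index entry
# against the base table and fix only the entries that are wrong.
# The dict-based layout is an illustrative assumption, not Ancelus internals.

def build_index(records):
    """Map key -> row position: the 'healthy' state of the index."""
    return {rec["key"]: pos for pos, rec in enumerate(records)}

def fix_index(index, records):
    """Repair a possibly corrupted index in place, one entry at a time,
    so concurrent readers only ever see a valid or freshly repaired slot."""
    repaired = 0
    expected = build_index(records)
    for key in list(index):
        if key not in expected:
            del index[key]              # spurious entry: drop it
            repaired += 1
        elif index[key] != expected[key]:
            index[key] = expected[key]  # entry points at the wrong row
            repaired += 1
    for key, pos in expected.items():
        if key not in index:
            index[key] = pos            # entry was lost entirely
            repaired += 1
    return repaired

records = [{"key": "a"}, {"key": "b"}, {"key": "c"}]
index = build_index(records)
index["b"] = 99          # simulate corruption: entry points at a bogus row
del index["c"]           # simulate corruption: entry lost
fixed = fix_index(index, records)
print(fixed)  # -> 2
```

The key design point is that each slot is repaired independently, so the repair never needs to take the whole index offline.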

If an application failure leaves stale locks behind, the Ancelus Lock Monitor utility can detect the offending thread and release its locks in a small fraction of a second – no downtime needed.
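The Lock Monitor’s implementation isn’t public either, but the mechanism it describes is simple to sketch: keep a table of which thread owns each lock, and release any lock whose owner is no longer alive. Everything below (the lock table, function names) is an illustrative assumption.

```python
import threading

# Hypothetical lock-monitor sketch: detect locks whose owning thread has
# died and release them without restarting anything. The lock table and
# names are illustrative, not the actual Ancelus Lock Monitor.

lock_table = {}  # resource name -> owning Thread

def acquire(resource, thread):
    lock_table[resource] = thread

def release_stale_locks():
    """Release locks held by threads that are no longer alive."""
    released = []
    for resource, owner in list(lock_table.items()):
        if not owner.is_alive():
            del lock_table[resource]
            released.append(resource)
    return released

worker = threading.Thread(target=lambda: None)
worker.start()
acquire("row:42", worker)
worker.join()                  # worker exits without releasing its lock
released = release_stale_locks()
print(released)  # -> ['row:42']
```

Because the check runs against live thread state rather than a timeout, it can free the stale lock as soon as the dead owner is detected.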

In the extreme case where the database must be restored, it can be accomplished in a few minutes from the last full backup plus a journal replay, rather than many hours of load-and-index. High-speed utilities plus integrated indexes make it possible.
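Backup-plus-journal recovery has a standard shape: load the last full backup, then replay only the journaled transactions committed after it. The journal format and sequence numbering below are illustrative assumptions, not the Ancelus file formats.

```python
# Hypothetical sketch of backup-plus-journal recovery: restore the last
# full backup, then replay the journal tail committed after it.
# The entry format and sequence numbers are illustrative assumptions.

def restore(backup, journal):
    """Rebuild database state from a full backup plus the journal tail."""
    db = dict(backup["data"])
    for entry in journal:
        if entry["seq"] <= backup["seq"]:
            continue  # already contained in the backup
        if entry["op"] == "put":
            db[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            db.pop(entry["key"], None)
    return db

backup = {"seq": 2, "data": {"a": 1, "b": 2}}
journal = [
    {"seq": 1, "op": "put", "key": "a", "value": 1},
    {"seq": 2, "op": "put", "key": "b", "value": 2},
    {"seq": 3, "op": "put", "key": "c", "value": 3},
    {"seq": 4, "op": "delete", "key": "a"},
]
restored = restore(backup, journal)
print(restored)  # -> {'b': 2, 'c': 3}
```

The speed claim in the text follows from this shape: the expensive part of a conventional restore is rebuilding indexes, and if the indexes are integrated with the data (as described above), replaying the journal tail is all that remains.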

This demonstrates the durability of the Ancelus system using the real-time journal. There are two other methods of retaining durable data. The snapshot backup for small data sets (generally under 2 GB) can deliver a quick full backup automatically every X seconds. It does a memory copy and writes to disk in the background. This puts a small amount of data at risk (the amount inserted in X seconds) and makes recovery slower (first, find the last good backup). For larger systems, and for those that cannot tolerate downtime, a real-time replicate with hot fail-over will duplicate every transaction to two databases, detect the failure of the primary, switch to the secondary, and then repair and re-synchronize the primary.
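The snapshot trade-off described above can be sketched: writers are only paused for the fast in-memory copy, and the slow disk write happens in the background, which is also why changes made after the copy are the data at risk. The function names and callback are illustrative assumptions.

```python
import copy
import threading

# Hypothetical sketch of the snapshot backup described above: copy the
# in-memory data set quickly, then persist the copy in the background so
# writers are only paused for the memory copy itself. Names are
# illustrative, not Ancelus parameters.

def snapshot(db, write_fn):
    frozen = copy.deepcopy(db)              # brief pause: memory copy only
    t = threading.Thread(target=write_fn, args=(frozen,))
    t.start()                               # slow disk write runs off-thread
    return t

written = []                                # stands in for the disk file
db = {"a": 1}
t = snapshot(db, written.append)
db["b"] = 2                                 # writes continue during backup
t.join()
print(written)  # -> [{'a': 1}]
```

Note that the insert made after the copy is absent from the backup: that is exactly the "amount inserted in X seconds" window of risk the text describes.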

Whatever causes the outage, DBAs are expected to get the system back quickly. The cost of downtime in lost productivity, customer aggravation, process control upsets, and service level violations is simply too high. Not to mention the DBA's reputation.

 

Craig Mullins

Craig Mullins, President & Principal Consultant at Mullins Consulting, Inc. IBM Gold Consultant and IBM Champion for Data and AI

5y

Eliminating the root causes of downtime is, indeed, an admirable goal. DBAs are always looking for ways to reduce downtime - and especially to minimize the amount of their precious free time required to keep their database systems up and running.
