Why Fault Domains are so important in Oracle Cloud Infrastructure
Luke Martin Feldman
Senior OCI Cloud Evangelist | DevOps | Terraform | Multicloud
Yesterday a friend asked me a question about a new OCI feature: Fault Domains. The question was somewhat tricky, and I had to think twice before answering. It went like this: "Why are Fault Domains so important in OCI, especially if we can just place VMs in different availability domains?"
Well, that is only partially true. Let's recall a simple fact. When we spin up the first instance, and any subsequent one, in a single availability domain, we have no control over which physical host the virtual machine is placed on. Now imagine we have two types of machines: a webserver and a middleware server. These servers only work smoothly as a pair. Without the middleware server, communication from the webserver to the persistence layer (stored in files or in a database) breaks. Conversely, if the webserver crashes, the middleware server becomes blind to requests from UI users. So from a functional standpoint, only a harmonious duo of webserver and middleware server makes sense. OK, easy so far, right?
Let's go further and imagine that the customer has decided to connect their on-premises infrastructure to OCI using VPN or FastConnect. This means the customer's networks can be interconnected in a hybrid solution, so clients on-premises or on the Internet can communicate with the webservers. Let's also imagine a large number of requests, reflecting the large scale of the customer's business. From an architectural point of view, the customer will obviously spin up several webservers to handle this workload. It could be, for example, 3 webservers per AD and, say, 2 middleware servers per AD (see the topology diagram below, picture1):
And here is the thing: let's assume a physical box in the cloud suffers a hardware failure. This physical server will be restarted or even shut down. Because fault domains were not used, only the middleware servers in AD1 become unavailable. Why? Because the cloud placement algorithm, unaware of each server's function, deployed the AD1 middleware VMs on that unfortunate physical box. What happens next? For a moment everything keeps working, but as expected, all middleware traffic will be redirected by the fully operational web layer (5 webservers) to the surviving middleware servers in AD2. We can expect some functional turbulence (see picture2 below).
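The effect of this failure on the surviving middleware servers can be estimated with simple arithmetic. The sketch below is only an illustration: the total request rate is an assumed figure, and it assumes the web layer spreads its calls evenly across the healthy middleware servers.

```python
def load_per_server(total_requests: float, healthy_servers: int) -> float:
    """Spread the request rate evenly across the healthy middleware servers."""
    if healthy_servers <= 0:
        raise ValueError("no healthy middleware servers left")
    return total_requests / healthy_servers


# Assumed, purely illustrative request rate coming from the web layer.
TOTAL_REQUESTS = 1200.0

# Topology from the article: 2 middleware servers in each of AD1 and AD2.
before = load_per_server(TOTAL_REQUESTS, healthy_servers=4)

# Without fault domains, both AD1 middleware VMs may share one physical box,
# so a single hardware failure can remove both of them at once.
after = load_per_server(TOTAL_REQUESTS, healthy_servers=2)

print(before, after, after / before)  # 300.0 600.0 2.0
```

Under these assumptions, each middleware server in AD2 has to absorb twice its normal load while the web layer keeps sending traffic at full speed.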
The middleware servers could be overloaded by the unrestricted workload coming from the web layer. Predictably, UI users could experience significant latency, and the web application, although still functional, would cause a lot of frustration for end users. This doesn't look like an uncommon scenario, right? So what else could we do to give our users some relief? I think we can make this situation more stable and, as a consequence, more secure. How? With a wiser configuration, as follows (details in picture3):
Let's assume that within AD1 we use two fault domains. We place 2 webservers and 1 middleware server in the first fault domain, and the third webserver and the second middleware server in the second fault domain. How does this help us with a hardware failure? It is rather simple. Suppose the hardware associated with the first fault domain fails. AD1 still has a working pair: one webserver and one middleware server (fault domain 2). They can serve requests, and the Load Balancer can balance the workload between AD1 and AD2. We can expect faster response times and generally more stability in the system, because we have a better design. In this scenario, more is under control. And you know what? My friend was convinced by my arguments. What about you?
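The AD1 placement described above can be modelled in a few lines to check what survives the loss of one fault domain. This is a minimal sketch, not OCI tooling: the instance names and the labels FD1 and FD2 are placeholders chosen for illustration, and only AD1 is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str          # hypothetical instance name
    role: str          # "web" or "middleware"
    ad: str            # availability domain label
    fault_domain: str  # placeholder fault domain label


# AD1 placement from the article: 2 web + 1 middleware in the first
# fault domain, 1 web + 1 middleware in the second.
PLACEMENT = [
    Instance("web1", "web", "AD1", "FD1"),
    Instance("web2", "web", "AD1", "FD1"),
    Instance("mw1", "middleware", "AD1", "FD1"),
    Instance("web3", "web", "AD1", "FD2"),
    Instance("mw2", "middleware", "AD1", "FD2"),
]


def survivors(placement, failed_ad, failed_fd):
    """Return the instances that are not in the failed fault domain."""
    return [
        i for i in placement
        if not (i.ad == failed_ad and i.fault_domain == failed_fd)
    ]


def roles_left(instances, ad):
    """Count the remaining instances per role in one availability domain."""
    counts = {}
    for i in instances:
        if i.ad == ad:
            counts[i.role] = counts.get(i.role, 0) + 1
    return counts


after_fd1_failure = roles_left(survivors(PLACEMENT, "AD1", "FD1"), "AD1")
print(after_fd1_failure)  # {'web': 1, 'middleware': 1}
```

Whichever of the two fault domains fails, AD1 keeps at least one webserver and one middleware server, which is exactly the working pair the design is meant to preserve.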
Luke
Cloud Consultant at Cognizant Technology Solutions
6 years ago
Thanks. That's helped my understanding. I like the linked scenarios too. Fault domains also help with non-redundant application stacks, I like it.
Spreading positivity while enjoying LIFE!!! WAGMI || Ninja in the PM/ Customer Success - Web2 ~ Web3
6 years ago
Great article, Luke. Is it recommended to have at least one fault domain per AD in general scenarios, so that customers can overcome hardware failures?