Test Engineer Perspective-Part 1:
Arif Chauhan
Test Automation | Performance Testing | Product Development | Python | K6 | Microservices |Bash Shell | TestOps| Kubernetes | Robot Framework | Azure | Telecom @Tanla Platforms Limited
When we build a highly available, enterprise-scale platform, it is important to measure its availability and reliability before going live.
But the question arises: is it feasible, given the constraints of infrastructure, time, and having enough data points?
Would love to hear opinions from like-minded professionals.
Personally, based on my experience evaluating such platforms, I would follow the approach below:
1. Set up a test environment that mirrors production, with all external interfaces connected to simulators.
2. Set up tools to generate input traffic, such as K6, JMeter, or Locust.
3. Set up APM and monitoring tools such as Dynatrace, Grafana, ELK, or cloud-native services.
4. Set up alerts on the application stack, network, and server resources (CPU, memory, storage, etc.). Nagios is one tool for this, or one can write custom scripts if feasible.
5. Create a working group of experts with skills in performance testing, network/infrastructure, development, databases, and operations.
6. Identify the metrics to be collected. In this context, we need to collect the following:
a. Uptime
b. Downtime
c. MTBF (Mean Time Between Failures): average time between consecutive failures.
d. MTTR (Mean Time To Repair): average time taken to restore the platform.
7. Execute an endurance test for a long duration, say a week. Observe the platform and its components, and use the tools mentioned above to record the following:
- Number of failures
- Critical alerts
- Failure duration
- Time taken to repair/restore after failure
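As a rough illustration of how these records could be derived, here is a minimal Python sketch that pairs "DOWN" and "UP" alert events into outage intervals. The event format, the state names, and the function name downtime_intervals are assumptions made for this example; real alert exports from Nagios or an APM tool will look different and need their own parsing.

```python
from datetime import datetime, timedelta


def downtime_intervals(events, test_end):
    """Pair each DOWN event with the next UP event.

    events: list of (timestamp, state) tuples sorted by time, where state is
    "DOWN" or "UP" (a hypothetical format, not any tool's real export).
    An outage still open when the test ends is closed at test_end.
    """
    intervals = []
    down_since = None
    for ts, state in events:
        if state == "DOWN" and down_since is None:
            down_since = ts  # start of a new outage
        elif state == "UP" and down_since is not None:
            intervals.append((down_since, ts))  # outage restored
            down_since = None
        # repeated DOWN alerts during an outage, or stray UP events, are ignored
    if down_since is not None:
        intervals.append((down_since, test_end))
    return intervals


# Illustrative data only: a one-week run with two outages.
start = datetime(2024, 1, 1)
test_end = start + timedelta(days=7)
events = [
    (start + timedelta(hours=30), "DOWN"),
    (start + timedelta(hours=30, minutes=12), "UP"),
    (start + timedelta(hours=100), "DOWN"),
    (start + timedelta(hours=100, minutes=5), "DOWN"),  # duplicate alert
    (start + timedelta(hours=100, minutes=48), "UP"),
]

outages = downtime_intervals(events, test_end)
failure_count = len(outages)
failure_durations = [end - begin for begin, end in outages]
print(failure_count, [str(d) for d in failure_durations])
```

Collapsing repeated alerts into a single outage keeps the failure count from being inflated by noisy monitoring, which would otherwise distort the averages computed later.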
8. The interesting question is: what if the platform does not fail? That sounds great (however, it is unlikely), but it would leave us with no failure data from which to derive MTBF and MTTR. So we need to apply chaos engineering to critical components during the test run and observe the above parameters.
9. Calculate the availability as
Availability (%) = (Uptime / Total Test Duration) * 100
where Uptime = Total Test Duration - ∑ Downtime
10. Calculate the MTBF as
MTBF = (∑ Uptime) / Total Number of Failures
11. Calculate the MTTR as
MTTR = (∑ Downtime) / Total Number of Failures
12. Calculate reliability as
Reliability (%) = (MTBF / (MTBF + MTTR)) * 100
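To make steps 9 to 12 concrete, here is a small Python sketch that applies the formulas above to a list of outage durations. The function name availability_metrics, the use of hours as the unit, and the sample numbers are assumptions for illustration only, not measurements from a real platform. Note that with these definitions MTBF + MTTR equals Total Test Duration divided by the number of failures, so the percentage from step 12 comes out the same as the availability from step 9.

```python
def availability_metrics(total_duration_h, downtimes_h):
    """Apply the formulas from steps 9 to 12.

    total_duration_h: total test duration in hours.
    downtimes_h: one entry per failure, each the outage length in hours.
    MTBF, MTTR and reliability are undefined when no failure occurred,
    so they are returned as None in that case.
    """
    failures = len(downtimes_h)
    total_downtime = sum(downtimes_h)
    uptime = total_duration_h - total_downtime

    metrics = {
        "uptime_h": uptime,
        "downtime_h": total_downtime,
        "failures": failures,
        "availability_pct": uptime / total_duration_h * 100,
        "mtbf_h": None,
        "mttr_h": None,
        "reliability_pct": None,
    }
    if failures > 0:
        mtbf = uptime / failures
        mttr = total_downtime / failures
        metrics["mtbf_h"] = mtbf
        metrics["mttr_h"] = mttr
        metrics["reliability_pct"] = mtbf / (mtbf + mttr) * 100
    return metrics


# Illustrative numbers: a one-week test (168 h) with two outages
# of 15 and 45 minutes.
m = availability_metrics(168.0, [0.25, 0.75])
for name, value in m.items():
    print(f"{name}: {value}")
```

Returning None when there are no failures makes the "platform did not fail" case from step 8 explicit instead of hiding it behind a division-by-zero error.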
Would like to cover more practical aspects in my next post.