How Do We Keep an Airplane in the Air 24/7 While We Continue to Upgrade?

While developing a SaaS service, I realised how closely it relates to my early experience in telecom, where all our systems required 99.999% availability. In layperson's terms, the requirement is this: we need to ensure the aeroplane (our service) stays in the air 24/7, 365 days a year, even during updates or upgrades. This aeroplane carries millions of users, and any crash is unacceptable because it impacts all of them.
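To put the 99.999% figure in perspective, here is a minimal sketch of the arithmetic, assuming a 365-day year and treating any unavailability as downtime. The function name `downtime_budget_minutes` is only illustrative.

```python
# Downtime budget implied by an availability target, over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the maximum minutes of downtime per year for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes/year")
```

At 99.999%, the whole year's allowance is a little over five minutes, which is why upgrades cannot rely on taking the service offline.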

Illustration with Z****** Platform

To illustrate, let me share three defects I encountered over three months with the Z****** platform. While I use Z****** as an example, many organisations faced similar issues.

  1. Rebalancing Baskets: I could not rebalance my Baskets for nearly six weeks due to ongoing issues. After extensive troubleshooting with Z******, it became apparent that the interaction between their software, K*** (sister company), and CDSL had a functional defect. Eventually, I was offered a workaround: authenticate with CDSL from K***, log out, and rebalance. It worked.
  2. CDSL Platform Downtime: In another incident, the CDSL platform went down, and the workaround offered was to skip the authentication process.
  3. R******* Transaction Delays: On June 4th and 5th, R******* did not report transactions on time, causing the software to fail in processing buy requests.

Each issue began with casual responses and eventually led to apologies, but customers had no recourse unless they persisted. I estimate I lost over ₹25,000, and I'm sure many others experienced similar losses.

Common Responses to Customer Issues

Having worked in similar roles for nearly three decades, I understand the typical responses from development teams:

  1. "It happens in your environment only."
  2. "It is a random issue; please try again later."
  3. "Shut down and restart."
  4. "Can you reproduce the issue and share logs and traces?"
  5. "It is tough to resolve since we can't reproduce it."
  6. "It happens under extreme load or once in a blue moon."
  7. "It is an act of God."

Many can relate to these responses, which are frustrating and often unhelpful.

Recommendations for Organisations

While compensation might be too much to ask in a country where justice is often delayed, I recommend that organisations take the moral high ground and become more transparent. They should share details immediately, including:

  1. What happened?
  2. How long has the problem persisted?
  3. Root cause analysis using the 5 "whys."
  4. How many customers were impacted, and what was the likely amount of loss?
  5. When was the issue discovered?
  6. How were the impacted customers proactively informed?
  7. What is the workaround until it is fixed?
  8. What corrective actions have been taken?
  9. What preventive actions are planned?
  10. Whether a fix is required, the timeline, and recommended actions until the issue is resolved.
  11. Communicate with all customers about the recommended actions.
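One way to make this checklist hard to skip is to treat it as a structured record and flag whatever has not been filled in. The sketch below is only an illustration: the class `IncidentReport`, its field names, and the `missing_items` helper are hypothetical, not taken from any provider's actual tooling, and the sample values come from the first defect described above.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class IncidentReport:
    """Hypothetical record mirroring the disclosure checklist above."""
    what_happened: Optional[str] = None
    duration: Optional[str] = None
    root_cause_five_whys: Optional[str] = None
    customers_impacted: Optional[int] = None
    estimated_loss: Optional[str] = None
    discovered_at: Optional[str] = None
    customer_notification: Optional[str] = None
    workaround: Optional[str] = None
    corrective_actions: Optional[str] = None
    preventive_actions: Optional[str] = None
    fix_timeline: Optional[str] = None

    def missing_items(self) -> List[str]:
        """Return the names of checklist items that are still unanswered."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example: only two of the eleven items have been shared so far.
report = IncidentReport(
    what_happened="Basket rebalancing failed",
    workaround="Authenticate with CDSL from K***, log out, and rebalance",
)
print(report.missing_items())
```

A report would be considered ready to publish only when `missing_items()` returns an empty list.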

The same problem recurred the next day, and I am still trying to find an acceptable workaround.

Lesson in Accountability

This is my first public commentary on such issues on social networks. It aims to raise awareness in the software community about the importance of robust software design and engineering.

The biggest lesson I tried to impart to my child was this pattern for handling mistakes:

  1. Accept.
  2. Acknowledge.
  3. Apologise.
  4. Inform.
  5. Correct.
  6. Prevent.
  7. Avoid repeating the same mistake.

While the first three steps are often completed, sometimes under pressure, the rest are frequently neglected because the blame is shifted to a third party.

Conclusion

I often find defects and have several real-life examples across many organisations. Over time, I have moved from a combative approach to a more empathetic approach toward those working in software companies, recognising their constraints.

This article highlights the critical need for better software development and customer support practices, ensuring that "aeroplanes" remain in the air without compromising user experience.

All SaaS providers must follow the same approach as the “Air Crash Investigation” series, since their services are equally critical to the public.

PS: This issue occurred on many platforms on the 4th and 5th of June. In my view it is likely to recur, since I don’t believe the problem has been solved to the extent required.
