Who moved my Reliability? B.L.U.E.?


After ~20 years of debugging and troubleshooting, I have identified the following "thieves" of our reliability. I call them "B.L.U.E.".

1/ B - Bugs, Bills

  • Bugs - There are many types of bugs (OS, application code, network, firewall, DNS, etc.). Any of them can cause downtime and hurt your reliability.
  • Bills - Who pays your bills? What if this person goes on PTO or sick leave? Who is the backup? Which credit card is on file? When will it expire? What is its spending limit? In this new world of "Everything as a Service", if you forget to pay your cloud/service bills, your APIs/services will likely suffer when the providers suspend service until they receive your payment.
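The "when will it expire?" question for the card on file can be turned into a small scheduled check. The sketch below is only an illustration: the function names and the 60-day warning window are hypothetical, and it assumes a card stays valid through the last day of its printed expiry month.

```python
import calendar
from datetime import date


def card_expiry_date(year: int, month: int) -> date:
    """Last day of the expiry month (assumption: the card is valid through it)."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


def days_until_card_expires(year: int, month: int, today: date) -> int:
    """Days left before the card on file stops working (negative if expired)."""
    return (card_expiry_date(year, month) - today).days


def needs_attention(year: int, month: int, today: date, warn_days: int = 60) -> bool:
    """True when the card expires within the (hypothetical) warning window."""
    return days_until_card_expires(year, month, today) <= warn_days
```

Running something like `needs_attention(2024, 2, date.today())` from a daily job, and alerting both the bill owner and the backup person, covers the PTO case as well.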

2/ L - License, Limits, Loads

  • License - Is it a temporary or a permanent license? When is the renewal due? Will you update the license before it expires? Many Cloud Marketplace trial products have a 30-day trial limit.
  • Limits - Do you know your configurable limits? There are OS, web server, application, and database limits, e.g. "ulimit -a" in the OS, MySQL max_connections, Java -Xmx heap size, Nginx worker_connections. What are your Internet provider's bandwidth limits? What are your API service limits?
  • Loads - What is your maximum transactions per second (TPS)? Can you handle sudden spikes? A client-side fault, e.g. API errors or a faulty keyboard, can flood your services with a high number of requests per minute (rpm).
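One way to keep an eye on the limits listed above is to compare current usage against each limit and warn before it is reached. The sketch below reads the open-file limit with Python's Unix-only `resource` module (the same value `ulimit -n` reports); the helper names and the 20% headroom threshold are assumptions made for illustration.

```python
import resource  # standard library, Unix-only


def open_file_limits() -> tuple:
    """Return the (soft, hard) open-file limits, as `ulimit -n` / `ulimit -Hn` would."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def headroom(current: int, limit: int) -> float:
    """Fraction of a limit still unused: 1.0 means idle, 0.0 means at the limit."""
    if limit == resource.RLIM_INFINITY:
        return 1.0  # an unlimited resource always has full headroom
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(0.0, 1.0 - current / limit)


def near_limit(current: int, limit: int, threshold: float = 0.2) -> bool:
    """True when less than `threshold` (hypothetical default: 20%) headroom remains."""
    return headroom(current, limit) < threshold
```

The same `near_limit` check can be fed with other pairs, for example active database connections against max_connections, as long as you collect the current value yourself.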

3/ U - Users, Utilization, Unexpected, Unknowns

  • Users - Do you know your users? When do they hit your web site or service the most? e.g. recently the PG&E web site could not handle a sudden spike in users.
  • Utilization - What is your normal CPU, memory, and network utilization? Can your VM resources handle 2x the load? Do you have Kubernetes Horizontal Pod Autoscaling (HPA)? Do you have any auto-scaling at all?
  • Unexpected - Are you ready for cloud provider outages, e.g. AWS S3, Gmail, Azure? Do you have good health checks? Do you have LB/DR set up across regions and/or Availability Zones? What is your Single Point of Failure (SPOF)?
  • Unknowns - Root Cause Analysis is not easy and is often very time-consuming. Do you have contingency plans for unknown situations? What is your workaround strategy?
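To reason about the "2x the load" question, it helps to see how the Kubernetes HPA decides on a replica count. Its documented rule is desiredReplicas = ceil(currentReplicas × currentMetric / desiredMetric), with no change when the ratio is within a tolerance (0.1 by default) of 1.0. The sketch below reproduces only that core rule; the function name and the min/max defaults are assumptions, and the real controller also handles missing metrics, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Core HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 3 pods running at 100% CPU against a 50% target give `desired_replicas(3, 100, 50)`, which is 6. If your max_replicas or your node capacity is below that number, auto-scaling alone will not absorb a 2x spike.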

4/ E - Expired, External, Errors

  • Expired - Every SSL certificate will expire! SSL connections will then fail (e.g. from one microservice endpoint to another). Are you ready? Do you know when your Istio and Kubernetes certificates will expire? Do you know the expiration dates of ALL your time-sensitive components?
  • External - Your CDN, Internet backbone provider, external DNS, AWS S3 buckets, API providers, clouds, etc. can all suddenly go down, completely or partially.
  • Errors - We are HUMAN, and humans make errors. It can be a simple typo, "rm -r /", a copy and paste into the wrong window, a Chef knife edit error, an API configuration error, etc.
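Certificate expiry is one of the few thieves with a known date, so it can be checked ahead of time. A minimal sketch using only Python's standard library follows; the function names are hypothetical, and it assumes the endpoint is reachable over TLS and presents a certificate that the default trust store accepts.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and a certificate's notAfter string."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port.

    With the default context the handshake raises an ssl.SSLError if the
    certificate has already expired, which is itself a useful alert.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("example.com"), datetime.now(timezone.utc))` for every endpoint you own and alert when the result drops below your renewal lead time. Internal certificates, such as those used by Istio or Kubernetes, would need their own CA bundle or a different way of reading the certificate.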

Even if you CHANGE NOTHING in your code/configuration, all of the above "thieves" can still move your cheese, i.e. your reliability, and cause another outage! I hope this helps you be more aware and prepared!

Shrikant B.

DevOps /Application release Engineering at Wells Fargo

5 years ago

Nicely Articulated Walter!! sometimes it’s challenging to Get RCA!!
