Who moved my Reliability? B.L.U.E.?
Walter Lee
GDE, GCP, AWS, Azure Cloud Expert, CKA/S, ex-Oracle, 38k Followers. Many X Certified in Clouds, DevOps & k8s. Hackathons Winner. Writer, Speaker, Mentor. Opinions are my own and not the views of my employer : )
After ~20 years of debugging and troubleshooting, I have seen the following "thieves" of our reliability. I call them "B.L.U.E.".
1/ B - Bugs, Bills
- Bugs - There are many types of bugs (OS, application code, network, firewall, DNS, etc.). Any of them can cause downtime and hurt your reliability.
- Bills - Who pays your bills? What if this person goes on PTO or sick leave? Who is the backup? Which credit card is on file? When will it expire? What is its spending limit? In this new world of "Everything as a Service", if you forget to pay your cloud/service bills, your APIs/services will likely suffer when the providers suspend service until they receive your payment.
2/ L - License, Limits, Loads
- License - Is your license temporary or permanent? When is the renewal due? Did you update the license before it expires? Many Cloud Marketplace trial products have a 30-day time limit.
- Limits - Do you know your configurable limits? There are OS, web server, application, and database limits, e.g. "ulimit -a" in the OS, MySQL max_connections, the Java -Xmx heap size, Nginx worker_connections. What are your Internet provider's bandwidth limits? What are your API service limits?
- Loads - What is your maximum transactions per second (TPS)? Can you handle sudden spikes? A client-side fault, e.g. API errors or a faulty keyboard, can generate a lot of requests per minute (rpm) and flood your services.
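One common way to protect a service from a flooding client is a per-client rate limit. Below is a minimal sketch of a token bucket in Python. The class name, the `rate` and `capacity` parameters, and the injectable `clock` are all my own illustrative choices, not any particular product's API.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`.

    Hypothetical helper for illustration; a real service would keep one
    bucket per client or per API key.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.clock = clock               # injectable so it can be tested
        self.tokens = float(capacity)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `TokenBucket(rate=2, capacity=3)`, a client can burst 3 requests at once and then sustain about 2 per second. Anything beyond that gets a False, which your service could turn into an HTTP 429.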
3/ U - Users, Utilization, Unexpected, Unknowns
- Users - Do you know your users? When do they hit your web site or service the most? e.g. Recently the PG&E web site could not handle a sudden spike in users.
- Utilization - What is your normal CPU, memory, and network utilization? Can your VM resources handle 2x the load? Do you have Kubernetes Horizontal Pod Autoscaling (HPA)? Do you have any autoscaling at all?
- Unexpected - Are you ready for cloud provider outages, e.g. AWS S3, Gmail, Azure? Do you have good health checks? Do you have any LB/DR setup across regions and/or Availability Zones? What is your Single Point of Failure (SPOF)?
- Unknowns - Root Cause Analysis is not easy and is often very time-consuming. Do you have contingency plans for unknown situations? What is your workaround strategy?
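To make the "2x load" question concrete, here is a rough Python sketch. `desired_replicas` uses the basic ratio documented for the Kubernetes HPA (current replicas times current metric over target metric, rounded up). It ignores the tolerance band, stabilization windows, and pod readiness handling that the real controller applies. The function names and the default `min_replicas`/`max_replicas` values are assumptions for illustration only.

```python
import math


def desired_replicas(current_replicas, current_util, target_util,
                     min_replicas=1, max_replicas=10):
    """Replica count suggested by the basic HPA ratio, clamped to the bounds.

    The min/max defaults are hypothetical placeholders.
    """
    if target_util <= 0:
        raise ValueError("target_util must be positive")
    raw = math.ceil(current_replicas * current_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


def can_absorb(current_util, multiplier=2.0, ceiling=1.0):
    """True if utilization (0.0 to 1.0) times the multiplier stays within the ceiling."""
    return current_util * multiplier <= ceiling
```

For example, 4 pods running at 90% CPU against a 60% target suggest 6 pods. A VM sitting at 60% utilization cannot absorb a 2x spike without scaling, while one at 40% can.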
4/ E - Expired, External, Errors
- Expired - Every SSL certificate will expire! SSL connections will then fail (e.g. from one microservice endpoint to another). Are you ready? Do you know when your Istio and Kubernetes certificates will expire? Do you know the expiration dates of ALL your time-sensitive components?
- External - Your CDN, Internet backbone provider, external DNS, AWS S3 buckets, API providers, clouds, etc. can all suddenly go down, completely or partially.
- Errors - We are HUMAN, and humans make errors. It can be a simple typo, "rm -r /", a copy and paste into the wrong window, a Chef knife edit error, an API configuration error, etc.
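For the "Expired" thief, a small script can tell you how many days a certificate has left. This is only a sketch using the Python standard library. The function names and the default port and timeout are my own assumptions, and it only covers certificates served over TLS on a reachable endpoint, not internal Istio or Kubernetes certificates stored on disk.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter timestamp (negative if expired).

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the notAfter string of the certificate served at host:port.

    With the default context, an already expired certificate makes the
    handshake raise ssl.SSLCertVerificationError, which is a signal in itself.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

You could run something like `days_until_expiry(fetch_not_after("example.com"))` from a daily job (the host name is a placeholder) and alert when the result drops below, say, 30 days.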
Even if you CHANGE NOTHING in your code/configuration, all of the above "thieves" can still move your cheese, i.e. your reliability, and cause another outage! I hope this helps you be more aware and prepared!