
Who moved my Reliability ? B.L.U.E ?

Walter Lee

GDE, GCP, AWS, Azure Cloud Expert, CKA/S, ex-Oracle, 38k Followers. Many X Certified in Clouds, DevOps & k8s. Hackathons Winner. Writer, Speaker, Mentor. Opinions are my own and not the views of my employer : )

Published: October 19, 2019

After ~20 years of debugging and troubleshooting, I have seen the following "thieves" of our reliability. I call them "B.L.U.E.".

1/ B - Bugs, Bills

Bugs - There are many types of bugs (OS, application code, network, firewall, DNS, etc.). Any of them can cause downtime and hurt your reliability.
Bills - Who pays your bills? What if this person goes on PTO or sick leave? Who is the backup? Which credit card is on file? When will it expire? What is its spending limit? In this new world of "Everything as a Service", if you forget to pay your cloud/service bills, your APIs/services will likely suffer when the providers suspend service until they receive your payment.

2/ L - License, Limits, Loads

License - Is it a temporary or permanent license? When is the renewal due? Did you update the license before it expired? Many Cloud Marketplace trial products have a 30-day trial limit.
Limits - What are your configurable limits? There are OS, web server, application, and database limits. Do you know them all? e.g. "ulimit -a" in the OS? What are your database limits, e.g. MySQL max_connections? Others, e.g. Java -Xmx heap size? Nginx worker_connections? What are your Internet provider's bandwidth limits? What are your API service limits?
Loads - What is your max transactions per second (TPS)? Can you handle sudden spikes? A client-side fault, e.g. API errors or a faulty keyboard, can generate a lot of requests per minute (rpm) and flood your services.
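To make the Limits questions concrete, here is a minimal Python sketch that compares current usage against configured limits and flags anything close to the ceiling. The helper names (`open_file_limit`, `headroom`, `near_limit`), the 80% threshold, and the usage numbers are my own assumptions for illustration; the only real API used is the standard library `resource` module, which is Unix-only and reads the same value as "ulimit -n".

```python
from __future__ import annotations


def open_file_limit() -> int:
    """Soft limit on open file descriptors (what "ulimit -n" reports).

    Unix-only: the standard library `resource` module is not available on
    Windows, so it is imported here rather than at module level.
    The value may be resource.RLIM_INFINITY when no limit is set.
    """
    import resource

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def headroom(used: float, limit: float) -> float:
    """Fraction of the limit that is still free (0.0 when at or over it)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return max(0.0, 1.0 - used / limit)


def near_limit(usage: dict[str, tuple[float, float]],
               threshold: float = 0.8) -> list[str]:
    """Names whose usage is at or above `threshold` of their limit.

    `usage` maps a name to (used, limit). The 0.8 default is an assumption,
    not a recommendation from any vendor.
    """
    return sorted(
        name
        for name, (used, limit) in usage.items()
        if limit > 0 and used / limit >= threshold
    )


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    sample = {
        "mysql max_connections": (140, 151),
        "nginx worker_connections": (200, 1024),
    }
    print("close to limit:", near_limit(sample))
    print("open file soft limit:", open_file_limit())
```

In practice the "used" numbers would come from your monitoring system; the point is only that every limit you know about should have a matching usage figure and an alert threshold.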

3/ U - Users, Utilization, Unexpected, Unknowns

Users - Do you know your users? When do they hit your web site or service the most? e.g. recently the PG&E web site could not handle a sudden spike in users.
Utilization - What is your normal CPU, memory, and network utilization? Can your VM resources handle 2x the load? Do you have Kubernetes Horizontal Pod Autoscaling (HPA)? Do you have any auto-scaling at all?
Unexpected - Are you ready for cloud provider outages, e.g. AWS S3, Gmail, Azure? Do you have good health checks? Any LB/DR setup across regions and/or Availability Zones? What is your Single Point of Failure (SPOF)?
Unknowns - Root Cause Analysis (RCA) is not easy and is often very time consuming. Do you have contingency plans for unknown situations? What is your workaround strategy?
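For the "can you handle 2x the load?" question, the Kubernetes documentation describes the HPA scaling rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with a default tolerance of 0.1 around the target. The Python sketch below only mirrors that formula so you can do the arithmetic by hand; it is not the HPA controller itself, and the function name, the min/max defaults, and the sample numbers are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the documented HPA formula.

    Simplified on purpose: no stabilization window, no readiness handling,
    no multiple metrics. min/max defaults are illustrative assumptions.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 pods at a 60% CPU target; load doubles, so average CPU reads 120%.
    print(desired_replicas(4, 120, 60))                   # 8
    # Same spike, but max_replicas caps the scale-out at 6.
    print(desired_replicas(4, 120, 60, max_replicas=6))   # 6
```

The second call shows why the maximum matters: if your cap is lower than what a 2x spike needs, auto-scaling alone will not save you.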

4/ E - Expired, External, Errors

source: https://www.beyondtheboxscore.com/2013/6/24/4456142/when-is-an-error-more-likely-to-occur-in-a-game

Expired - Every SSL certificate will expire! SSL connections will then fail (e.g. from one microservice endpoint to another). Are you ready? Do you know when your Istio and Kubernetes certificates will expire? Do you know the expiration dates of ALL your time-sensitive components?
External - Your CDN, Internet backbone provider, external DNS, AWS S3 buckets, API providers, clouds, etc. can all suddenly go down, completely or partially.
Errors - We are HUMAN, and humans make errors. It can be a simple typo, "rm -r /", a copy and paste into the wrong window, Chef knife edit errors, API configuration errors, etc.
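One way to stay ahead of the Expired item is to check certificate end dates on a schedule. The Python sketch below uses only the standard library (`ssl`, `socket`) to read a server certificate's notAfter field and compute the days remaining. The function names, the 30-day warning window, and the host name in the usage example are assumptions for illustration, and this only covers certificates reachable over a TLS endpoint, not internal ones such as Kubernetes or Istio certificates stored on disk.

```python
from __future__ import annotations

import socket
import ssl
import time


def days_until_expiry(not_after: str, now: float | None = None) -> float:
    """Days left before a certificate's notAfter time.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". Negative means already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the server certificate's notAfter.

    Uses default verification, so an already expired or untrusted
    certificate raises ssl.SSLError during the handshake instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(not_after_by_name: dict[str, str],
                  within_days: float = 30,
                  now: float | None = None) -> list[str]:
    """Names whose certificate expires within `within_days` (assumed window)."""
    return sorted(
        name
        for name, not_after in not_after_by_name.items()
        if days_until_expiry(not_after, now) <= within_days
    )


if __name__ == "__main__":
    host = "example.com"  # hypothetical endpoint; replace with your own
    print(host, round(days_until_expiry(fetch_not_after(host)), 1), "days left")
```

The same "name, expiry date, warning window" pattern applies to the other time-sensitive items in this list, such as licenses and credit cards on file.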

Even if you CHANGE NOTHING in your code/configuration, all of the above "thieves" can still move your cheese, i.e. your reliability, and cause another outage! I hope this helps you be more aware and prepared!

Shrikant B.

DevOps /Application release Engineering at Wells Fargo

5 years ago

Nicely articulated, Walter!! Sometimes it's challenging to get an RCA!!


