A curious case of slow APIs
a slooooooooow API trying to make its way through the HTTP world


You're a tech lead who's just joined a team and on your first day, the CTO tells you:

"Our video processing API is very slow, looks like it's a problem with AWS EFS that we are using. Take a look and fix it."

---

You log in to AWS and check EFS. There's a single big EFS file system with 2 TB of storage, mounted on 12 EC2 instances. You explore all the options in the AWS EFS console, but nothing unusual stands out. Nothing unusual in CloudWatch either.

You're not sure how you should really test EFS for slowness.

---

With that, you try to figure out just how slow the video processing API actually is. There's no observability, so it's difficult to know latencies and other metrics. You log in to a few EC2 machines in prod and start tailing nginx logs in the hope of finding something.

---

You write some bash pipelines with grep, awk, and tr to filter the logs for that specific API and redirect the output to a file. You see that some requests take less than 800 ms while others take more than 8 seconds.
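For reference, the same filtering can be sketched in Python instead of a shell pipeline. The log path, the endpoint, and the assumption that nginx's $request_time is the last field on each line are all placeholders; they depend on the actual log_format in use.

```python
# Rough Python equivalent of the grep/awk/tr pipeline (a sketch, not the exact commands used).
slow, fast = [], []

with open("/var/log/nginx/access.log") as f:           # placeholder path
    for line in f:
        if "/api/v1/videos/process" not in line:       # hypothetical endpoint
            continue
        try:
            latency = float(line.split()[-1])           # assumes $request_time is the last field
        except ValueError:
            continue
        (slow if latency > 8 else fast).append(latency)

print(f"requests under 8s: {len(fast)}, over 8s: {len(slow)}")
```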

"What's going on?" - you're thinking.

---

Still, there's no way to know why this API is only sometimes slow. There's no tracing; all you have are plain logs. You decide to read the source code in the hope of finding out how this API uses EFS.

---

The code is a bit complex, with no tests or documentation. It reads a bunch of video files from EFS, combines them, and stores the final file back in EFS.

You think you have found the culprit.

---

Surely,

  • reading the files from EFS,
  • combining them, and
  • writing them back to the EFS

is causing all the latency.

You decide to add good old print statements to the code, printing a timestamp for each of the above steps along with the unique request ID.
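A minimal sketch of that kind of instrumentation, assuming nothing about the real code; the sleeps stand in for the actual EFS read, combine, and write calls so the example runs on its own:

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def timed(step, request_id):
    """Print how long one step of the request took -- the 'timestamps plus
    request id' print-statement approach described above, as a helper."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"req={request_id} step={step} took_ms={elapsed_ms:.1f}")

# Placeholders: sleeps stand in for the real EFS operations.
request_id = str(uuid.uuid4())
with timed("read_efs", request_id):
    time.sleep(0.02)   # stand-in for reading the video chunks from EFS
with timed("combine", request_id):
    time.sleep(0.05)   # stand-in for concatenating the chunks
with timed("write_efs", request_id):
    time.sleep(0.02)   # stand-in for writing the final file back to EFS
```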

---

To your surprise, all the EFS reading, combining, and writing operations finish in milliseconds. You're confused. Where on earth is the 8-plus seconds of latency coming from?

You take another look at the code.

---

At the end of the function, the code marks a bunch of rows as "processed" in a DB. That can't be the reason, can it? You add another set of print statements and deploy the code. Sure enough, this DB operation is taking 7-8 seconds for some API requests.
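A sketch of that measurement around the DB step. The video_chunks table is made up, and an in-memory sqlite DB stands in for the real database purely so the example runs; the point is timing the "mark as processed" UPDATE the same way as the EFS steps:

```python
import sqlite3
import time

# Hypothetical schema: a "processed" flag on rows describing video chunks.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE video_chunks (id INTEGER PRIMARY KEY, request_id TEXT, processed INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO video_chunks (request_id) VALUES (?)", [("req-123",)] * 50
)

# Time the same kind of UPDATE the API runs at the end of the request.
start = time.monotonic()
conn.execute("UPDATE video_chunks SET processed = 1 WHERE request_id = ?", ("req-123",))
conn.commit()
print(f"mark_processed_ms={(time.monotonic() - start) * 1000:.1f}")
```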

---

Not familiar with the DB schema, you rope in a DBA for debugging. The DBA tells you that a background worker runs a stored procedure on the same set of tables.

You probe further.

---

You find out that the background worker has a config called "processing delay", in seconds, and it's set to 1 on production. That means the background process tries to process records that are older than just 1 second. You're still not sure what's really going on.
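To make the setting concrete, here is roughly what a worker query driven by such a "processing delay" could look like. The video_chunks table, the column names, and sqlite itself are assumptions for illustration; the real worker runs a stored procedure whose internals you don't have.

```python
import sqlite3

# Illustrative only: the worker's selection is shaped like
# "rows not yet processed and older than PROCESSING_DELAY seconds".
PROCESSING_DELAY_SECONDS = 1   # the value found on production

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE video_chunks (
        id INTEGER PRIMARY KEY,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        processed INTEGER DEFAULT 0
    )
""")
conn.execute("INSERT INTO video_chunks DEFAULT VALUES")

rows = conn.execute(
    "SELECT id FROM video_chunks "
    "WHERE processed = 0 AND created_at < datetime('now', ?)",
    (f"-{PROCESSING_DELAY_SECONDS} seconds",),
).fetchall()
print(f"rows the background worker would pick up: {len(rows)}")
```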

---

You ask whether this API was always slow or whether something changed recently.

"It was working okay before", people say, but no one knows when it really started becoming slow.

We got to know about this issue when customers reported that their video processing is broken or slow.

---

Fast-forwarding the story:

You ask around and find out that the previous value for "processing delay" was 300 seconds, i.e. 5 minutes! The local config value of 1 second got pushed to prod in a recent commit, and that's when latencies started increasing.

---

You fix the config value and, sure enough, latencies are back to normal: less than 1 second for all APIs. The workers had been picking up data that was barely 1 second old, resulting in a lot of queries and possibly lock contention on the DB. That's what was causing the APIs to become slow.

---

Recap:

You went from

  • debugging EFS,
  • to tailing logs to find API latency,
  • to debugging the DB schema and queries,
  • and finally landed on a config fix

that worked.

Real-life debugging is pretty messy. The more situations you have to debug, the better you become at it.

---

I write such stories about software engineering. There's no specific frequency, as I don't make these up. If you liked this one, you might love this: https://www.dhirubhai.net/posts/chinmay185_youre-a-team-lead-whos-just-back-from-a-activity-7041718255962472448-qDIZ


Follow me, Chinmay Naik, for more such stuff.

Ankur S.

Tech Lead @ Bridgenext

1y

Insightful post. Apart from the config part you mentioned, here are some points that could be part of a checklist, or food for thought (just my view):

1. Good to have multiple read replicas for the DB; if it's sharded with the correct partition key, even better.
2. Be aware of the queries frequently fired on the backend, including aggregations with never-ending pipelines, huge map-reduces, and tables/indexes/collections queried frequently with no indexes in place (there should be well-defined unique/primary keys and indexes).
3. DB queries that are highly optimized, peer-reviewed by seniors, and well-tested on production-level data clones are highly recommended for checking latency.
4. Ensure caching and cache buffer sizes in the DB are well configured to serve multiple requests that ask for huge data; they act like an edge in a CDN. Caching at a higher, application-facing layer (Redis, DynamoDB, Memcached, etc.) is also recommended, so the found data is returned without even bothering the repository layer just below the service layer, as per Domain-Driven Design.
5. Review the schema's field modeling before the production release.

I have tonnes of things but reached the max char limit... duh!!
