A curious case of slow APIs
a slooooooooow API trying to make its way through the HTTP world


You're a tech lead who's just joined a team and on your first day, the CTO tells you:

"Our video processing API is very slow, looks like it's a problem with AWS EFS that we are using. Take a look and fix it."

---

You log in to AWS and check EFS. There's a single big EFS file system with 2 TB of storage, mounted on 12 EC2 instances. You explore all the options in the AWS EFS console, but nothing unusual stands out. Nothing unusual in CloudWatch either.

You're not sure how you should really test EFS for slowness.

---

With that, you try to figure out just how slow the video processing API actually is. There's no observability, so it's difficult to know latencies and other metrics. You log in to a few EC2 machines in prod and start tailing nginx logs in the hope of finding something.

---

You write some bash pipelines with grep, awk, and tr to filter the logs for that specific API and redirect the output to a file. You see that some requests take less than 800 ms while others take more than 8 seconds.
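For reference, the same filtering can be sketched in Python instead of a shell pipeline. The log path, the endpoint, and the assumption that nginx's $request_time is the last field on each line are all placeholders; they depend on the actual log_format in use.

```python
# Rough Python equivalent of the grep/awk/tr pipeline (a sketch, not the exact commands used).
slow, fast = [], []

with open("/var/log/nginx/access.log") as f:           # placeholder path
    for line in f:
        if "/api/v1/videos/process" not in line:       # hypothetical endpoint
            continue
        try:
            latency = float(line.split()[-1])           # assumes $request_time is the last field
        except ValueError:
            continue
        (slow if latency > 8 else fast).append(latency)

print(f"requests under 8s: {len(fast)}, over 8s: {len(slow)}")
```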

"What's going on?" - you're thinking.

---

Still, there's no way to know why this API is only sometimes slow. There's no tracing; all you have are plain logs. You decide to read the source code in the hope of finding out how this API uses EFS.

---

The code is a bit complex, with no tests or documentation. It reads a bunch of video files from EFS, combines them, and stores the final file back in EFS.

You think you have found the culprit.

---

Surely,

  • reading the files from EFS,
  • combining them, and
  • writing them back to the EFS

is causing all the latency.

You decide to add good old print statements to the code, printing a timestamp for each of the above steps along with the unique request ID.
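A minimal sketch of that kind of instrumentation, assuming nothing about the real code; the sleeps stand in for the actual EFS read, combine, and write calls so the example runs on its own:

```python
import time
import uuid
from contextlib import contextmanager

@contextmanager
def timed(step, request_id):
    """Print how long one step of the request took -- the 'timestamps plus
    request id' print-statement approach described above, as a helper."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"req={request_id} step={step} took_ms={elapsed_ms:.1f}")

# Placeholders: sleeps stand in for the real EFS operations.
request_id = str(uuid.uuid4())
with timed("read_efs", request_id):
    time.sleep(0.02)   # stand-in for reading the video chunks from EFS
with timed("combine", request_id):
    time.sleep(0.05)   # stand-in for concatenating the chunks
with timed("write_efs", request_id):
    time.sleep(0.02)   # stand-in for writing the final file back to EFS
```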

---

To your surprise, all the EFS reading, combining, and writing operations finish in milliseconds. You're confused. Where on earth is the 8-plus seconds of latency coming from?

You take another look at the code.

---

At the end of the function, the code marks a bunch of rows as "processed" in a DB. That can't be the reason, can it? You add another set of print statements and deploy the code. Sure enough, this DB operation is taking 7-8 seconds for some API requests.
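A sketch of that measurement around the DB step. The video_chunks table is made up, and an in-memory sqlite DB stands in for the real database purely so the example runs; the point is timing the "mark as processed" UPDATE the same way as the EFS steps:

```python
import sqlite3
import time

# Hypothetical schema: a "processed" flag on rows describing video chunks.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE video_chunks (id INTEGER PRIMARY KEY, request_id TEXT, processed INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO video_chunks (request_id) VALUES (?)", [("req-123",)] * 50
)

# Time the same kind of UPDATE the API runs at the end of the request.
start = time.monotonic()
conn.execute("UPDATE video_chunks SET processed = 1 WHERE request_id = ?", ("req-123",))
conn.commit()
print(f"mark_processed_ms={(time.monotonic() - start) * 1000:.1f}")
```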

---

Not familiar with the DB schema, you rope in a DBA for debugging. The DBA tells you that a background worker runs a stored procedure on the same set of tables.

You probe further.

---

You find out that the background worker has a config called "processing delay", in seconds, and it's set to 1 on production. That means the background process tries to process records that are older than just 1 second. You're still not sure what's really going on.
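To make the setting concrete, here is roughly what a worker query driven by such a "processing delay" could look like. The video_chunks table, the column names, and sqlite itself are assumptions for illustration; the real worker runs a stored procedure whose internals you don't have.

```python
import sqlite3

# Illustrative only: the worker's selection is shaped like
# "rows not yet processed and older than PROCESSING_DELAY seconds".
PROCESSING_DELAY_SECONDS = 1   # the value found on production

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE video_chunks (
        id INTEGER PRIMARY KEY,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        processed INTEGER DEFAULT 0
    )
""")
conn.execute("INSERT INTO video_chunks DEFAULT VALUES")

rows = conn.execute(
    "SELECT id FROM video_chunks "
    "WHERE processed = 0 AND created_at < datetime('now', ?)",
    (f"-{PROCESSING_DELAY_SECONDS} seconds",),
).fetchall()
print(f"rows the background worker would pick up: {len(rows)}")
```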

---

You ask whether this API was always slow or whether something changed recently.

"It was working okay before", people say, but no one knows when it really started becoming slow.

We got to know about this issue when customers reported that their video processing is broken or slow.

---

Fast-forwarding the story:

You ask around and find out that the previous value for "processing delay" was 300 seconds, i.e. 5 minutes! The local config value of 1 second got pushed to prod in a recent commit, and that's when latencies started increasing.

---

You fix the config value and, sure enough, latencies are back to normal: less than 1 second for all APIs. The workers had been picking up data that was barely 1 second old, resulting in a lot of queries and possibly lock contention on the DB. That's what was causing the APIs to become slow.

---

Recap:

You went from

  • debugging EFS,
  • to tailing logs to find API latency,
  • to debugging the DB schema and queries,
  • and finally landed on a config fix

that worked.

Real-life debugging is pretty messy. The more situations you have to debug, the better you become at it.

---

I write such stories about software engineering. There's no specific frequency, as I don't make these up. If you liked this one, you might love this: https://www.dhirubhai.net/posts/chinmay185_youre-a-team-lead-whos-just-back-from-a-activity-7041718255962472448-qDIZ


Follow me, Chinmay Naik, for more such stuff.

Ankur S.

Tech Lead @ Bridgenext

1y

Insightful post. Apart from the config part you mentioned, here are some points that could be part of a checklist, or food for thought (just my view):

1. Good to have multiple read replicas for the DB; if it's sharded with the correct partition key, even better.
2. Be aware of the queries frequently fired on the backend, including aggregations with never-ending pipelines, huge map-reduces, and tables/indexes/collections queried frequently with no indexes in place (there should be well-defined unique/primary keys and indexes).
3. DB queries that are highly optimized, peer-reviewed by seniors, and well-tested on production-level data clones are highly recommended for checking latency.
4. Ensure caching and cache buffer sizes in the DB are well configured to serve multiple requests that ask for huge data; they act like an edge in a CDN. Caching at a higher, application-facing layer (Redis, DynamoDB, Memcached, etc.) is also recommended, so the found data is returned without even bothering the repository layer just below the service layer, as per Domain-Driven Design.
5. Review the schema's field modeling before the production release.

I have tonnes of things but reached the max char limit... duh!!
