Monitoring 3rd party API response time to proactively improve performance (moebel.de case)
3rd party API response time measurement for web apps by Ivan Vokhmin

Web apps (websites with an SPA) are usually monitored for performance as “black boxes”: their response time (Time To First Byte, TTFB) is measured without looking at what is happening inside the app. But even a fine-tuned app still calls slow or unstable 3rd party APIs (or your own backend APIs) to render user-requested content.

And suddenly a question arises: why did the app response time change without us releasing anything?

Black box approach

For a long time, web app response time was taken for granted. There are tools that can monitor and report it consistently, like Freshping/Pingdom and CWV/Lighthouse. However, only a developer benchmarking a request on a local machine gets full insight into the app's internal timings. Automatically reporting app response time is usually enough for simple requests that invoke one or two external APIs, if any. The app is treated as a black box: it has a request and a response time. If a request takes too long, a developer has to go in with a debugger.

The problem with that approach is the lack of proactiveness. Since the app is a black box, we know it got slower, but is that due to changed code or degraded backend performance? It is especially hard to tell during business hours when deployments are frequent.

Proactive approach

Reacting to changed response times always poses a challenge. A change can be related to a recent release or to features enabled through feature flags, but some changes are hard to trace to their origin. Therefore, apart from monitoring the app's external properties (like CPU/memory/network usage), we should look inside the black box to see how much time is spent communicating with external APIs and services.

At moebel.de, I implemented the following approach:

  • for some users, we generate a special parameter at request time (server-side), called traceId
  • when a traceId is present, we send networking statistics for external APIs to our metrics collector
  • statistics are collected by wrapping every request (e.g. every fetch) with a function that transparently measures and reports the request time, like:

const response = await wrapPromiseWithReporter(
  fetch('https://some-api.com'),
  'API name to show on the graph',
  traceId,
  { /* extra API logging info */ });
// ... use response data ...
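The article does not show the wrapper itself, so here is a minimal sketch of what such a function could look like. The names `reportMetric` and `collectedMetrics` are hypothetical stand-ins for the real metrics-collector client:

```javascript
// Collected metrics (in production this would be sent to a metrics backend).
const collectedMetrics = [];

// Hypothetical reporter; the real one would be fire-and-forget over the network.
function reportMetric(metric) {
  collectedMetrics.push(metric);
}

// Measure how long the wrapped promise takes and report the duration
// together with the API name and traceId. The promise result is passed
// through unchanged, so callers can use the wrapper transparently.
async function wrapPromiseWithReporter(promise, apiName, traceId, extra = {}) {
  const start = Date.now();
  try {
    return await promise;
  } finally {
    if (traceId) {
      // Only report for sampled requests that carry a traceId.
      reportMetric({ apiName, traceId, durationMs: Date.now() - start, ...extra });
    }
  }
}
```

Because the reporting happens in a `finally` block, failed requests are measured too, and the caller still receives the original resolution or rejection.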

These measurements help us collect the following data:

  • how much time networking with external APIs takes in general
  • how much time is spent on 3rd party API networking for each user (the sum of all networking time per traceId)
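The per-user aggregation from the second bullet can be sketched as a simple sum over reported metrics, grouped by traceId. The function name and the metric shape `{ apiName, traceId, durationMs }` are assumptions for illustration:

```javascript
// Sum all reported API durations per traceId to get the total 3rd party
// networking time for each sampled user request.
function totalNetworkingTimeByTrace(metrics) {
  const totals = new Map();
  for (const { traceId, durationMs } of metrics) {
    totals.set(traceId, (totals.get(traceId) || 0) + durationMs);
  }
  return totals;
}
```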

Of course, this approach has downsides. We don't measure time to first byte but the whole request-response time, especially when 3rd party libraries (like the Algolia SDK) are involved, so the stats are not perfect. Reporting the metrics also costs time, network, and CPU, which is why we generate a traceId only for ~1% of users, chosen at random.

Performance tuning

By investigating every response time for a single traceId, we can tell which APIs contribute the most to 3rd party networking time and slow down our webserver responses. Those are good candidates for internal caching to improve performance.
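As a minimal illustration of such an internal cache, here is a sketch of an in-memory cache with a time-to-live; `createTtlCache` is a hypothetical helper, and a production setup would more likely use Redis or an LRU library:

```javascript
// Minimal in-memory TTL cache for slow API responses.
function createTtlCache(ttlMs) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry || Date.now() > entry.expiresAt) {
        // Expired or missing: drop it and report a miss.
        entries.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: Date.now() + ttlMs });
    },
  };
}
```

On a cache hit, the webserver skips the slow API call entirely, which directly cuts the networking time the tracing surfaced.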

Real case study

One day, we saw a visible response time increase that was hard to trace. It was not related to any of our recent product deployments (see picture 1 below).


Picture 1. TTFB for web-app page

However, on the API response timing graph it was very clear which API was misbehaving (after filtering, it looks like picture 2 below).

Picture 2. Internal API response time


It took us some time and effort to bring the API back to normal, but after the fix, response times returned to sane values.

traceId on the client side

Measurement reporting doesn't stop at our next.js servers. The traceId is preserved in the Next.js page data and transferred to the client, so we can also see how long web clients take to interact with different APIs. This heavily depends on the user's connection: some cellular users see insanely long response times from some APIs (like 30 seconds) on slow 2G/3G connections. Still, we get valuable anonymous field data from real devices.
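With the Next.js pages router, handing the traceId to the client can be as simple as putting it into the page props; this sketch assumes the traceId was attached to the request object earlier by server middleware, which is a hypothetical detail:

```javascript
// Sketch of a Next.js pages-router data function (would be exported from a
// page module): pass the server-generated traceId down to the client so
// client-side requests can report under the same id.
async function getServerSideProps({ req }) {
  // Assumption: middleware stored the sampled traceId on the request.
  // Props must be JSON-serializable, so a missing id becomes null.
  const traceId = req.traceId ?? null;
  return { props: { traceId } };
}
```

On the client, the page component then reads `props.traceId` and, when it is non-null, passes it to the same reporting wrapper used on the server.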

Conclusion

Measuring and reporting 3rd party and internal API response times for a small percentage of users helps improve web app performance and lets us act proactively when response times of 3rd party services increase unexpectedly.
