Monitoring 3rd party API response time to proactively improve performance (moebel.de case)
Ivan Vokhmin
Lead Engineer Frontend @ moebel.de Einrichten & Wohnen GmbH | AWS, Team Leadership, Software Architecture, AI
Usually web apps (websites with an SPA) are monitored for performance as “black boxes”: their response time (Time To First Byte, TTFB) is measured without looking at what is happening inside the app. Even if the app itself is fine-tuned, it will still call slow or unstable 3rd party APIs (or your own backend APIs) to render user-requested content.
And suddenly a question arises: why did the app response time change without us releasing anything?
Black box approach
For some time, web app response time was taken for granted. There are tools that can monitor and report it consistently, like Freshping/Pingdom and CWV/Lighthouse. However, only a developer benchmarking a request on a local machine gets full insight into the app's internal timings. Automatically reporting app response time is usually enough for simple requests that invoke one or two external APIs, if any. The app is treated as a black box: it has a request and a response time. If it takes too long, a debugger must go in.
The problem with that approach is the lack of proactiveness. Since the app is a black box, we know it got slower, but is it due to changed code or degraded backend performance? It is especially hard to tell during business hours when deployments are frequent.
Proactive approach
Reacting to changed response times always poses a challenge. A change can be related to recent releases or to features enabled through feature flags, but some changes are hard to trace to their origin. Therefore, apart from monitoring our app's external properties (like CPU/memory/network usage), we should dive into the black box and see how much time is spent communicating with external APIs and services.
At moebel.de, I implemented the following approach:
const response = await wrapPromiseWithReporter(
  fetch('https://some-api.com'),
  'API name to show on the graph',
  traceId,
  { /* extra API logging info */ },
);
// ... use response data ...
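A minimal sketch of what such a wrapper could look like, assuming a simple fire-and-forget reporting helper (reportTiming and the metrics endpoint URL are assumptions for illustration, not our exact implementation):

// Hypothetical reporting transport: in reality this could be any metrics endpoint or log sink.
async function reportTiming(payload: Record<string, unknown>): Promise<void> {
  await fetch('https://metrics.example.com/report', {
    method: 'POST',
    body: JSON.stringify(payload),
  });
}

// Wraps an already-started promise (e.g. a fetch) and reports how long it took to settle.
async function wrapPromiseWithReporter<T>(
  promise: Promise<T>,
  apiName: string,
  traceId: string | undefined,
  extra: Record<string, unknown> = {},
): Promise<T> {
  // If this request was not sampled, pass the promise through untouched.
  if (!traceId) return promise;

  const start = Date.now();
  try {
    return await promise;
  } finally {
    const durationMs = Date.now() - start;
    // Fire-and-forget: reporting must never slow down or break the actual request.
    reportTiming({ apiName, traceId, durationMs, ...extra }).catch(() => {});
  }
}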
These measures give us, for every sampled request, the response time of each external API call, labeled with the API name and tied together by the traceId.
Of course, this approach has some downsides. We don't measure time to first byte, we measure the whole request-response time, especially when 3rd party libraries are involved (like the Algolia SDK), so the stats are not perfect. Reporting the metrics also costs time, network and CPU, which is why we generate a traceId only for ~1% of random users.
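The sampling itself can be very simple. A sketch of what it might look like (per-request sampling shown for simplicity; the rate and helper name are illustrative assumptions):

import { randomUUID } from 'crypto';

const SAMPLE_RATE = 0.01; // ~1% of traffic gets measured

function maybeCreateTraceId(): string | undefined {
  // No traceId means the reporting wrapper skips measurement entirely.
  return Math.random() < SAMPLE_RATE ? randomUUID() : undefined;
}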
Performance tuning
By investigating every response time for one traceId, we can say which APIs contribute the most to 3rd party networking time and slow down our webserver responses. Those APIs are good candidates for internal caching solutions to improve performance, as in the sketch below.
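For a slow but rarely changing API, even a small in-memory cache in front of the fetch can cut most of that networking time. A minimal sketch, assuming a 1-minute staleness is acceptable for the API in question (TTL and helper name are assumptions):

const cache = new Map<string, { expiresAt: number; data: unknown }>();
const TTL_MS = 60_000; // assumption: 1 minute of staleness is acceptable here

async function fetchWithCache(url: string): Promise<unknown> {
  const hit = cache.get(url);
  if (hit && hit.expiresAt > Date.now()) return hit.data; // serve cached response

  const response = await fetch(url);
  const data = await response.json();
  cache.set(url, { expiresAt: Date.now() + TTL_MS, data });
  return data;
}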
Real case study
One day, we got a visible response time increase that was hard to trace. It was not related to any of our recent product deployments (see picture 1 below).
However, in the API response timing graph, it was very clear which API was misbehaving (it looks like this after filtering down, see picture 2 below).
It took us some time and effort to bring the API back to normal, but after the fix, response times returned to sane values.
traceId on client side
Reporting measurements doesn't stop at our Next.js servers. The traceId is preserved in the Next.js data payload and transferred to the client, so we can also see how much time web clients spend interacting with different APIs. This heavily depends on the user's connection: some cellular users see insanely long response times from some APIs (like 30 seconds) on slow 2G/3G connections. Still, we get valuable anonymous field data from real devices.
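One way to hand the traceId over to the client is through page props, so client-side calls can reuse the same wrapper. A sketch under that assumption (the prop shape and the reuse of the sampling helper from the earlier sketch are illustrative, not our exact setup):

import type { GetServerSideProps } from 'next';

export const getServerSideProps: GetServerSideProps = async () => {
  // maybeCreateTraceId is the hypothetical sampling helper from the sketch above.
  const traceId = maybeCreateTraceId();
  // Props must be JSON-serializable, so undefined becomes null.
  return { props: { traceId: traceId ?? null } };
};

// On the client, the prop feeds the same reporting wrapper, e.g.:
// const data = await wrapPromiseWithReporter(fetch('/api/search'), 'search', traceId ?? undefined);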
Conclusion
Measuring and reporting 3rd party and internal API response times for a small percentage of users can help improve web app performance and lets us act proactively on unforeseen increases in 3rd party response times.