Monitoring 3rd party API response time to proactively improve performance (moebel.de case)
3rd party API response time measurement for web apps by Ivan Vokhmin

Web apps (websites with an SPA) are usually monitored for performance as “black boxes”: their response time (Time To First Byte, TTFB) is measured without looking at what is happening inside the app. But even a fine-tuned app still calls slow or unstable 3rd party APIs (or your own backend APIs) to render user-requested content.

And suddenly a question arises: why did the app response time change without us releasing anything?

Black box approach

For a long time, web app response time was taken for granted. There are tools that can monitor and report it consistently, like Freshping/Pingdom and CWV/Lighthouse. However, only a developer benchmarking a request on a local machine gets full insight into the app's internal timings. Automatically reporting app response time is usually enough for simple requests that invoke one or two external APIs, if any. The app is treated as a black box: it has a request and a response time. If a request takes too long, a developer has to go in with a debugger.

The problem with that approach is the lack of proactiveness. Since the app is a black box, we know it got slower, but is that due to changed code or degraded backend performance? It is especially hard to tell during business hours when deployments are frequent.

Proactive approach

Reacting to changed response times always poses a challenge. A change can be related to a recent release or to features enabled through feature flags, but some changes are hard to trace to their origin. Therefore, apart from monitoring the app's external properties (like CPU/memory/network usage), we should look inside the black box to see how much time is spent communicating with external APIs and services.

At moebel.de, I implemented the following approach:

  • for some users, we generate a special parameter at request time (server-side), called traceId
  • when a traceId is present, we send networking statistics for external APIs to our metrics collector
  • statistics are collected by wrapping every request (e.g. every fetch) with a function that transparently measures and reports the request time, like:

const response = await wrapPromiseWithReporter(
  fetch('https://some-api.com'),
  'API name to show on the graph',
  traceId,
  { /* extra API logging info */ });
// ... use response data ...
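The article does not show the wrapper itself, so here is a minimal sketch of what such a function could look like. The names `reportMetric` and `collectedMetrics` are hypothetical stand-ins for the real metrics-collector client:

```javascript
// Collected metrics (in production this would be sent to a metrics backend).
const collectedMetrics = [];

// Hypothetical reporter; the real one would be fire-and-forget over the network.
function reportMetric(metric) {
  collectedMetrics.push(metric);
}

// Measure how long the wrapped promise takes and report the duration
// together with the API name and traceId. The promise result is passed
// through unchanged, so callers can use the wrapper transparently.
async function wrapPromiseWithReporter(promise, apiName, traceId, extra = {}) {
  const start = Date.now();
  try {
    return await promise;
  } finally {
    if (traceId) {
      // Only report for sampled requests that carry a traceId.
      reportMetric({ apiName, traceId, durationMs: Date.now() - start, ...extra });
    }
  }
}
```

Because the reporting happens in a `finally` block, failed requests are measured too, and the caller still receives the original resolution or rejection.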

These measurements help us collect the following data:

  • how much time networking with external APIs takes in general
  • how much time is spent on 3rd party API networking for each user (the sum of all networking time per traceId)
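The per-user aggregation from the second bullet can be sketched as a simple sum over reported metrics, grouped by traceId. The function name and the metric shape `{ apiName, traceId, durationMs }` are assumptions for illustration:

```javascript
// Sum all reported API durations per traceId to get the total 3rd party
// networking time for each sampled user request.
function totalNetworkingTimeByTrace(metrics) {
  const totals = new Map();
  for (const { traceId, durationMs } of metrics) {
    totals.set(traceId, (totals.get(traceId) || 0) + durationMs);
  }
  return totals;
}
```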

Of course, this approach has downsides. We don't measure time to first byte but the whole request-response time, especially when 3rd party libraries (like the Algolia SDK) are involved, so the stats are not perfect. Reporting the metrics also costs time, network, and CPU, which is why we generate a traceId only for ~1% of users, chosen at random.

Performance tuning

By investigating every response time for a single traceId, we can tell which APIs contribute the most to 3rd party networking time and slow down our webserver responses. Those are good candidates for internal caching to improve performance.
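As a minimal illustration of such an internal cache, here is a sketch of an in-memory cache with a time-to-live; `createTtlCache` is a hypothetical helper, and a production setup would more likely use Redis or an LRU library:

```javascript
// Minimal in-memory TTL cache for slow API responses.
function createTtlCache(ttlMs) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry || Date.now() > entry.expiresAt) {
        // Expired or missing: drop it and report a miss.
        entries.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: Date.now() + ttlMs });
    },
  };
}
```

On a cache hit, the webserver skips the slow API call entirely, which directly cuts the networking time the tracing surfaced.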

Real case study

One day, we saw a visible response time increase that was hard to trace. It was not related to any of our recent product deployments (see picture 1 below).


Picture 1. TTFB for web-app page

However, on the API response timing graph it was very clear which API was misbehaving (after filtering, it looks like picture 2 below).

Picture 2. Internal API response time


It took us some time and effort to bring the API back to normal, but after the fix, response times returned to sane values.

traceId on the client side

Measurement reporting doesn't stop at our next.js servers. The traceId is preserved in the Next.js page data and transferred to the client, so we can also see how long web clients take to interact with different APIs. This heavily depends on the user's connection: some cellular users see insanely long response times from some APIs (like 30 seconds) on slow 2G/3G connections. Still, we get valuable anonymous field data from real devices.
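With the Next.js pages router, handing the traceId to the client can be as simple as putting it into the page props; this sketch assumes the traceId was attached to the request object earlier by server middleware, which is a hypothetical detail:

```javascript
// Sketch of a Next.js pages-router data function (would be exported from a
// page module): pass the server-generated traceId down to the client so
// client-side requests can report under the same id.
async function getServerSideProps({ req }) {
  // Assumption: middleware stored the sampled traceId on the request.
  // Props must be JSON-serializable, so a missing id becomes null.
  const traceId = req.traceId ?? null;
  return { props: { traceId } };
}
```

On the client, the page component then reads `props.traceId` and, when it is non-null, passes it to the same reporting wrapper used on the server.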

Conclusion

Measuring and reporting 3rd party and internal API response times for a small percentage of users helps improve web app performance and lets us act proactively when response times of 3rd party services increase unexpectedly.
