REST API backend optimization on Heroku

I'd like to share our experience optimizing our REST API backend running on Heroku, bringing average response times down from ~850ms to ~150ms. Our user base grew from about 1,300 users last October to 130k+ today, so as more and more people started using the app, we soon ran into problems with speed and responsiveness. We analyzed our queries and checked our database indexes, and still we were far from where we wanted to be. We scaled our web dynos vertically, which helped a bit, then added the awesome HireFire service to scale our dynos horizontally in times of increased traffic, which helped further, but the app could still slow down considerably during peak traffic. So it was time to dig in some more.

Luckily, Heroku offers the awesome New Relic add-on, which I warmly suggest you use to monitor your backend performance. With it, we could see we were spending quite some time in the database (PostgreSQL) layer. The problem was that we were still on the fairly basic standard-0 plan, which offers 1GB of RAM. Our data size already exceeded 5GB, so Postgres could not cache it in memory, resulting in slow hard-drive I/O. At that time, the average response time was ~850ms, as can be seen in the screenshot below. We upgraded the Postgres plan to standard-4 with 15GB of RAM and, sure enough, immediately improved performance by over 450ms.

[Screenshot: New Relic showing an average response time of ~850ms before the database upgrade]
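One quick way to check whether your dataset still fits in Postgres's memory is to look at the buffer cache hit ratio. Below is a minimal sketch of how that can be done from a Django shell using a raw cursor and the standard pg_statio_user_tables statistics view; the function name and the ~99% rule of thumb are illustrative, not taken from our codebase.

```python
# Rough check of the Postgres buffer cache hit ratio from a Django shell.
# A ratio well below ~0.99 suggests the working set no longer fits in RAM
# and many reads are going to disk. (Illustrative sketch, not production code.)
from django.db import connection


def buffer_cache_hit_ratio():
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT sum(heap_blks_hit)::float
                   / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0)
            FROM pg_statio_user_tables;
        """)
        return cursor.fetchone()[0]


print(f"cache hit ratio: {buffer_cache_hit_ratio():.3f}")
```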

Awesome, the responsiveness of the app improved immediately. Still, we were seeing our dynos scale up frequently. We found out that our backend (Django/Python) was opening and closing a database connection on every request. PgBouncer to the rescue! PgBouncer is a lightweight connection pooler for Postgres, available on Heroku as a buildpack, and it is easy to set up. It helps limit the number of connections to your database (on smaller plans you can quickly run out of connections), but more importantly it enables connection reuse, so we no longer have to go through the expensive opening of a database connection on every request. Boom. Another 150ms shaved off our response times.

Additionally, we had one very slow endpoint (due to lots of dynamic data). We managed to identify all possible input combinations (16 of them) and decided to store the precomputed results in Redis: a 7-second endpoint went down to 250ms. Now we're talking! We immediately noticed much less scaling of our dynos and further improved responsiveness.
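Here is a minimal sketch of that kind of result caching using Django's cache framework backed by Redis (for example via django-redis). The endpoint name, cache key layout, timeout, and the build_report helper are illustrative assumptions, not our actual code.

```python
# Illustrative sketch: cache one fully computed result per input combination
# in Redis through Django's cache framework (e.g. backed by django-redis).
# Names, keys and the timeout are placeholders, not our production code.
from django.core.cache import cache

CACHE_TIMEOUT = 60 * 15  # seconds; refresh the cached result every 15 minutes


def build_report(combination_id):
    """Placeholder for the expensive dynamic-data query this endpoint runs."""
    ...


def get_report(combination_id):
    key = f"report:v1:{combination_id}"  # one key per known combination
    data = cache.get(key)
    if data is None:  # cache miss: compute once, then serve from Redis
        data = build_report(combination_id)
        cache.set(key, data, CACHE_TIMEOUT)
    return data
```

With only 16 possible combinations, the whole cache can also be warmed ahead of time from a periodic task, so users never hit the slow path at all.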

But still, for no apparent reason, our dynos would scale up even in times of low traffic. Something was up and needed investigation. It turns out that the WSGI server we were using, Gunicorn, has problems with slow clients: when a client on a poor internet connection uses your app, it ties up one of your Gunicorn worker processes until all the data has been transferred to that client. Mind you, Heroku incorrectly suggests using Gunicorn on its own as your web server, even though the Gunicorn documentation itself says you need to put a proxy server capable of buffering slow clients in front of it:

Gunicorn uses a pre-forking process model by default. This means that network requests are handed off to a pool of worker processes, and that these worker processes take care of reading and writing the entire HTTP request to the client. If the client has a fast network connection, the entire request/response cycle takes a fraction of a second. However, if the client is slow (or deliberately misbehaving), the request can take much longer to complete. Because Gunicorn has a relatively small (2x CPU cores) pool of workers, it can only handle a small number of concurrent requests. If all the worker processes become tied up waiting for network traffic, the entire server will become unresponsive. To the outside world, your web application will cease to exist. (source)
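For context, a typical Gunicorn setup on Heroku sizes its worker pool off the CPU count, roughly as in the sketch below (an illustrative gunicorn.conf.py, not our exact configuration). With synchronous workers, each slow client occupies one of those few workers for the entire duration of the response.

```python
# gunicorn.conf.py -- illustrative sketch of a small synchronous worker pool.
# With "sync" workers, every slow client holds a worker until the whole
# response has been delivered, so a handful of bad connections can exhaust
# the pool and stall the app.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # commonly recommended formula
worker_class = "sync"                          # Gunicorn's default worker type
timeout = 30                                   # kill a worker stuck for 30 seconds
```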

So one option was to replace Gunicorn with the Waitress WSGI server, but we didn't have much luck with that: we experienced erratic behavior and slow response times, and trying a couple of different settings didn't help.
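For reference, serving a Django app with Waitress looks roughly like this; the module path, port, and thread count below are placeholders, not the settings we tried. Waitress buffers request and response data in its main I/O thread and hands work to a small thread pool, which is why it is usually more tolerant of slow clients than bare synchronous Gunicorn.

```python
# run_waitress.py -- illustrative sketch of serving a Django WSGI app with Waitress.
# The project module, port and thread count are placeholders.
from waitress import serve

from myproject.wsgi import application  # hypothetical Django project module

serve(application, host="0.0.0.0", port=8000, threads=8)
```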

So the remaining option was to put Nginx in front of Gunicorn: Nginx quickly accepts the response from Gunicorn, buffers it, and deals with the slow client itself, so the worker process is freed almost immediately. Voilà! We are now down to ~150ms response times, and we're really happy with the app's responsiveness. Even though it is hard to measure the impact on our users directly or immediately, we believe they will appreciate a smoother application, and this should have a positive effect on our churn rate.

[Screenshot: New Relic showing an average response time of ~150ms after the optimizations]

Optimization is an ongoing process. We spend 2-3 days each month making sure we stay up to speed. We suggest you do too.
