DevOps Nightmare: Apache going down

A few years back we had deployed Apache 2.4.0 on AWS EC2 running Amazon Linux 1. Because of certain dependencies and plain oversight it was never updated, and we reached a point where we were still running PHP 5.6 with MySQL 5.6. AWS was pushing us to upgrade to at least MySQL 5.7 and PHP 7.2, but we were stuck: we feared we would not be able to test the entire tech stack and guarantee a bug-free experience for the 100+ applications deployed on our load-balanced cluster.

This fear, and the lack of an upgrade plan, led to sporadic issues in early 2021, but come June, Apache started going down every few hours and later every few minutes. We worked hard to put monitoring in place so that Apache was restarted whenever it failed, but we were still not willing to take up an upgrade of the entire application stack, as it would open a can of worms. Our problem was that most of the applications had been onboarded from other companies and we did not have enough documentation or insight to do a gap analysis or identify risks early. We were running real-time webhooks for over 50 services, credit card processing, subscription management and a lot of other important financial transactions.
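
The watchdog itself was nothing sophisticated. A minimal sketch of the idea, assuming Amazon Linux 1 with sysvinit-style service management; the script name, log path and schedule are illustrative, not our exact setup:

    #!/bin/bash
    # apache-watchdog.sh - restart httpd if it stops answering locally.
    # Assumes Amazon Linux 1 (sysvinit "service" command) and curl installed.
    if ! curl -fs -o /dev/null --max-time 5 http://127.0.0.1/; then
        echo "$(date) httpd not responding, restarting" >> /var/log/apache-watchdog.log
        service httpd restart
    fi

    # Cron entry to run it every minute:
    # * * * * * /usr/local/bin/apache-watchdog.sh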

But it was getting worse every day, with clients noticing the downtime and complaining. So we decided to debug the issue. We read the Apache error logs and access logs but could not find much. Then we suspected a Linux security patch had broken something, so we rolled back the cluster, but that did not help either. Finally we decided to fine-tune Apache. We made the following optimisations (a consolidated configuration sketch follows the list):

  1. Reviewed all the modules we were loading and trimmed them down to what we absolutely needed
  2. Changed the MPM from prefork to event
  3. Configured MinSpareServers, MaxSpareServers and StartServers
  4. Set up MaxRequestsPerChild, KeepAlive and KeepAliveTimeout
  5. Enabled compression and caching
  6. Moved static assets to CloudFront
  7. Tweaked MySQL and PHP settings to enable persistent connections
  8. Set up PHP-FPM
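
To give an idea of what this looked like, here is a minimal sketch of the kind of httpd.conf tuning involved. The numbers are placeholders rather than our production values, and it assumes a reasonably recent Apache 2.4.x (the unix-socket proxy form needs 2.4.10 or later) with the event MPM and PHP-FPM. Note that with the event MPM the spare-capacity knobs are thread-based rather than the prefork MinSpareServers/MaxSpareServers directives.

    # Event MPM sizing; values below are placeholders, tune per instance size
    <IfModule mpm_event_module>
        StartServers             4
        MinSpareThreads         64
        MaxSpareThreads        192
        ThreadsPerChild         64
        MaxRequestWorkers      512
        # MaxConnectionsPerChild is the 2.4 name for MaxRequestsPerChild
        MaxConnectionsPerChild 10000
    </IfModule>

    # Keep-alive: reuse connections without tying up workers for too long
    KeepAlive            On
    MaxKeepAliveRequests 100
    KeepAliveTimeout     3

    # Compress text responses (mod_deflate + mod_filter)
    AddOutputFilterByType DEFLATE text/html text/css application/javascript application/json

    # Hand PHP off to PHP-FPM via mod_proxy_fcgi instead of mod_php
    <FilesMatch "\.php$">
        SetHandler "proxy:unix:/run/php-fpm/www.sock|fcgi://localhost"
    </FilesMatch>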

After all this our applications were running noticeably faster and we thought we had put server downtime behind us, but it was too early to relax. After a few days things went bad again and we were completely out of ideas.

Then we decided to monitor the database and found that we were getting far too many update requests against certain tables, even though our application code was not supposed to issue them, and we could not find the origin of the queries. As a workaround we limited the connections from specific applications so that only the misbehaving application would fail, which certainly helped. To investigate further and determine the source of the anomaly we started monitoring the communication between the database and the application servers. netstat came to the rescue here.
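
The exact commands were ad hoc, but a minimal sketch of the idea, run on each application server and assuming MySQL on its default port 3306:

    # How many connections does this box hold open to MySQL right now?
    netstat -tn | awk '$5 ~ /:3306$/ && $6 == "ESTABLISHED"' | wc -l

    # Same idea, broken down by destination (useful with more than one DB endpoint)
    netstat -tn | awk '$5 ~ /:3306$/ && $6 == "ESTABLISHED" {print $5}' | sort | uniq -c | sort -rn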

In all the chaos, our DevOps team proposed moving from Apache to NGINX and to the latest Amazon Linux 2 on the newest EC2 instance types available. We agreed to take the risk. We split our applications in two: old applications that still needed PHP 5.6 to run, and newer ones that we knew would work well on the latest stack without much trouble. This migration did the trick: average response time dropped from about 1 second to 300 ms, and page load times went from 1.5 seconds to around 500 ms.
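
For the curious, the new stack was the common NGINX in front of PHP-FPM arrangement. A bare-bones sketch with a hypothetical server name, document root and socket path; our real vhosts carried far more application-specific configuration:

    server {
        listen 80;
        server_name example.com;
        root /var/www/app/public;
        index index.php index.html;

        # Serve static files directly; everything else falls through to PHP
        location / {
            try_files $uri $uri/ /index.php?$query_string;
        }

        # Pass PHP requests to PHP-FPM over a unix socket
        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_pass unix:/run/php-fpm/www.sock;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        }
    }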

Lessons learnt:

  1. Always upgrade to the latest stable release of the OS
  2. Refresh server hardware often; in AWS that means moving to newer instance types
  3. Monitor internal and external communication
  4. Rotate logs and ensure they are scanned often
  5. Keep MySQL, PHP and other dependencies up to date
  6. Break large clusters into smaller single-purpose clusters
  7. Rely more on middle layers like ElastiCache and CloudFront for better and faster delivery of content
  8. Monitor RDS and set up alarms for unexpected situations (an example alarm follows the list)
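
On the last point, even a single CloudWatch alarm goes a long way. A hypothetical example using the AWS CLI that fires when an RDS instance holds an unusually high number of connections; the instance identifier, threshold and SNS topic ARN are placeholders:

    # Alarm when DatabaseConnections stays above 200 for two consecutive
    # 5-minute periods and notify an SNS topic.
    aws cloudwatch put-metric-alarm \
        --alarm-name "rds-too-many-connections" \
        --namespace AWS/RDS \
        --metric-name DatabaseConnections \
        --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
        --statistic Average \
        --period 300 \
        --evaluation-periods 2 \
        --threshold 200 \
        --comparison-operator GreaterThanThreshold \
        --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts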

If you are facing a similar situation and need help resolving application performance issues, or need to stress test your application, we can help. Talk to us to figure out the right strategy to track down server issues, identify pitfalls in your existing applications and find gaps in your AWS infrastructure.

Reach out to us - [email protected]
