DevOps Nightmare: Apache going down
Saurabh Bhandari
CTO, Ensuring Your Business Flows Smoothly With Custom Technology Solutions | AWS Certified Solutions Architect – Professional
A few years back we had deployed Apache 2.4.0 in AWS on EC2 Amazon Linux 1. Because of certain dependencies and oversight it was never updated, and we reached a point where we had PHP 5.6 running with MySQL 5.6. AWS was pushing us to upgrade to at least MySQL 5.8 and PHP 7.2, but we were stuck, afraid that we would not be able to test the entire tech stack and ensure a bug-free experience for the over 100 applications deployed on a load-balanced cluster.
This fear and the lack of an upgrade plan led to incidental issues in early 2021, but come June, Apache started going down every few hours, and later every few minutes. We worked hard to put monitoring in place so that Apache was restarted whenever it failed, but we were still not willing to take up an upgrade of the entire application stack, as it would open a can of worms. Our problem was that most of the applications had been onboarded from other companies, and we did not have enough documentation and insight to do a gap analysis or identify risks early. We were running real-time webhooks for over 50 services, credit card processing, subscription management and a lot of other important financial transactions.
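The restart-on-failure monitoring itself was the easy part. As a minimal sketch (assuming Amazon Linux 1 with the stock httpd service; the health-check URL and paths are placeholders, not our exact setup), a cron watchdog like the one below is enough to bring Apache back up within a minute of a failure:

    # /etc/cron.d/httpd-watchdog -- illustrative only; URL and paths are placeholders
    # Every minute: if the local health check fails, restart Apache and log the event
    * * * * * root curl -sf --max-time 5 -o /dev/null http://127.0.0.1/health || (service httpd restart && logger "httpd watchdog restarted Apache")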
But it was getting worse every day, with clients noticing the downtime and complaining. So we decided to debug the issue. We read the Apache error logs and access logs but could not find much. Then we thought a security patch applied to Linux had broken things, so we rolled the cluster back, but that did not help. Then we decided to fine-tune Apache and applied the optimisations below.
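The exact directives and values are not reproduced here, but for a typical mod_php + prefork setup (the usual arrangement for PHP 5.6 on Apache), this tuning revolves around worker limits, keep-alive behaviour and request timeouts. The snippet below is an illustrative sketch with placeholder values, not our production settings:

    # Illustrative Apache 2.4 tuning -- placeholder values, not production settings.
    # Cap concurrent prefork workers so the box cannot swap itself to death,
    # and recycle children regularly to contain PHP memory growth.
    <IfModule mpm_prefork_module>
        StartServers             10
        MinSpareServers          10
        MaxSpareServers          20
        ServerLimit             256
        MaxRequestWorkers       256
        MaxConnectionsPerChild 1000
    </IfModule>

    # Keep-alive: serve repeat requests on one connection, but do not let idle
    # browsers hold workers hostage
    KeepAlive            On
    MaxKeepAliveRequests 100
    KeepAliveTimeout     5

    # mod_reqtimeout: drop slow or stuck clients instead of waiting on them
    RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500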
After doing all this our applications were running quite fast, and we thought we had put server downtime behind us, but it was too early to relax. After a few days things went bad again and we were totally out of ideas.
Then we decided to monitor the database and figured out we were getting too many requests to update certain database tables, but our application code was not supposed to do that, and we could not find the origin of the queries. So as a workaround we tried to limit the connections from specific applications, so that only the misbehaving application would fail. This certainly helped. To investigate further and determine the source of this anomaly, we started monitoring the communication between the database and the application servers. netstat was a great tool to come to the rescue.
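As a rough sketch of that bookkeeping (assuming MySQL on its default port 3306; these are illustrative commands, not a transcript of what we ran):

    # On the database host: count established MySQL connections per client IP
    netstat -ant | awk '$4 ~ /:3306$/ && $6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn

    # On an application server: which local processes own the connections to the database?
    sudo netstat -antp | grep ':3306'

On the MySQL side, one way to ring-fence a suspect application is to cap its database user with GRANT USAGE ... WITH MAX_USER_CONNECTIONS, so that only that application hits the ceiling when it misbehaves.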
In all the chaos, our DevOps team proposed moving from Apache to NGINX and to the latest Amazon Linux 2, on the newest server types available in EC2. We agreed to take this risk. We split our applications into two groups: old applications that needed PHP 5.6 to run, and newer ones that we knew could work well on the latest stack without much trouble. This migration did the trick, and we were able to reduce our average response time from 1 second to 300 ms. Our page load times also went from 1.5 seconds to 500 ms.
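One way to run both groups side by side on NGINX is to route each application to a PHP-FPM pool matching its PHP version. The server block below is a minimal sketch of that idea; the server name, document root and socket path are placeholders, not our actual configuration:

    # Illustrative NGINX vhost; names, roots and socket paths are placeholders
    server {
        listen 80;
        server_name legacy-app.example.com;
        root /var/www/legacy-app/public;

        location / {
            try_files $uri $uri/ /index.php?$query_string;
        }

        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
            # Legacy apps point at a PHP 5.6 FPM pool; newer apps at a PHP 7.x pool
            fastcgi_pass unix:/var/run/php56-fpm.sock;
        }
    }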
Lessons learnt:
If you face a similar situation and need help resolving application performance issues, or need to stress test your application, we can help. Talk to us to figure out the right strategy to track down server issues. We can help you identify pitfalls in existing applications and gaps in your AWS infra.
Reach out to us - [email protected]