DevOps Nightmare: Apache going down
Saurabh Bhandari
CTO, Ensuring Your Business Flows Smoothly With Custom Technology Solutions | AWS Certified Solutions Architect – Professional
A few years back we had deployed Apache 2.4.0 in AWS on EC2 Amazon Linux 1. Because of certain dependencies and oversight it was never updated, and we reached a point where we had PHP 5.6 running with MySQL 5.6. AWS was pushing us to upgrade to at least MySQL 5.8 and PHP 7.2, but we were stuck, afraid that we would not be able to test the entire tech stack and ensure a bug-free experience for the over 100 applications deployed on a load-balanced cluster.
This fear and the lack of an upgrade plan led to incidental issues in early 2021, but come June, Apache started going down every few hours, and later every few minutes. We worked hard to put monitoring in place so that Apache was restarted whenever it failed, but we were still not willing to take up an upgrade of the entire application stack, as it would open a can of worms. Our problem was that most of the applications had been onboarded from other companies, and we did not have enough documentation and insight to do a gap analysis or identify risks early. We were running real-time webhooks for over 50 services, credit card processing, subscription management and a lot of other important financial transactions.
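The restart-on-failure monitoring itself was the easy part. As a minimal sketch (assuming Amazon Linux 1 with the stock httpd service; the health-check URL and paths are placeholders, not our exact setup), a cron watchdog like the one below is enough to bring Apache back up within a minute of a failure:

    # /etc/cron.d/httpd-watchdog -- illustrative only; URL and paths are placeholders
    # Every minute: if the local health check fails, restart Apache and log the event
    * * * * * root curl -sf --max-time 5 -o /dev/null http://127.0.0.1/health || (service httpd restart && logger "httpd watchdog restarted Apache")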
But it was getting worse every day, with clients noticing the downtime and complaining. So we decided to debug the issue. We read the Apache error logs and access logs but could not find much. Then we thought a security patch applied to Linux had broken things, so we rolled the cluster back, but that did not help. Then we decided to fine-tune Apache and applied the optimisations below.
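The exact directives and values are not reproduced here, but for a typical mod_php + prefork setup (the usual arrangement for PHP 5.6 on Apache), this tuning revolves around worker limits, keep-alive behaviour and request timeouts. The snippet below is an illustrative sketch with placeholder values, not our production settings:

    # Illustrative Apache 2.4 tuning -- placeholder values, not production settings.
    # Cap concurrent prefork workers so the box cannot swap itself to death,
    # and recycle children regularly to contain PHP memory growth.
    <IfModule mpm_prefork_module>
        StartServers             10
        MinSpareServers          10
        MaxSpareServers          20
        ServerLimit             256
        MaxRequestWorkers       256
        MaxConnectionsPerChild 1000
    </IfModule>

    # Keep-alive: serve repeat requests on one connection, but do not let idle
    # browsers hold workers hostage
    KeepAlive            On
    MaxKeepAliveRequests 100
    KeepAliveTimeout     5

    # mod_reqtimeout: drop slow or stuck clients instead of waiting on them
    RequestReadTimeout header=20-40,MinRate=500 body=20,MinRate=500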
After doing all this our applications were running quite fast, and we thought we had put server downtime behind us, but it was too early to relax. After a few days things went bad again and we were totally out of ideas.
Then we decided to monitor the database and figured out we were getting too many requests to update certain database tables, but our application code was not supposed to do that, and we could not find the origin of the queries. So as a workaround we tried to limit the connections from specific applications, so that only the misbehaving application would fail. This certainly helped. To investigate further and determine the source of this anomaly, we started monitoring the communication between the database and the application servers. netstat was a great tool to come to the rescue.
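As a rough sketch of that bookkeeping (assuming MySQL on its default port 3306; these are illustrative commands, not a transcript of what we ran):

    # On the database host: count established MySQL connections per client IP
    netstat -ant | awk '$4 ~ /:3306$/ && $6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn

    # On an application server: which local processes own the connections to the database?
    sudo netstat -antp | grep ':3306'

On the MySQL side, one way to ring-fence a suspect application is to cap its database user with GRANT USAGE ... WITH MAX_USER_CONNECTIONS, so that only that application hits the ceiling when it misbehaves.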
In all the chaos, our DevOps team proposed moving from Apache to NGINX and to the latest Amazon Linux 2, on the newest server types available in EC2. We agreed to take this risk. We split our applications into two groups: old applications that needed PHP 5.6 to run, and newer ones that we knew could work well on the latest stack without much trouble. This migration did the trick, and we were able to reduce our average response time from 1 second to 300 ms. Our page load times also went from 1.5 seconds to 500 ms.
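One way to run both groups side by side on NGINX is to route each application to a PHP-FPM pool matching its PHP version. The server block below is a minimal sketch of that idea; the server name, document root and socket path are placeholders, not our actual configuration:

    # Illustrative NGINX vhost; names, roots and socket paths are placeholders
    server {
        listen 80;
        server_name legacy-app.example.com;
        root /var/www/legacy-app/public;

        location / {
            try_files $uri $uri/ /index.php?$query_string;
        }

        location ~ \.php$ {
            include fastcgi_params;
            fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
            # Legacy apps point at a PHP 5.6 FPM pool; newer apps at a PHP 7.x pool
            fastcgi_pass unix:/var/run/php56-fpm.sock;
        }
    }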
Lessons learnt:
If you face a similar situation and need help resolving application performance issues, or need to stress test your application, we can help. Talk to us to figure out the right strategy to track down server issues. We can help you identify pitfalls in existing applications and gaps in your AWS infra.
Reach out to us - [email protected]