Running Ad Exchange at 100K+ QPS: Infrastructure
Introduction
The three questions that engineers at InMobi most love to answer, and boast about, are:
What is the traffic volume you handle daily?
What is the maximum concurrent traffic volume?
What is the SLA you maintain?
The answers to these questions are pretty humongous.
On a good day we handle more than 10 Billion Ad Requests arising from all corners of the world.
This translates to peak traffic of 100,000+ QPS (Queries per second).
Our SLAs are in Milliseconds.
In this blog, and the series of blogs that will follow, I will walk you through our journey of building such a system. This particular post focuses on the infrastructure aspect of running an Ad Exchange. As it is aptly said, Rome was not built in a day, and neither are highly scalable and robust systems. Every day brings new challenges and learnings for us, and through this series I will also walk through our mistakes, the solutions we found, and the lessons we took from them.
Geographical Segregation of Requests
When the SLA in question is milliseconds, even the distance between the point where the request originates and the point from where it is served matters. To address this, at InMobi we maintain multiple Data Centers at different locations across the world. A request arising from a device ends up in the closest DC with the help of GSLB (Global Server Load Balancing).
How GSLB Works: With ordinary DNS, when a client sends a Domain Name System (DNS) request, it receives a list of IP addresses for the domain or service. Generally, the client chooses the first IP address in the list and initiates a connection with that server. The DNS server uses a technique called DNS round robin to rotate through the IPs on the list: after responding to each DNS request, it sends the first IP address to the end of the list and promotes the others. This technique ensures roughly equal distribution of load, but it does not support disaster recovery, load balancing based on server load or proximity, or persistence.
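The rotation described above can be sketched in a few lines of Python. This is a toy model of a DNS server's answer list, not a real resolver; the IP addresses are illustrative.

```python
from collections import deque

class RoundRobinDNS:
    """Toy model of DNS round robin: each answer rotates the IP list."""
    def __init__(self, ips):
        self._ips = deque(ips)

    def resolve(self):
        answer = list(self._ips)  # full list in current order
        self._ips.rotate(-1)      # move the first IP to the end for the next client
        return answer

dns = RoundRobinDNS(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(dns.resolve()[0])  # 10.0.0.1 — first client connects here
print(dns.resolve()[0])  # 10.0.0.2 — next client gets the rotated list
```

Because each client simply takes the first IP, successive clients land on successive servers, which is exactly why the technique balances load but cannot account for a dead or overloaded server.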
In the GSLB world, the appliances use the DNS infrastructure to connect the client to the data center that best meets the criteria you set. The criteria can designate the least loaded data center, the closest data center, the data center that responds most quickly to requests from the client’s location, or a combination of those metrics. At InMobi we use the closest data center combined with the load-handling capacity of each Data Center.
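The "closest DC, subject to capacity" policy can be sketched as below. All of the data-center names, latencies, and load figures are hypothetical; a real GSLB appliance measures these continuously rather than reading them from a table.

```python
# Hypothetical GSLB selection: prefer the closest DC (lowest RTT),
# but skip any DC already at its load ceiling.
DATA_CENTERS = {
    "us-east":  {"rtt_ms": 40,  "load": 0.95, "capacity": 0.90},
    "eu-west":  {"rtt_ms": 90,  "load": 0.60, "capacity": 0.90},
    "ap-south": {"rtt_ms": 180, "load": 0.40, "capacity": 0.90},
}

def pick_dc(dcs):
    # Eligible DCs are those still below their load ceiling.
    eligible = {name: dc for name, dc in dcs.items() if dc["load"] < dc["capacity"]}
    if not eligible:  # everything saturated: fall back to the least loaded DC
        return min(dcs, key=lambda n: dcs[n]["load"])
    return min(eligible, key=lambda n: eligible[n]["rtt_ms"])

print(pick_dc(DATA_CENTERS))  # eu-west: us-east is closest but over capacity
```

Note how the closest DC (us-east in this made-up example) is skipped because it is over its capacity threshold, which is the combined proximity-plus-load behavior described above.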
Load Balancing
Load balancing refers to efficiently distributing incoming network traffic across a group of backend servers, also known as a server farm or server pool. A Load Balancer (LB) acts as a “traffic cop” sitting in front of your servers, routing client requests across all servers capable of fulfilling them in a manner that maximizes speed and capacity utilization and ensures that no single server is overworked, which could degrade performance. If a server goes down, the load balancer redirects traffic to the remaining online servers; when a new server is added to the group, the load balancer automatically starts sending requests to it. In this manner, a load balancer ensures both efficient utilization and high availability.
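The "traffic cop" behavior can be illustrated with a minimal least-connections balancer. This is a sketch, not any particular LB product; the backend names are made up, and a real LB would also run active health checks.

```python
class LoadBalancer:
    """Minimal sketch: route each request to the healthy backend with the
    fewest in-flight connections; failed backends are skipped."""
    def __init__(self, servers):
        self.conns = {s: 0 for s in servers}
        self.healthy = {s: True for s in servers}

    def route(self):
        live = [s for s in self.conns if self.healthy[s]]
        if not live:
            raise RuntimeError("no healthy backends")
        target = min(live, key=lambda s: self.conns[s])
        self.conns[target] += 1
        return target

    def mark_down(self, server):
        # In practice a health check would flip this automatically.
        self.healthy[server] = False

lb = LoadBalancer(["ad-1", "ad-2", "ad-3"])
print(lb.route())     # ad-1 (all backends tied at zero connections)
lb.mark_down("ad-1")
print(lb.route())     # ad-2: traffic shifts away from the failed server
```

Adding a new server is just another key in the connection map, which mirrors how a real LB starts sending traffic to a freshly registered backend.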
Hardware vs. Software Load Balancing: Load balancers typically come in two flavors: hardware-based and software-based. Vendors of hardware-based solutions load proprietary software onto the machines they provide, which often use specialized processors. Software solutions generally run on commodity hardware, making them less expensive and more flexible. At InMobi we use a mix of hardware and software LBs: Citrix NetScaler ADC (Application Delivery Controller) as the hardware LB, and Nginx and HAProxy as software LBs.
Citrix NetScaler ADC does the dual task of LB and GSLB, extending the core L4 and L7 capabilities across geographically distributed server farms. At InMobi, NetScaler also guarantees 100% uptime in the event of a whole-DC failure by shifting the entire traffic of that DC to an available DC. Nginx is primarily used to cut off our Ad servers and return a No-Ad in case the Ad servers are unable to maintain the strict SLA for any reason. The InMobi Ad server uses HAProxy to talk to internally load-balanced services, databases, and applications.
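The Nginx cutoff behavior amounts to a hard timeout: if the backend misses the budget, answer immediately with an empty No-Ad response rather than keep the caller waiting. A minimal sketch of that pattern (the 50 ms budget and the simulated ad server are illustrative, not InMobi's actual numbers or code):

```python
import concurrent.futures
import time

AD_SLA_SECONDS = 0.05  # illustrative 50 ms budget

def ad_server(latency):
    """Stand-in for the real ad server: takes `latency` seconds to answer."""
    time.sleep(latency)
    return (200, "ad-markup")

def serve(latency):
    """Fail-fast cutoff: give up and answer No-Ad (HTTP 204, empty body)
    the moment the backend exceeds the SLA budget."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(ad_server, latency)
        try:
            return future.result(timeout=AD_SLA_SECONDS)
        except concurrent.futures.TimeoutError:
            return (204, "")  # No-Ad: empty response, still inside SLA

print(serve(0.01))  # (200, 'ad-markup')  backend answered within budget
print(serve(0.20))  # (204, '')           backend too slow, No-Ad returned
```

The key property is that the slow path costs the caller only the SLA budget, not the backend's full latency, which is what keeps a millisecond-level SLA intact even when an ad server degrades.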
Alerting and Monitoring
In high-scale systems like Ad Exchanges, alerting and monitoring become the crux of your system. On our servers, turning on debug logging for even 15 seconds produces log files of up to 3 GB. In such a situation, if you do not have proper monitoring in place, finding any issue becomes near impossible. Alerts are equally important if you want to maintain 100% uptime by catching system degradation before your consumers notice it. The major components we use for alerting and monitoring are the following:
Graphite: Graphite is the backbone of our alerting and monitoring infrastructure. It is an enterprise-ready monitoring tool that runs equally well on cheap hardware or cloud infrastructure. Teams use Graphite to track the performance of their applications, services, and networked servers. It provides a new generation of monitoring tools, making it easier than ever to store, retrieve, share, and visualize time-series data.
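Feeding a metric into Graphite is pleasantly simple: its plaintext protocol accepts lines of the form `metric.path value unix_timestamp` over TCP (port 2003 by default). A sketch, with a hypothetical host name and metric path:

```python
import socket
import time

GRAPHITE_HOST, GRAPHITE_PORT = "graphite.example.com", 2003  # hypothetical host

def graphite_line(path, value, timestamp=None):
    # Graphite plaintext protocol: "metric.path value unix_timestamp\n"
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"

def send_metric(path, value):
    """Open a connection and push one metric line to Graphite's Carbon listener."""
    line = graphite_line(path, value)
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=1) as sock:
        sock.sendall(line.encode("ascii"))

print(graphite_line("exchange.adserver.qps", 101234, 1500000000), end="")
# exchange.adserver.qps 101234 1500000000
```

In practice, high-QPS services batch and aggregate metrics locally (e.g. via a StatsD-style relay) rather than opening a connection per data point.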
Grafana: We use Grafana as the dashboarding platform for all applications and services on top of Graphite. Grafana includes a built-in Graphite query parser that takes writing Graphite metric expressions to a whole new level; expressions in Grafana are easier to read and faster to edit than ever.
Nagios: Nagios is a powerful tool that provides instant awareness of your organization’s mission-critical IT infrastructure. It allows you to detect and repair problems, and to mitigate future issues before they affect end users and customers. We configure Nagios to connect to Graphite to monitor critical infrastructure components, including system metrics, network protocols, applications, services, servers, and network infrastructure.
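A Nagios check of a Graphite metric boils down to the standard plugin contract: print a one-line status and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A sketch of such a threshold check; the QPS values and thresholds are illustrative, and a real plugin would fetch the value from Graphite's render API:

```python
# Nagios plugin convention: exit codes 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_threshold(value, warn, crit, label="qps"):
    """Nagios-style check: compare a metric value (e.g. one fetched from
    Graphite) against warning/critical thresholds."""
    if value >= crit:
        return CRITICAL, f"CRITICAL - {label}={value} (>= {crit})"
    if value >= warn:
        return WARNING, f"WARNING - {label}={value} (>= {warn})"
    return OK, f"OK - {label}={value}"

code, message = check_threshold(95000, warn=90000, crit=110000)
print(message)  # WARNING - qps=95000 (>= 90000)
# A real plugin would print this line and then call sys.exit(code).
```

Nagios schedules such checks periodically, and a WARNING or CRITICAL result is what ultimately triggers the alert pipeline described next.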
PagerDuty: PagerDuty’s digital operations management platform empowers teams to proactively mitigate customer-impacting issues by automatically turning any signal into the right insight and action. At InMobi, all the alerts raised by Nagios are fed into PagerDuty, which helps us resolve on-call issues quickly. PagerDuty also handles on-call rotations and ensures proper escalation policies are followed for every incident.
I hope you enjoyed reading this blog if you have made it this far. Leave behind your comments and valuable suggestions, and do mention which components you would like me to dig deeper into. The second blog in the series, which focuses on the overall System Design and Architecture of the InMobi Exchange (IX), can be accessed here: Running Ad Exchange at 100K+ QPS - System Design.