Running Ad Exchange at 100K+ QPS: Infrastructure
Introduction
The three questions that engineers at InMobi most love to answer, and boast about, are:
What is the traffic volume you handle daily?
What is the maximum concurrent traffic volume?
What is the SLA you maintain?
The answers to these questions are pretty humongous.
On a good day we handle more than 10 Billion Ad Requests arising from all corners of the world.
This translates to peak traffic of 100,000+ QPS (Queries per second).
Our SLAs are in Milliseconds.
In this blog, and the series of blogs that will follow, I will walk you through our journey of building such a system. This particular post focuses on the infrastructure aspect of running an Ad Exchange. As it is aptly said, Rome was not built in a day, and neither are highly scalable and robust systems. Every day brings new challenges and learnings for us, and through this series I will also walk through our mistakes, the solutions we found, and the lessons we took from them.
Geographical Segregation of Requests
When the SLA in question is milliseconds, even the distance between the point where the request originates and the point from where it is served matters. To address this, at InMobi we maintain multiple Data Centers at different locations across the world. A request arising from a device ends up in the closest DC with the help of GSLB (Global Server Load Balancing).
How GSLB Works: With ordinary DNS, when a client sends a Domain Name System (DNS) request, it receives a list of IP addresses for the domain or service. Generally, the client chooses the first IP address in the list and initiates a connection with that server. The DNS server uses a technique called DNS round robin to rotate through the IPs on the list: after responding to each DNS request, it sends the first IP address to the end of the list and promotes the others. This technique ensures roughly equal distribution of load, but it does not support disaster recovery, load balancing based on server load or proximity, or persistence.
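The rotation described above can be sketched in a few lines of Python. This is a toy model of a DNS server's answer list, not a real resolver; the IP addresses are illustrative.

```python
from collections import deque

class RoundRobinDNS:
    """Toy model of DNS round robin: each answer rotates the IP list."""
    def __init__(self, ips):
        self._ips = deque(ips)

    def resolve(self):
        answer = list(self._ips)  # full list in current order
        self._ips.rotate(-1)      # move the first IP to the end for the next client
        return answer

dns = RoundRobinDNS(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(dns.resolve()[0])  # 10.0.0.1 — first client connects here
print(dns.resolve()[0])  # 10.0.0.2 — next client gets the rotated list
```

Because each client simply takes the first IP, successive clients land on successive servers, which is exactly why the technique balances load but cannot account for a dead or overloaded server.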
In the GSLB world, the appliances use the DNS infrastructure to connect the client to the data center that best meets the criteria you set. The criteria can designate the least loaded data center, the closest data center, the data center that responds most quickly to requests from the client’s location, or a combination of those metrics. At InMobi we use the closest data center combined with the load-handling capacity of each Data Center.
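The "closest DC, subject to capacity" policy can be sketched as below. All of the data-center names, latencies, and load figures are hypothetical; a real GSLB appliance measures these continuously rather than reading them from a table.

```python
# Hypothetical GSLB selection: prefer the closest DC (lowest RTT),
# but skip any DC already at its load ceiling.
DATA_CENTERS = {
    "us-east":  {"rtt_ms": 40,  "load": 0.95, "capacity": 0.90},
    "eu-west":  {"rtt_ms": 90,  "load": 0.60, "capacity": 0.90},
    "ap-south": {"rtt_ms": 180, "load": 0.40, "capacity": 0.90},
}

def pick_dc(dcs):
    # Eligible DCs are those still below their load ceiling.
    eligible = {name: dc for name, dc in dcs.items() if dc["load"] < dc["capacity"]}
    if not eligible:  # everything saturated: fall back to the least loaded DC
        return min(dcs, key=lambda n: dcs[n]["load"])
    return min(eligible, key=lambda n: eligible[n]["rtt_ms"])

print(pick_dc(DATA_CENTERS))  # eu-west: us-east is closest but over capacity
```

Note how the closest DC (us-east in this made-up example) is skipped because it is over its capacity threshold, which is the combined proximity-plus-load behavior described above.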
Load Balancing
Load balancing refers to efficiently distributing incoming network traffic across a group of backend servers, also known as a server farm or server pool. A Load Balancer (LB) acts as a “traffic cop” sitting in front of your servers, routing client requests across all servers capable of fulfilling them in a manner that maximizes speed and capacity utilization and ensures that no single server is overworked, which could degrade performance. If a server goes down, the load balancer redirects traffic to the remaining online servers; when a new server is added to the group, the load balancer automatically starts sending requests to it. In this manner, a load balancer ensures both efficient utilization and high availability.
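The "traffic cop" behavior can be illustrated with a minimal least-connections balancer. This is a sketch, not any particular LB product; the backend names are made up, and a real LB would also run active health checks.

```python
class LoadBalancer:
    """Minimal sketch: route each request to the healthy backend with the
    fewest in-flight connections; failed backends are skipped."""
    def __init__(self, servers):
        self.conns = {s: 0 for s in servers}
        self.healthy = {s: True for s in servers}

    def route(self):
        live = [s for s in self.conns if self.healthy[s]]
        if not live:
            raise RuntimeError("no healthy backends")
        target = min(live, key=lambda s: self.conns[s])
        self.conns[target] += 1
        return target

    def mark_down(self, server):
        # In practice a health check would flip this automatically.
        self.healthy[server] = False

lb = LoadBalancer(["ad-1", "ad-2", "ad-3"])
print(lb.route())     # ad-1 (all backends tied at zero connections)
lb.mark_down("ad-1")
print(lb.route())     # ad-2: traffic shifts away from the failed server
```

Adding a new server is just another key in the connection map, which mirrors how a real LB starts sending traffic to a freshly registered backend.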
Hardware vs. Software Load Balancing: Load balancers typically come in two flavors: hardware-based and software-based. Vendors of hardware-based solutions load proprietary software onto the machines they provide, which often use specialized processors. Software solutions generally run on commodity hardware, making them less expensive and more flexible. At InMobi we use a mix of hardware and software LBs: Citrix NetScaler ADC (Application Delivery Controller) as the hardware LB, and Nginx and HAProxy as software LBs.
Citrix NetScaler ADC does the dual task of LB and GSLB, extending the core L4 and L7 capabilities across geographically distributed server farms. At InMobi, NetScaler also guarantees 100% uptime in the event of a whole-DC failure by shifting the entire traffic of that DC to an available DC. Nginx is primarily used to cut off our Ad servers and return a No-Ad in case the Ad servers are unable to maintain the strict SLA for any reason. The InMobi Ad server uses HAProxy to talk to internally load-balanced services, databases, and applications.
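The Nginx cutoff behavior amounts to a hard timeout: if the backend misses the budget, answer immediately with an empty No-Ad response rather than keep the caller waiting. A minimal sketch of that pattern (the 50 ms budget and the simulated ad server are illustrative, not InMobi's actual numbers or code):

```python
import concurrent.futures
import time

AD_SLA_SECONDS = 0.05  # illustrative 50 ms budget

def ad_server(latency):
    """Stand-in for the real ad server: takes `latency` seconds to answer."""
    time.sleep(latency)
    return (200, "ad-markup")

def serve(latency):
    """Fail-fast cutoff: give up and answer No-Ad (HTTP 204, empty body)
    the moment the backend exceeds the SLA budget."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(ad_server, latency)
        try:
            return future.result(timeout=AD_SLA_SECONDS)
        except concurrent.futures.TimeoutError:
            return (204, "")  # No-Ad: empty response, still inside SLA

print(serve(0.01))  # (200, 'ad-markup')  backend answered within budget
print(serve(0.20))  # (204, '')           backend too slow, No-Ad returned
```

The key property is that the slow path costs the caller only the SLA budget, not the backend's full latency, which is what keeps a millisecond-level SLA intact even when an ad server degrades.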
Alerting and Monitoring
In high-scale systems like Ad Exchanges, alerting and monitoring become the crux of your system. On our servers, turning on debug logging for even 15 seconds produces log files of up to 3 GB. In such a situation, if you do not have proper monitoring in place, finding any issue becomes near impossible. Alerts are equally important if you want to maintain 100% uptime by catching system degradation before your consumers notice it. The major components we use for alerting and monitoring are the following:
Graphite: Graphite is the backbone of our alerting and monitoring infrastructure. It is an enterprise-ready monitoring tool that runs equally well on cheap hardware or cloud infrastructure. Teams use Graphite to track the performance of their applications, services, and networked servers. It provides a new generation of monitoring tools, making it easier than ever to store, retrieve, share, and visualize time-series data.
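Feeding a metric into Graphite is pleasantly simple: its plaintext protocol accepts lines of the form `metric.path value unix_timestamp` over TCP (port 2003 by default). A sketch, with a hypothetical host name and metric path:

```python
import socket
import time

GRAPHITE_HOST, GRAPHITE_PORT = "graphite.example.com", 2003  # hypothetical host

def graphite_line(path, value, timestamp=None):
    # Graphite plaintext protocol: "metric.path value unix_timestamp\n"
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"

def send_metric(path, value):
    """Open a connection and push one metric line to Graphite's Carbon listener."""
    line = graphite_line(path, value)
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=1) as sock:
        sock.sendall(line.encode("ascii"))

print(graphite_line("exchange.adserver.qps", 101234, 1500000000), end="")
# exchange.adserver.qps 101234 1500000000
```

In practice, high-QPS services batch and aggregate metrics locally (e.g. via a StatsD-style relay) rather than opening a connection per data point.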
Grafana: We use Grafana as the dashboarding platform for all applications and services on top of Graphite. Grafana includes a built-in Graphite query parser that takes writing Graphite metric expressions to a whole new level; expressions in Grafana are easier to read and faster to edit than ever.
Nagios: Nagios is a powerful tool that provides instant awareness of your organization’s mission-critical IT infrastructure. It allows you to detect and repair problems, and to mitigate future issues before they affect end users and customers. We configure Nagios to connect to Graphite to monitor critical infrastructure components, including system metrics, network protocols, applications, services, servers, and network infrastructure.
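A Nagios check of a Graphite metric boils down to the standard plugin contract: print a one-line status and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). A sketch of such a threshold check; the QPS values and thresholds are illustrative, and a real plugin would fetch the value from Graphite's render API:

```python
# Nagios plugin convention: exit codes 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_threshold(value, warn, crit, label="qps"):
    """Nagios-style check: compare a metric value (e.g. one fetched from
    Graphite) against warning/critical thresholds."""
    if value >= crit:
        return CRITICAL, f"CRITICAL - {label}={value} (>= {crit})"
    if value >= warn:
        return WARNING, f"WARNING - {label}={value} (>= {warn})"
    return OK, f"OK - {label}={value}"

code, message = check_threshold(95000, warn=90000, crit=110000)
print(message)  # WARNING - qps=95000 (>= 90000)
# A real plugin would print this line and then call sys.exit(code).
```

Nagios schedules such checks periodically, and a WARNING or CRITICAL result is what ultimately triggers the alert pipeline described next.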
PagerDuty: PagerDuty’s digital operations management platform empowers teams to proactively mitigate customer-impacting issues by automatically turning any signal into the right insight and action. At InMobi, all the alerts raised by Nagios are fed into PagerDuty, which helps us resolve on-call issues quickly. PagerDuty also handles on-call rotations and ensures proper escalation policies are followed for every incident.
I hope you enjoyed reading this blog if you have made it this far. Leave behind your comments and valuable suggestions, and do mention which components you would like me to dig deeper into. The second blog in the series, which focuses on the overall System Design and Architecture of the InMobi Exchange (IX), can be accessed here: Running Ad Exchange at 100K+ QPS - System Design.