Optimizing Networks for Billions: Scaling Efficiency and Speed in AdTech

To recap from my previous post, we process approximately 250 billion transactions daily, responding within 40-50ms. Of these, 220-230 billion are requests to our bidding endpoints, where we decide whether or not to bid. Though we only convert a fraction of a percent into actual ads, this system favors scale, low latency, and cost efficiency over completeness, which lets us skip a small percentage of bid requests: if we don't process 0.1% of bid requests in any given time period, that's largely fine.

Our pixel stack, responsible for impressions, clicks, conversions, and site events, handles tens of billions of transactions daily. Here, we prioritize completeness over latency and cost. While our software requirements differ between stacks, our network stack supports both.

Bidding Stack Network Architecture

Our bidding stack operates across four regions: US West, US East, Europe, and APAC, chosen for proximity to our markets and alignment with supply partners. We process far more traffic from supply partners (~220 billion requests daily) than from internet users (~10-20 billion requests daily).


AWS DirectConnect as a concept

We run our software on AWS, but AWS egress costs are high. To keep them down, we operate our own backbone that connects all of our AWS datacenters while sitting outside AWS infrastructure, and we bring a dedicated physical connection from that backbone into the AWS network. The facility that makes this possible is AWS DirectConnect: a dedicated link into AWS that bypasses the public internet. For a high-scale network operation, a setup like this can save millions.
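
To put rough numbers on the savings, here is a minimal back-of-envelope sketch in Python. The rates and traffic volume below are hypothetical placeholders, not our actual contract prices; real savings depend on committed bandwidth, port fees, and negotiated transit costs.

```python
# Back-of-envelope comparison of standard AWS internet egress vs. egress over
# a dedicated link such as DirectConnect. Every figure below is a hypothetical
# placeholder for illustration only -- plug in your own contracted rates.

INTERNET_EGRESS_PER_GB = 0.09   # assumed on-demand internet egress rate, $/GB
DEDICATED_EGRESS_PER_GB = 0.02  # assumed rate over the dedicated link, $/GB
PORT_FEES_PER_MONTH = 20_000.0  # assumed fixed cost of ports + backbone, $/month

monthly_egress_gb = 2_000_000   # assumed monthly egress volume, GB

internet_cost = monthly_egress_gb * INTERNET_EGRESS_PER_GB
dedicated_cost = monthly_egress_gb * DEDICATED_EGRESS_PER_GB + PORT_FEES_PER_MONTH

print(f"Internet egress: ${internet_cost:,.0f}/month")
print(f"Dedicated link:  ${dedicated_cost:,.0f}/month")
print(f"Annual savings:  ${(internet_cost - dedicated_cost) * 12:,.0f}")
```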

Peering: A Shortcut Through the Network Maze

One of the fascinating challenges in network design is reducing the number of hops between a source and a destination. Each hop, or intermediate router, adds both latency and cost, which can accumulate quickly in a system as large as ours. Ideally, you'd want traffic between partners to flow as directly as possible, without bouncing across multiple networks. But the internet, by its nature, doesn't always make this easy. IP routing tends to follow paths determined by factors like network congestion or peering agreements between internet providers, often causing packets to take unnecessarily long routes.

Enter Peering—an elegant solution to this problem. Peering allows us to bypass the chaotic routing of the open internet. Instead of sending packets through random intermediary hops, peering enables us to define specific routes—often as direct as a single hop from source to destination. Imagine drawing a straight line on a map between two points instead of zig-zagging through detours.

How does this work? When both we and our partners advertise our IP prefixes within a peering exchange, such as those indexed in PeeringDB, requests between our systems are routed directly along the shortest path. If both parties are in the same peering exchange, the transmission happens across a single hop: source → peering exchange → destination. It's as if we've built our own express lane for network traffic, avoiding the stoplights and traffic jams of the broader internet.
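
As a concrete illustration, PeeringDB exposes a public REST API listing the exchanges each network (ASN) is present at, which is handy for spotting where you and a partner could peer directly. The sketch below assumes the `/api/netixlan` endpoint and its `ix_id`/`name` fields behave as documented; the ASNs are placeholders, not ours.

```python
# Minimal sketch: find internet exchanges shared between two networks using
# PeeringDB's public API. The ASNs below are placeholders, and the field names
# (ix_id, name) are assumed from PeeringDB's documented netixlan objects.
import requests

def exchanges_for_asn(asn: int) -> dict[int, str]:
    """Return {exchange_id: exchange_name} for every IX the ASN is present at."""
    resp = requests.get(
        "https://www.peeringdb.com/api/netixlan",
        params={"asn": asn},
        timeout=10,
    )
    resp.raise_for_status()
    return {rec["ix_id"]: rec["name"] for rec in resp.json()["data"]}

our_ixes = exchanges_for_asn(64500)      # placeholder ASN for "us"
partner_ixes = exchanges_for_asn(64501)  # placeholder ASN for a partner

for ix_id in set(our_ixes) & set(partner_ixes):
    print(f"Shared exchange: {our_ixes[ix_id]} (id={ix_id})")
```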


Before Peering


After Peering

This isn’t just about reducing latency—though that’s a huge benefit. The fewer hops your traffic has to make, the fewer middlemen are involved, significantly reducing network transmission costs. For a system like ours, processing billions of transactions per day, the savings in both time and money are enormous. This elegant solution transforms what could be a chaotic, multi-hop network mess into a controlled, near-instantaneous communication channel.

By combining AWS DirectConnect, which reduces the cost of pulling data into our AWS environment, and strategic peering arrangements, we’ve optimized our network for both efficiency and scale. It’s a powerful, low-latency, cost-effective way to ensure that our systems remain responsive, even when handling an enormous volume of transactions.

Load Balancing

In the bare-metal world, we used Keepalived for load balancing at the border router. In AWS, we use AWS Route53, which supports up to 8 IP addresses per DNS entry. This setup balances traffic across our datacenters. While AWS supports load balancing through its own products, we found them too expensive for our use case. Instead, we built our own load balancing architecture using Nginx.
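
From the client's point of view, DNS-based balancing is simple: one hostname resolves to several A records, and connections get spread across whichever addresses come back. Here's a rough Python sketch of that behaviour; the hostname is a placeholder, not one of our real endpoints.

```python
# Sketch of DNS-based load balancing from the client side: one hostname
# resolves to several A records (as with a multi-value Route53 entry), and
# connections rotate across them. "bidder.example.com" is a placeholder.
import itertools
import socket

def resolve_all(hostname: str, port: int) -> list[str]:
    """Return every IPv4 address the hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

addresses = resolve_all("bidder.example.com", 443)
round_robin = itertools.cycle(addresses)

# Each new connection picks the next address, so load spreads evenly across
# however many records the DNS answer contained (up to 8 with Route53).
for _ in range(5):
    print("connecting to", next(round_robin))
```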

In the pixel stack, we stick to a standard Layer 7 architecture. However, in our bidding stack, we use Layer 4 load balancing. This means we can’t use Layer 7 features like routing by URL patterns, but it saves a ton of CPU and memory, making the load balancers lightning fast and cost-efficient.
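
To make that trade-off concrete: a Layer 4 balancer never parses HTTP at all, it just shuttles bytes between sockets, which is why it needs so little CPU and memory. Below is a stripped-down asyncio sketch of the idea; it is not our actual Nginx setup, and the backend address is a placeholder.

```python
# Toy Layer 4 (TCP) forwarder: bytes are copied between client and backend
# without ever inspecting the HTTP payload. Illustrative sketch only, not our
# production Nginx config; the backend address below is a placeholder.
import asyncio

BACKEND = ("10.0.0.10", 8080)  # placeholder backend address

async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes in one direction until the source closes."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_client(client_r: asyncio.StreamReader, client_w: asyncio.StreamWriter) -> None:
    backend_r, backend_w = await asyncio.open_connection(*BACKEND)
    # Forward both directions concurrently; no request parsing happens.
    await asyncio.gather(pipe(client_r, backend_w), pipe(backend_r, client_w))

async def main() -> None:
    server = await asyncio.start_server(handle_client, "0.0.0.0", 8443)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```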

Optimizing Communication: UDP over TCP


Our bidding stack consists of three layers: a load balancer, a Mux layer for operational tasks, and multiple bidders for bid computation. To minimize network costs and hops, we use AWS Placement Groups to keep these machines physically close together.
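
For reference, a placement group is just an EC2-level construct. A minimal boto3 sketch of creating a cluster placement group and launching instances into it follows; the group name, AMI ID, and instance type are placeholders, and this is not our exact provisioning code.

```python
# Minimal sketch: create a cluster placement group and launch instances into
# it so they land physically close together (lower latency, fewer hops).
# Group name, AMI ID, and instance type are placeholders, not our real values.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

ec2.create_placement_group(
    GroupName="bidding-stack-demo",   # placeholder name
    Strategy="cluster",               # pack instances close together
)

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="c5.2xlarge",        # placeholder instance type
    MinCount=2,
    MaxCount=2,
    Placement={"GroupName": "bidding-stack-demo"},
)
```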


UDP cuts out all the handshake overhead from TCP

To optimise things further, we use UDP instead of TCP for internal communication, eliminating the overhead of handshakes and retransmissions. This significantly improves speed while maintaining an acceptable error rate. We've also fine-tuned UDP to handle larger packets more efficiently, enabling us to break up and reassemble data with minimal overhead.
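
The sketch below illustrates the general idea of chunking a payload over UDP with a tiny sequence header and reassembling it on the other side. It's a simplified illustration with made-up framing, not our production wire format, and it ignores loss handling entirely.

```python
# Illustrative UDP framing: split a payload into chunks with a small
# (message_id, seq, total) header and reassemble on receipt. Simplified
# sketch only: a lost chunk here simply means a dropped message.
import socket
import struct

HEADER = struct.Struct("!IHH")  # message_id, chunk_seq, total_chunks
CHUNK_SIZE = 1200               # stay under a typical MTU to avoid IP fragmentation

def send_message(sock: socket.socket, addr: tuple[str, int], message_id: int, payload: bytes) -> None:
    chunks = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)] or [b""]
    for seq, chunk in enumerate(chunks):
        sock.sendto(HEADER.pack(message_id, seq, len(chunks)) + chunk, addr)

def receive_message(sock: socket.socket) -> bytes:
    """Collect chunks for one message and return the reassembled payload."""
    parts: dict[int, bytes] = {}
    total = None
    while total is None or len(parts) < total:
        datagram, _ = sock.recvfrom(65535)
        message_id, seq, total = HEADER.unpack_from(datagram)
        parts[seq] = datagram[HEADER.size:]
    return b"".join(parts[i] for i in range(total))

# Usage sketch over loopback: one socket sends, another reassembles.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_message(tx, rx.getsockname(), message_id=1, payload=b"x" * 5000)
print(len(receive_message(rx)))  # 5000
```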

Put together, these optimisations save upwards of $5M a year compared to conventional designs and architectures. What optimisations have you been making on your end?
