How Confluent acquired WarpStream for $220m after just 13 months of operation

In August of 2023, WarpStream shook up the Kafka industry by announcing a novel Kafka-API compatible cloud-native implementation that used no disks.

Instead? It used S3.

The announcement went viral in a Hacker News post titled “Kafka is Dead, Long Live Kafka!”.

Just a year and a month later - on September 9, 2024 - Confluent acquired them for $220M (!).

Why did they do that?

WarpStream’s innovative architecture gave them two major advantages that nobody could compete with:

  • massive cost savings
  • massive operational simplification

The only drawback?

Latency was high.

  • p99 write latency: ~400ms
  • p99 end-to-end latency (from write to read): ~1 second

Since WarpStream writes directly to S3 - and has to buffer writes so that S3 PUT costs don't explode - it inherently suffers from higher latency.
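Here's a minimal sketch of that trade-off in Go (my illustration, not WarpStream's actual code - the 4 MiB threshold and 250ms interval are made-up numbers): buffer incoming records and issue one S3 PUT per batch, once the buffer is big enough or a timer fires.

```go
package main

import (
	"fmt"
	"time"
)

// batcher buffers records and flushes them as a single S3 object once the
// buffer is large enough or the flush interval elapses. Fewer, larger PUTs
// mean lower S3 request costs - at the price of added produce latency.
type batcher struct {
	in       chan []byte
	maxBytes int
	interval time.Duration
	buf      [][]byte
	bufBytes int
}

func (b *batcher) run() {
	ticker := time.NewTicker(b.interval)
	defer ticker.Stop()
	for {
		select {
		case rec := <-b.in:
			b.buf = append(b.buf, rec)
			b.bufBytes += len(rec)
			if b.bufBytes >= b.maxBytes {
				b.flush("size threshold reached")
			}
		case <-ticker.C:
			if len(b.buf) > 0 {
				b.flush("flush interval elapsed")
			}
		}
	}
}

// flush stands in for a single s3:PutObject call carrying the whole batch.
func (b *batcher) flush(reason string) {
	fmt.Printf("PUT %d records (%d bytes): %s\n", len(b.buf), b.bufBytes, reason)
	b.buf, b.bufBytes = nil, 0
}

func main() {
	b := &batcher{
		in:       make(chan []byte, 1024),
		maxBytes: 4 << 20,                // hypothetical target object size
		interval: 250 * time.Millisecond, // hypothetical flush interval - this wait is where the latency comes from
	}
	go b.run()
	for i := 0; i < 10; i++ {
		b.in <- []byte("some record payload")
	}
	time.Sleep(time.Second) // let the ticker fire once before exiting
}
```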

This wasn’t a problem, though - they had one key finding.

Kafka workloads are either:

  • high volume and latency-insensitive
  • low volume and latency-sensitive

It’s precisely the high volume workloads that cost a fortune.

Customers were happy to make the trade-off: increase latency, but save costs.

The cost savings were indeed the juicy part:

sheet link:

Here I compare a Kafka cluster with:

  • 1 GB/s Produce Inbound
  • 3 GB/s Consume Outbound
  • 7 day retention on disk

using retail AWS prices.

WarpStream is fundamentally:

  • 10x cheaper than an unoptimized deployment ($500k vs $5.25M)
  • 4x cheaper than an optimized deployment ($500k vs $2M)

We're talking ~$500k versus $2M, per year.

The right architecture can be the difference between millions of dollars a year in infrastructure costs.



Like this in-depth cost analysis so far?

A lot more is to come. Make sure to follow me on all platforms so you don't miss a beat.


Where do WarpStream's savings come from?

Network costs.

The cross-zone charges you incur in a regular Kafka deployment are its largest expense at high volume.

They can be 80% of the total cost!


Even in the optimized deployment, there are major charges from:

  • producers writing to leader brokers in other availability zones
  • brokers replicating to each other across zones

A less-optimized Kafka deployment can cost you more than twice as much - $5.2M/yr.

A large chunk of that comes from EBS disks (no tiered storage) and consumer networking.


Anyway. If you optimize Kafka as much as possible, you’ll get to a ~$2.1M annual cost.

Out of that - $1.68M (80%) are UNAVOIDABLE network costs.
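That figure checks out with simple arithmetic. A quick sanity check (my own back-of-the-envelope math, assuming the usual $0.01/GB in each direction - $0.02/GB effective - for cross-AZ traffic, replication factor 3, and producers spread evenly across 3 zones):

```go
package main

import "fmt"

func main() {
	const (
		produceGBps  = 1.0             // 1 GB/s produce inbound, per the comparison above
		secondsPerYr = 365 * 24 * 3600 // ~31.5M seconds
		crossAZPerGB = 0.02            // assumed: $0.01/GB out + $0.01/GB in for cross-AZ traffic
	)

	yearlyGB := produceGBps * secondsPerYr // ~31.5M GB ingested per year

	// With replication factor 3, every byte is copied to 2 other zones.
	replication := 2 * yearlyGB * crossAZPerGB

	// With 3 zones and zone-unaware producers, 2/3 of produce traffic
	// lands on a leader broker in another zone.
	producerHop := (2.0 / 3.0) * yearlyGB * crossAZPerGB

	fmt.Printf("replication:   $%.2fM/yr\n", replication/1e6)               // ~$1.26M
	fmt.Printf("producer hop:  $%.2fM/yr\n", producerHop/1e6)               // ~$0.42M
	fmt.Printf("total network: $%.2fM/yr\n", (replication+producerHop)/1e6) // ~$1.68M
}
```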

But WarpStream avoids them. It drives all of these costs to zero:

  • every agent can be a leader for every partition, so producers write to agents in their own zone, removing that cross-zone charge (see the zone-matching sketch below)
  • broker replication doesn’t exist, so that charge is zero (replication happens inside S3 for free)
  • consumers can read from agents in the same zone, just like Kafka’s follower fetching
  • no disks are necessary at all (everything lives in S3)
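Here's a toy sketch of the zone-matching idea (illustrative only - the agent addresses and the discovery mechanism are invented): since any agent can serve any partition, the client simply filters the agent list by its own availability zone.

```go
package main

import "fmt"

// agent is a stateless WarpStream-style data-plane node.
type agent struct {
	addr string
	zone string
}

// pickAgent prefers an agent in the client's own zone so the produce hop
// never crosses an AZ boundary (and never incurs a cross-zone charge).
// Because any agent can handle any partition, a same-zone choice is always
// available as long as at least one agent runs in each zone.
func pickAgent(agents []agent, clientZone string) agent {
	for _, a := range agents {
		if a.zone == clientZone {
			return a
		}
	}
	return agents[0] // fallback: cross-zone, but still correct
}

func main() {
	agents := []agent{
		{addr: "10.0.1.5:9092", zone: "us-east-1a"},
		{addr: "10.0.2.5:9092", zone: "us-east-1b"},
		{addr: "10.0.3.5:9092", zone: "us-east-1c"},
	}
	fmt.Println(pickAgent(agents, "us-east-1b").addr) // 10.0.2.5:9092
}
```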

They offered it through a hybrid BYOC model, where:

  • the customer hosts the stateless agents in their cloud
  • WarpStream hosts a SaaS control plane

How?

This is where the operational simplicity comes in.

The secret sauce is the control plane.

All the complex logic lives inside the control plane.

It’s essentially a sequencer that leverages DynamoDB to hand each agent the offsets for each partition it wants to write to.
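To make that concrete, here's a rough sketch of what such a sequencer could look like (my illustration, not WarpStream's actual schema - the table and attribute names are invented): a DynamoDB `ADD` update acts as an atomic counter, handing out non-overlapping offset ranges to concurrent agents without any coordination between them.

```go
package main

import (
	"context"
	"fmt"
	"strconv"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// claimOffsets atomically reserves `count` offsets for a partition and returns
// the first offset of the claimed range. DynamoDB's ADD action is an atomic
// counter, so concurrent agents can never receive overlapping ranges.
func claimOffsets(ctx context.Context, db *dynamodb.Client, partition string, count int64) (int64, error) {
	out, err := db.UpdateItem(ctx, &dynamodb.UpdateItemInput{
		TableName: aws.String("partition_offsets"), // hypothetical table name
		Key: map[string]types.AttributeValue{
			"partition": &types.AttributeValueMemberS{Value: partition},
		},
		// ADD increments atomically and creates the attribute if it's missing.
		UpdateExpression: aws.String("ADD nextOffset :n"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":n": &types.AttributeValueMemberN{Value: strconv.FormatInt(count, 10)},
		},
		ReturnValues: types.ReturnValueUpdatedNew,
	})
	if err != nil {
		return 0, err
	}
	next, err := strconv.ParseInt(out.Attributes["nextOffset"].(*types.AttributeValueMemberN).Value, 10, 64)
	if err != nil {
		return 0, err
	}
	return next - count, nil // first offset of the range we just claimed
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	db := dynamodb.NewFromConfig(cfg)

	start, err := claimOffsets(context.Background(), db, "topic-a/0", 500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("claimed offsets [%d, %d)\n", start, start+500)
}
```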

This sequencer design lets them keep the agents dumb and stateless. Their write flow is roughly:

  1. accept produce requests and batch the data
  2. persist the data in S3
  3. commit the offsets to the control plane
  4. the control plane commits the offsets and S3 file references to DynamoDB
  5. the agent acks the request to the producer

All of the agents' state comes from the control plane, and all of the complex offset synchronization happens there.
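Putting it together, an agent's produce path could look roughly like this (a compressed sketch under the same assumptions - the control plane is stubbed in memory and the S3 upload is elided):

```go
package main

import "fmt"

// fileRef records which S3 object holds which offset range of a partition.
type fileRef struct {
	s3Key       string
	partition   string
	baseOffset  int64
	recordCount int64
}

// controlPlane stands in for WarpStream's SaaS control plane: it assigns
// offsets and durably records S3 file references (in reality: in DynamoDB).
type controlPlane struct {
	next map[string]int64
	log  []fileRef
}

// commitFile assigns the next offsets for the partition and records the file.
func (cp *controlPlane) commitFile(partition, s3Key string, n int64) int64 {
	base := cp.next[partition] // in reality: the atomic DynamoDB counter sketched above
	cp.next[partition] += n
	cp.log = append(cp.log, fileRef{s3Key: s3Key, partition: partition, baseOffset: base, recordCount: n})
	return base
}

// produce mirrors the write flow above: batch -> S3 -> commit -> ack.
func produce(cp *controlPlane, partition string, records [][]byte) {
	// 1. the produce request was accepted and batched (see the batching sketch earlier)
	// 2. persist the batch as a single S3 object (elided; we just pick a key)
	key := fmt.Sprintf("batches/%s/batch-0001", partition)
	// 3+4. commit the file to the control plane, which assigns the offsets
	//      and stores the S3 file reference
	base := cp.commitFile(partition, key, int64(len(records)))
	// 5. only now ack the producer: data is durable in S3 and indexed centrally
	fmt.Printf("acked %d records at offsets [%d, %d)\n", len(records), base, base+int64(len(records)))
}

func main() {
	cp := &controlPlane{next: map[string]int64{}}
	produce(cp, "topic-a/0", [][]byte{[]byte("r1"), []byte("r2"), []byte("r3")})
}
```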


Agents can therefore scale up and down effortlessly, like nginx.

Ingenious design. Literally worth hundreds of millions of dollars. All within a year.

Kudos to the team!

source: https://investors.confluent.io/node/10746/html



Liked this article?

It took me hours to research and write.

I ask you for one favor (it'll take you 2 seconds):

  • Share it with your network!

