How Confluent acquired WarpStream for $220m after just 13 months of operation

In August of 2023, WarpStream shook up the Kafka industry by announcing a novel Kafka-API compatible cloud-native implementation that used no disks.

Instead? It used S3.

The announcement went viral in a Hacker News post titled “Kafka is Dead, Long Live Kafka!”.

Just a year and a month later - on September 9, 2024 - Confluent acquired them for $220M (!).

Why did they do that?

WarpStream’s innovative architecture gave them two major advantages that nobody could compete with:

  • massive cost savings
  • massive operational simplification

The only drawback?

Latency was high.

  • p99 write latency: ~400ms
  • p99 end-to-end latency (from write to read): ~1 second

Since WarpStream writes directly to S3 - and has to buffer writes so that S3 PUT costs don't explode - it inherently suffers from higher latency.
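Here's a minimal sketch of that trade-off in Go (my illustration, not WarpStream's actual code - the 4 MiB threshold and 250ms interval are made-up numbers): buffer incoming records and issue one S3 PUT per batch, once the buffer is big enough or a timer fires.

```go
package main

import (
	"fmt"
	"time"
)

// batcher buffers records and flushes them as a single S3 object once the
// buffer is large enough or the flush interval elapses. Fewer, larger PUTs
// mean lower S3 request costs - at the price of added produce latency.
type batcher struct {
	in       chan []byte
	maxBytes int
	interval time.Duration
	buf      [][]byte
	bufBytes int
}

func (b *batcher) run() {
	ticker := time.NewTicker(b.interval)
	defer ticker.Stop()
	for {
		select {
		case rec := <-b.in:
			b.buf = append(b.buf, rec)
			b.bufBytes += len(rec)
			if b.bufBytes >= b.maxBytes {
				b.flush("size threshold reached")
			}
		case <-ticker.C:
			if len(b.buf) > 0 {
				b.flush("flush interval elapsed")
			}
		}
	}
}

// flush stands in for a single s3:PutObject call carrying the whole batch.
func (b *batcher) flush(reason string) {
	fmt.Printf("PUT %d records (%d bytes): %s\n", len(b.buf), b.bufBytes, reason)
	b.buf, b.bufBytes = nil, 0
}

func main() {
	b := &batcher{
		in:       make(chan []byte, 1024),
		maxBytes: 4 << 20,                // hypothetical target object size
		interval: 250 * time.Millisecond, // hypothetical flush interval - this wait is where the latency comes from
	}
	go b.run()
	for i := 0; i < 10; i++ {
		b.in <- []byte("some record payload")
	}
	time.Sleep(time.Second) // let the ticker fire once before exiting
}
```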

This wasn’t a problem, though - they had one key finding.

Kafka workloads are either:

  • high volume and latency-insensitive
  • low volume and latency-sensitive

It’s precisely the high volume workloads that cost a fortune.

Customers were happy to make the trade-off: increase latency, but save costs.

The cost savings were indeed the juicy part:

sheet link:

Here I compare a Kafka cluster with:

  • 1 GB/s Produce Inbound
  • 3 GB/s Consume Outbound
  • 7 day retention on disk

using retail AWS prices.

WarpStream is fundamentally:

  • 10x cheaper than an unoptimized deployment ($500k vs $5.25M)
  • 4x cheaper than an optimized deployment ($500k vs $2M)

We're talking ~$500k versus $2M, per year.

The right architecture can be the difference between millions of dollars a year in infrastructure costs.



Like this in-depth cost analysis so far?

A lot more is to come. Make sure to follow me on all platforms so you don't miss a beat.


Where do WarpStream's savings come from?

Network costs.

The cross-zone charges you incur in a regular Kafka deployment are its largest expense at high volume.

They can be 80% of the total cost!


Even in the optimized deployment, there are major charges from:

  • producers writing to leader brokers in other availability zones
  • brokers replicating to each other across zones

A less-optimized Kafka deployment can cost you more than twice as much - $5.2M/yr.

A large chunk of that comes from EBS disks (no tiered storage) and consumer networking.


Anyway. If you optimize Kafka as much as possible, you’ll get to a ~$2.1M annual cost.

Out of that - $1.68M (80%) are UNAVOIDABLE network costs.
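That figure checks out with simple arithmetic. A quick sanity check (my own back-of-the-envelope math, assuming the usual $0.01/GB in each direction - $0.02/GB effective - for cross-AZ traffic, replication factor 3, and producers spread evenly across 3 zones):

```go
package main

import "fmt"

func main() {
	const (
		produceGBps  = 1.0             // 1 GB/s produce inbound, per the comparison above
		secondsPerYr = 365 * 24 * 3600 // ~31.5M seconds
		crossAZPerGB = 0.02            // assumed: $0.01/GB out + $0.01/GB in for cross-AZ traffic
	)

	yearlyGB := produceGBps * secondsPerYr // ~31.5M GB ingested per year

	// With replication factor 3, every byte is copied to 2 other zones.
	replication := 2 * yearlyGB * crossAZPerGB

	// With 3 zones and zone-unaware producers, 2/3 of produce traffic
	// lands on a leader broker in another zone.
	producerHop := (2.0 / 3.0) * yearlyGB * crossAZPerGB

	fmt.Printf("replication:   $%.2fM/yr\n", replication/1e6)               // ~$1.26M
	fmt.Printf("producer hop:  $%.2fM/yr\n", producerHop/1e6)               // ~$0.42M
	fmt.Printf("total network: $%.2fM/yr\n", (replication+producerHop)/1e6) // ~$1.68M
}
```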

But WarpStream avoids them. It drives all of these costs to zero:

  • every agent can be a leader for every partition, so producers write to agents in their own zone, removing that cross-zone charge (see the zone-matching sketch below)
  • broker replication doesn’t exist, so that charge is zero (replication happens inside S3 for free)
  • consumers can read from agents in the same zone, just like Kafka’s follower fetching
  • no disks are necessary at all (everything lives in S3)
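Here's a toy sketch of the zone-matching idea (illustrative only - the agent addresses and the discovery mechanism are invented): since any agent can serve any partition, the client simply filters the agent list by its own availability zone.

```go
package main

import "fmt"

// agent is a stateless WarpStream-style data-plane node.
type agent struct {
	addr string
	zone string
}

// pickAgent prefers an agent in the client's own zone so the produce hop
// never crosses an AZ boundary (and never incurs a cross-zone charge).
// Because any agent can handle any partition, a same-zone choice is always
// available as long as at least one agent runs in each zone.
func pickAgent(agents []agent, clientZone string) agent {
	for _, a := range agents {
		if a.zone == clientZone {
			return a
		}
	}
	return agents[0] // fallback: cross-zone, but still correct
}

func main() {
	agents := []agent{
		{addr: "10.0.1.5:9092", zone: "us-east-1a"},
		{addr: "10.0.2.5:9092", zone: "us-east-1b"},
		{addr: "10.0.3.5:9092", zone: "us-east-1c"},
	}
	fmt.Println(pickAgent(agents, "us-east-1b").addr) // 10.0.2.5:9092
}
```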

They offered it through a hybrid BYOC model, where:

  • the customer hosts the stateless agents in their cloud
  • WarpStream hosts a SaaS control plane

How?

This is where the operational simplicity comes in.

The secret sauce is the control plane.

All the complex logic lives inside the control plane.

It’s essentially a sequencer that leverages DynamoDB to hand each agent the offsets for each partition it wants to write to.
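To make that concrete, here's a rough sketch of what such a sequencer could look like (my illustration, not WarpStream's actual schema - the table and attribute names are invented): a DynamoDB `ADD` update acts as an atomic counter, handing out non-overlapping offset ranges to concurrent agents without any coordination between them.

```go
package main

import (
	"context"
	"fmt"
	"strconv"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb"
	"github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

// claimOffsets atomically reserves `count` offsets for a partition and returns
// the first offset of the claimed range. DynamoDB's ADD action is an atomic
// counter, so concurrent agents can never receive overlapping ranges.
func claimOffsets(ctx context.Context, db *dynamodb.Client, partition string, count int64) (int64, error) {
	out, err := db.UpdateItem(ctx, &dynamodb.UpdateItemInput{
		TableName: aws.String("partition_offsets"), // hypothetical table name
		Key: map[string]types.AttributeValue{
			"partition": &types.AttributeValueMemberS{Value: partition},
		},
		// ADD increments atomically and creates the attribute if it's missing.
		UpdateExpression: aws.String("ADD nextOffset :n"),
		ExpressionAttributeValues: map[string]types.AttributeValue{
			":n": &types.AttributeValueMemberN{Value: strconv.FormatInt(count, 10)},
		},
		ReturnValues: types.ReturnValueUpdatedNew,
	})
	if err != nil {
		return 0, err
	}
	next, err := strconv.ParseInt(out.Attributes["nextOffset"].(*types.AttributeValueMemberN).Value, 10, 64)
	if err != nil {
		return 0, err
	}
	return next - count, nil // first offset of the range we just claimed
}

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	db := dynamodb.NewFromConfig(cfg)

	start, err := claimOffsets(context.Background(), db, "topic-a/0", 500)
	if err != nil {
		panic(err)
	}
	fmt.Printf("claimed offsets [%d, %d)\n", start, start+500)
}
```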

This sequencer design lets them keep the agents dumb and stateless. Their write flow is roughly:

  1. accept produce requests and batch the data
  2. persist the data in S3
  3. commit the offsets to the control plane
  4. the control plane commits the offsets and S3 file references to DynamoDB
  5. the agent acks the request to the producer

All of the agents' state comes from the control plane, and all of the complex offset synchronization happens there.
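Putting it together, an agent's produce path could look roughly like this (a compressed sketch under the same assumptions - the control plane is stubbed in memory and the S3 upload is elided):

```go
package main

import "fmt"

// fileRef records which S3 object holds which offset range of a partition.
type fileRef struct {
	s3Key       string
	partition   string
	baseOffset  int64
	recordCount int64
}

// controlPlane stands in for WarpStream's SaaS control plane: it assigns
// offsets and durably records S3 file references (in reality: in DynamoDB).
type controlPlane struct {
	next map[string]int64
	log  []fileRef
}

// commitFile assigns the next offsets for the partition and records the file.
func (cp *controlPlane) commitFile(partition, s3Key string, n int64) int64 {
	base := cp.next[partition] // in reality: the atomic DynamoDB counter sketched above
	cp.next[partition] += n
	cp.log = append(cp.log, fileRef{s3Key: s3Key, partition: partition, baseOffset: base, recordCount: n})
	return base
}

// produce mirrors the write flow above: batch -> S3 -> commit -> ack.
func produce(cp *controlPlane, partition string, records [][]byte) {
	// 1. the produce request was accepted and batched (see the batching sketch earlier)
	// 2. persist the batch as a single S3 object (elided; we just pick a key)
	key := fmt.Sprintf("batches/%s/batch-0001", partition)
	// 3+4. commit the file to the control plane, which assigns the offsets
	//      and stores the S3 file reference
	base := cp.commitFile(partition, key, int64(len(records)))
	// 5. only now ack the producer: data is durable in S3 and indexed centrally
	fmt.Printf("acked %d records at offsets [%d, %d)\n", len(records), base, base+int64(len(records)))
}

func main() {
	cp := &controlPlane{next: map[string]int64{}}
	produce(cp, "topic-a/0", [][]byte{[]byte("r1"), []byte("r2"), []byte("r3")})
}
```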


Agents can therefore scale up and down effortlessly, like nginx.

Ingenious design. Literally worth hundreds of millions of dollars. All within a year.

Kudos to the team!

source: https://investors.confluent.io/node/10746/html



Liked this article?

It took me hours to research and write.

I ask you for one favor (it'll take you 2 seconds):

  • Share it with your network!

