ZenDesk's Kafka mTLS Setup

Why mTLS?

It’s simply a very appealing way of both:

  • encrypting

and

  • authenticating

your connections.

mTLS is a well-established technology to:

  1. prove the identity of both the sender and recipient of a message.
  2. assert that the message has not been viewed nor modified in transit by a third party.

The problem?

It requires extra infrastructure around it.

In particular - it requires extra infrastructure around Kafka in order to rotate certificates properly.

You NEED to be able to revoke certificates efficiently. This is important in cases where:

  • the certificate leaks (the equivalent of your password being leaked)
  • the certificate is stale enough that it requires a refresh (it is good security practice to refresh as frequently as practical).

Revocation

There are two ways to do this:

  • Certificate revocation list (CRL) - a signed blacklist of certificates that should NOT be used
  • Online Certificate Status Protocol (OCSP) - a protocol (over HTTP) for asking a CA whether a particular certificate has been revoked.

The problem is that these two ways aren’t consistently implemented by Certificate Authorities (CAs) and clients.

Certificate Authority - an organization you trust that tells you X certificate belongs to Y

Revocation Problems

Let's reason about the problems with revoking certificates from a web perspective - your browser and the public web.

In a closed internal system like a Kafka deployment, these problems aren't necessarily as severe. But they determine how widely adopted the protocols and standards are - and that is the real problem. And even in a private setup, there are still a ton of gotchas.

Various problems exist with both, e.g.:

CRL Size

CRLs are basically just lists of all of the certificates that a given Certificate Authority (CA) has issued that have been revoked.

This means that they’re often very large – easily the size of a whole movie.

This then slows down the initial connection due to the required download.
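To put a rough number on it - a back-of-the-envelope sketch (the per-entry size, revocation count, and link speed below are illustrative assumptions, not measurements):

```python
# Rough estimate of a CRL's size and its download cost on first connection.
ENTRY_BYTES = 40            # serial number + revocation time + DER overhead (assumed)
revoked_certs = 10_000_000  # a large CA can accumulate millions of revocations

crl_bytes = revoked_certs * ENTRY_BYTES
crl_mb = crl_bytes / 1_000_000
print(f"CRL size: ~{crl_mb:.0f} MB")  # ~400 MB at these assumptions

link_mbps = 50              # assumed client bandwidth, in megabits per second
download_seconds = crl_bytes * 8 / (link_mbps * 1_000_000)
print(f"Added to the first connection: ~{download_seconds:.0f} s")
```

At these assumed numbers, the list really is the size of a whole movie, and fetching it before the first handshake costs about a minute.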

OCSP Problems

OCSP is like:

"What if there were a separate CRL for every single certificate?"

Your browser can query the status for that particular certificate by contacting your CA's OCSP service. But then we get to three problems:

  • uptime - if the CA's service is down, you get no response. Browsers usually soft-fail in these cases and treat it as a "not revoked" response. Not secure.
  • cache - to reduce the load on the services and the response time, we cache the response for ~a week. This means you can trust a revoked certificate for up to 7 days after it's been revoked.
  • no privacy - with your certificate status request, you're basically sharing your browsing history with the CA.
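The cache problem is worth quantifying. A small sketch of the staleness window (the 7-day TTL mirrors the "~a week" figure above; the dates are made up):

```python
from datetime import datetime, timedelta

def trusted_until(cache_fetched_at: datetime, ttl: timedelta,
                  revoked_at: datetime) -> timedelta:
    """How long a client keeps trusting a certificate after its revocation,
    given it cached a 'good' OCSP response with the given TTL."""
    cache_expires = cache_fetched_at + ttl
    return max(cache_expires - revoked_at, timedelta(0))

# Cached a "good" response on Jan 1 with a 7-day TTL; cert revoked on Jan 2.
fetched = datetime(2024, 1, 1)
revoked = datetime(2024, 1, 2)
window = trusted_until(fetched, timedelta(days=7), revoked)
print(window.days)  # 6 -> almost a week of trusting a revoked certificate
```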

Perhaps due to these reasons, Google Chrome (63% market share) is the only major browser that does NOT support OCSP - it dropped it back in 2012, citing privacy and latency concerns, and implemented its own mechanism instead (CRLSets).

Soft Failing


And it's not an easy problem to solve. For the uptime problem - a naive approach would be to have browsers hard-fail during CA downtime (i.e. if you can't verify the certificate's status, treat the certificate as invalid).

That immediately gives you three problems:

  1. fragility - it changes incentives such that taking down an OCSP service is enough to take down a large part of the internet. DDoS attackers would flock to such a target, which would increase the cost of running a reliable OCSP service and make it less likely that CAs can afford to do so.
  2. captive portals - things like hotel WiFi networks that want you to "login" before you can use the internet. They frequently use HTTPS but don't allow you to access OCSP servers.
  3. random failures - even if the CA has 100% uptime, the public internet is unreliable enough that your request may still get dropped for some reason. That means you'd have random cases where a website just doesn't load.

Unfortunately - there is no out-of-the-box way to do revocation.

It seems like there still isn’t an industry-wide consensus on the proper way to handle certificate revocation.

And because of that, there is a lack of proper tooling and library support for revoking certificates in private mTLS setups.

The result?

Everybody implements mTLS in their own way.

Let’s see how ZenDesk did it!

Stan: There are a lot of interesting problems here. We will surely be posting more about it. Follow me at Stanislav Kozlovski and https://2minutestreaming.com/ to not miss it!

ZenDesk

For a CA, they chose Vault. It offers a PKI (public key infrastructure) backend which generates dynamic X.509 certificates for you.

For the source of truth regarding WHO the CA is, they store it in a globally-replicated key in Consul.

They then introduce two components with separate responsibilities:

  • a PKI Auth Manager sidecar to generate certificates remotely in Vault & store them locally on the node.
  • a TLS Monitor sidecar to watch for these local filesystem changes & make sure Kafka reloads the cert.

As they run in Kubernetes, they basically have a PKI Auth Manager sidecar on the Kafka pod that talks to Consul + Vault to generate a certificate. It then stores it in a local Secrets Volume.

This component also regenerates certificates before they expire, as well as handling miscellaneous things like converting the PEM-formatted certificate (taken from Vault) into the JVM-expected keystore format and emitting audit logs for observability.
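The "regenerate before expiry" logic is essentially a lifetime-threshold check. A minimal sketch - the 80% renewal point is an assumed policy, not necessarily ZenDesk's actual number:

```python
from datetime import datetime, timedelta

def should_renew(not_before: datetime, not_after: datetime,
                 now: datetime, renew_at: float = 0.8) -> bool:
    """Renew once `renew_at` of the certificate's lifetime has elapsed,
    leaving headroom so rotation finishes well before expiry."""
    lifetime = not_after - not_before
    return now >= not_before + lifetime * renew_at

issued = datetime(2024, 1, 1)
expires = issued + timedelta(days=30)      # renewal point lands at day 24
print(should_renew(issued, expires, issued + timedelta(days=10)))  # False
print(should_renew(issued, expires, issued + timedelta(days=25)))  # True
```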

They then have a TLS monitor that watches the Secrets Volume and is responsible for asking Kafka to reload the file.

Kafka allows you to reload its certificates without restarting it through the kafka-configs command, so that’s one simple API call.
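Concretely, that reload goes through Kafka's dynamic broker configs (KIP-226): re-applying the keystore path as a per-broker config prompts the broker to re-read the keystore file behind it. A sketch of the call - the broker id, listener name, and path here are made up:

```
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 0 \
  --alter --add-config \
  'listener.name.internal.ssl.keystore.location=/var/run/secrets/kafka.keystore.jks'
```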

As for the Kafka clients (their applications), they don’t seem to use the TLS monitor - something else reloads their certificates on the client side.

Regardless, the big question for them was:

how do we revoke certificates?

They did not want to use certificate revocation checking, because the JVM didn’t have “strong support” for revocation.

Kafka itself doesn’t support a way to check for revoked certificates via either CRL or OCSP, so their only other option was to enable revocation checking on the whole JVM via CRL.

That’s fine for the brokers - but for their applications, enabling it on the whole JVM means that every public HTTPS call their app makes will also go through the CRL certificate-checking process. They did not want that, considering that some of those calls use public certificates and others (the Kafka ones) private.
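For context, this is why it can't be scoped to just the Kafka connections - JVM revocation checking is toggled with global system properties that affect every TLS connection the process makes, e.g.:

```
# Applies to EVERY TLS connection the JVM makes, not just Kafka's:
-Dcom.sun.net.ssl.checkRevocation=true   # check revocation during certificate path validation
-Dcom.sun.security.enableCRLDP=true      # fetch CRLs from the certificate's CRL distribution points
```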

As we discussed, public certificate checking is widely broken - and trust me, there are many more gotchas.

In the end, they decided to NOT enable certificate revocation checking and simply revoke certificates by rotating the Root CA.

That way, when Kafka loads the new certificate from a different CA, the old one is automatically invalidated. (doesn't matter if it's not expired, the CA is different!)

Given this is a private network and they have two Root CAs that they trust, this is a viable solution.

Easy to implement?

Maybe.

The next problem to solve was:

how do you broadcast any CA changes sufficiently quickly?

They opted for Blocking Queries in Consul.

Those are essentially long-poll HTTP calls, similar to how Kafka consumers have a max fetch wait time. The GET API responds as soon as the key you’re querying changes, or after a minute passes.

This gives them a fast notification mechanism once the CA is changed.
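The blocking-query pattern itself is simple to sketch. Below, `fetch` stands in for Consul's `GET /v1/kv/<key>?index=<n>&wait=1m` call; the fake transport and key values are made up for illustration:

```python
def watch(fetch, on_change, max_polls):
    """Long-poll loop in the style of a Consul blocking query.

    `fetch(index)` blocks server-side until the watched key changes or a
    timeout (~1 minute) passes, then returns (value, new_index).
    """
    index, last_value = 0, None
    for _ in range(max_polls):
        value, index = fetch(index)
        if value != last_value:  # a timed-out poll returns the same value
            on_change(value)
            last_value = value

# Fake transport: the CA key starts as "ca-A", one poll times out with no
# change, then the key flips to "ca-B".
responses = iter([("ca-A", 1), ("ca-A", 1), ("ca-B", 2)])
seen = []
watch(lambda index: next(responses), seen.append, max_polls=3)
print(seen)  # ['ca-A', 'ca-B']
```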

Certificate Rotation

Now for the actual rotation!

How do you do it without downtime in the applications?

It’s not a simple switch - you have to be more methodical.

Their PKI Auth Manager and TLS Monitor work in tandem to rotate the certificate.

Changing the CA from root A to root B without downtime involves 3 important steps:

0. A is the CA. (no action)

  1. Add B as a secondary CA.
  2. Swap A and B, so that B is now the primary CA and A the secondary.
  3. Remove A as the secondary CA.

In each step, a new certificate is generated and used.
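Why three steps? Because during a rollout, peers are temporarily one step apart, and each side must still trust the other's issuer. A small simulation of the trust sets (a sketch, not ZenDesk's code):

```python
# Each step: (trusted root CAs, the CA that issues new certificates).
STEPS = [
    ({"A"}, "A"),       # 0. A is the only CA (no action)
    ({"A", "B"}, "A"),  # 1. B added as secondary; certs still issued by A
    ({"B", "A"}, "B"),  # 2. B promoted to primary; new certs issued by B
    ({"B"}, "B"),       # 3. A removed; only B remains
]

def can_connect(peer_a, peer_b):
    """An mTLS handshake succeeds only if each side trusts the other's issuer."""
    trust_a, issuer_a = peer_a
    trust_b, issuer_b = peer_b
    return issuer_b in trust_a and issuer_a in trust_b

# During a rollout peers may be one step apart - every adjacent pair works:
for earlier, later in zip(STEPS, STEPS[1:]):
    assert can_connect(earlier, later)

# Skipping a step (straight from "only A" to "B is primary") breaks mTLS:
print(can_connect(STEPS[0], STEPS[2]))  # False - issuer B isn't trusted yet
```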

Changing the CA source of truth


It follows the usual path:

  1. Consul value changes.
  2. PKI Auth Manager is notified.
  3. A new certificate is generated and stored on the local volume.
  4. The certificate is applied by the TLS Monitor.

A total of 3 certificate swaps later, and you are done!

This is how ZenDesk does it.


Hey. Did you like this type of content?

I have two simple requests. They take 4 seconds to do; writing this takes me 4 hours.

  1. Share it so that your network learns too!
  2. Follow me here for more - Stanislav Kozlovski


Tran Quang Minh

Engineering Manager II at Grab

7 months

When digging into mTLS with Kafka, I found this useful article. Thank you Stanislav Kozlovski. To add on, do you know how Uber (or other companies) mitigate/overcome the performance penalty that comes with mTLS? Thank you

Shay E.

Professional Pessimist & Eng @ CFLT Cloud's Kora

1 year

(Ty for playing “Shay imagines he’s a security engineer”)

Shay E.

Professional Pessimist & Eng @ CFLT Cloud's Kora

1 year

Could even be better. Here’s a scheme where all the components actually exist today, but none are integrated. Works not just for Kafka, of course, but for every system that uses public key cryptography to validate hosts belong to the organization.

  1. Generate the key on the host, preferably in a secure element, such that the private key never leaves it.
  2. (That’s the novel step) the host creates an attestation report, verifying both that it is the host you think it is, and the boot measurements.
  3. Send the pub key and report to the CA. The CA verifies the report and signs the key.
  4. The host uses the signed key to sign a key with very short expiry (minutes); that key is the one actually used.

Now you only need to invalidate keys when the CA key leaks. But everything below the signer in step 3 above is explicitly tied to a specific invocation, hence leak proof.

So where’s the catch?

  1. You need to tie all the components together. I’m certain someone did, but I don’t know of any off-the-shelf, public or private solution.
  2. If you’re running in your own DC, you need attestation support from your systems provider. I’m told that implementations leave a lot to be desired.
  3. The hyperscalers offer SEV-SNP and equivalent solutions, but charge extra for it.

Shay E.

Professional Pessimist & Eng @ CFLT Cloud's Kora

1 year

Am I reading correctly they generate the private keys remotely and send them to the broker? As they already built sidecars, they could do better by generating the cert locally, and then sending just the public part for the CA to sign, while the private part never leaves the host. The auth issue is the same as now (you need to have some mutual secret on the host to auth to the CA; you need to have some mutual secret on the host to auth to vault).
