ZenDesk's Kafka mTLS Setup

Why mTLS?

It’s simply a very appealing way of both:

  • encrypting

and

  • authenticating

your connections.

mTLS is a well-established technology to:

  1. prove the identity of both the sender and recipient of a message.
  2. assert that the message has not been viewed nor modified in transit by a third party.

The problem?

It requires extra infrastructure around it.

In particular - it requires extra infrastructure around Kafka in order to rotate certificates properly.

You NEED to be able to revoke certificates efficiently. This is important in cases where:

  • the certificate leaks (the equivalent of your password being leaked)
  • the certificate is stale enough that it requires a refresh (it is good security practice to refresh as frequently as practical).

Revocation

There are two ways to do this:

  • Certificate revocation list (CRL) - a signed blacklist of certificates that should NOT be used
  • Online Certificate Status Protocol (OCSP) - a protocol (over HTTP) for asking a CA whether a particular certificate has been revoked.

The problem is that these two ways aren’t consistently implemented by Certificate Authorities (CAs) and clients.

Certificate Authority - an organization you trust that tells you X certificate belongs to Y

Revocation Problems

Let's reason about the problems with revoking certificates from a web perspective - your browser and the public web.

In a closed internal system like a Kafka deployment, these problems aren't necessarily as severe. But they determine how widely adopted the protocols and standards are - and that is the real problem. And even in a private setup, there are still a ton of gotchas.

Various problems exist with both, e.g.:

CRL Size

CRLs are basically just lists of all of the certificates that a given Certificate Authority (CA) has issued that have been revoked.

This means that they’re often very large – easily the size of a whole movie.

This then slows down the initial connection due to the required download.
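To put a rough number on it - a back-of-the-envelope sketch (the per-entry size, revocation count, and link speed below are illustrative assumptions, not measurements):

```python
# Rough estimate of a CRL's size and its download cost on first connection.
ENTRY_BYTES = 40            # serial number + revocation time + DER overhead (assumed)
revoked_certs = 10_000_000  # a large CA can accumulate millions of revocations

crl_bytes = revoked_certs * ENTRY_BYTES
crl_mb = crl_bytes / 1_000_000
print(f"CRL size: ~{crl_mb:.0f} MB")  # ~400 MB at these assumptions

link_mbps = 50              # assumed client bandwidth, in megabits per second
download_seconds = crl_bytes * 8 / (link_mbps * 1_000_000)
print(f"Added to the first connection: ~{download_seconds:.0f} s")
```

At these assumed numbers, the list really is the size of a whole movie, and fetching it before the first handshake costs about a minute.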

OCSP Problems

OCSP is like:

"What if there were a separate CRL for every single certificate?"

Your browser can query the status for that particular certificate by contacting your CA's OCSP service. But then we get to three problems:

  • uptime - if the CA's service is down, you get no response. Browsers usually soft-fail in these cases and treat it as a "not revoked" response. Not secure.
  • cache - to reduce the load on the services and the response time, we cache the response for ~a week. This means you can trust a revoked certificate for up to 7 days after it's been revoked.
  • no privacy - with your certificate status request, you're basically sharing your browsing history with the CA.
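The cache problem is worth quantifying. A small sketch of the staleness window (the 7-day TTL mirrors the "~a week" figure above; the dates are made up):

```python
from datetime import datetime, timedelta

def trusted_until(cache_fetched_at: datetime, ttl: timedelta,
                  revoked_at: datetime) -> timedelta:
    """How long a client keeps trusting a certificate after its revocation,
    given it cached a 'good' OCSP response with the given TTL."""
    cache_expires = cache_fetched_at + ttl
    return max(cache_expires - revoked_at, timedelta(0))

# Cached a "good" response on Jan 1 with a 7-day TTL; cert revoked on Jan 2.
fetched = datetime(2024, 1, 1)
revoked = datetime(2024, 1, 2)
window = trusted_until(fetched, timedelta(days=7), revoked)
print(window.days)  # 6 -> almost a week of trusting a revoked certificate
```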

Perhaps due to these reasons, Google Chrome (63% market share) is the only major browser that does NOT support OCSP - it dropped it back in 2012, citing privacy and latency concerns, and implemented its own mechanism instead (CRLSets).

Soft Failing


And it's not an easy problem to solve. For the uptime problem - a naive approach would be to have browsers hard-fail during CA downtime (i.e. if you can't verify the certificate's status, treat the certificate as invalid).

That immediately gives you three problems:

  1. fragility - it changes incentives such that taking down an OCSP service is enough to take down a large part of the internet. DDoS attackers would flock to such a target, which would increase the cost of running a reliable OCSP service and make it less likely that CAs can afford to do so.
  2. captive portals - things like hotel WiFi networks that want you to "login" before you can use the internet. They frequently use HTTPS but don't allow you to access OCSP servers.
  3. random failures - even if the CA has 100% uptime, the public internet is unreliable enough that your request may still get dropped for some reason. That means you'd have random cases where a website just doesn't load.

Unfortunately - there is no out-of-the-box way to do revocation.

It seems like there still isn’t an industry-wide consensus on the proper way to handle certificate revocation.

And because of that, there is a lack of proper tooling and library support for revoking certificates in private mTLS setups.

The result?

Everybody implements mTLS in their own way.

Let’s see how ZenDesk did it!

Stan: There are a lot of interesting problems here. We will surely be posting more about it. Follow me at Stanislav Kozlovski and https://2minutestreaming.com/ to not miss it!

ZenDesk

For a CA, they chose Vault. It offers a PKI (public key infrastructure) backend which generates dynamic X.509 certificates for you.

For the source of truth regarding WHO the CA is, they store it in a globally-replicated key in Consul.

They then introduce two components with separate responsibilities:

  • a PKI Auth Manager sidecar to generate certificates remotely in Vault & store them locally on the node.
  • a TLS Monitor sidecar to watch for these local filesystem changes & make sure Kafka reloads the cert.

As they run in Kubernetes, they basically have a PKI Auth Manager sidecar on the Kafka pod that talks to Consul + Vault to generate a certificate. It then stores it in a local Secrets Volume.

This component also regenerates certificates before they expire, as well as handling miscellaneous things like converting the PEM-formatted certificate (taken from Vault) into the JVM-expected keystore format and emitting audit logs for observability.
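The "regenerate before expiry" logic is essentially a lifetime-threshold check. A minimal sketch - the 80% renewal point is an assumed policy, not necessarily ZenDesk's actual number:

```python
from datetime import datetime, timedelta

def should_renew(not_before: datetime, not_after: datetime,
                 now: datetime, renew_at: float = 0.8) -> bool:
    """Renew once `renew_at` of the certificate's lifetime has elapsed,
    leaving headroom so rotation finishes well before expiry."""
    lifetime = not_after - not_before
    return now >= not_before + lifetime * renew_at

issued = datetime(2024, 1, 1)
expires = issued + timedelta(days=30)      # renewal point lands at day 24
print(should_renew(issued, expires, issued + timedelta(days=10)))  # False
print(should_renew(issued, expires, issued + timedelta(days=25)))  # True
```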

They then have a TLS monitor that watches the Secrets Volume and is responsible for asking Kafka to reload the file.

Kafka allows you to reload its certificates without restarting it through the kafka-configs command, so that’s one simple API call.
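Concretely, that reload goes through Kafka's dynamic broker configs (KIP-226): re-applying the keystore path as a per-broker config prompts the broker to re-read the keystore file behind it. A sketch of the call - the broker id, listener name, and path here are made up:

```
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 0 \
  --alter --add-config \
  'listener.name.internal.ssl.keystore.location=/var/run/secrets/kafka.keystore.jks'
```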

As for the Kafka clients (their applications), they don’t seem to use the TLS monitor - something else reloads their certificates on the client side.

Regardless, the big question for them was:

how do we revoke certificates?

They did not want to use certificate revocation checking, because the JVM didn’t have “strong support” for revocation.

Kafka itself doesn’t support a way to check for revoked certificates via either CRL or OCSP, so their only other option was to enable revocation checking on the whole JVM via CRL.

That’s fine for the brokers - but for their applications, enabling it on the whole JVM means that every public HTTPS call their app makes will also go through the CRL certificate-checking process. They did not want that, considering that some of those calls use public certificates and others (the Kafka ones) private.
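For context, this is why it can't be scoped to just the Kafka connections - JVM revocation checking is toggled with global system properties that affect every TLS connection the process makes, e.g.:

```
# Applies to EVERY TLS connection the JVM makes, not just Kafka's:
-Dcom.sun.net.ssl.checkRevocation=true   # check revocation during certificate path validation
-Dcom.sun.security.enableCRLDP=true      # fetch CRLs from the certificate's CRL distribution points
```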

As we discussed, public certificate checking is widely broken - and trust me, there are many more gotchas.

In the end, they decided to NOT enable certificate revocation checking and simply revoke certificates by rotating the Root CA.

That way, when Kafka loads the new certificate from a different CA, the old one is automatically invalidated. (doesn't matter if it's not expired, the CA is different!)

Given this is a private network and they have two Root CAs that they trust, this is a viable solution.

Easy to implement?

Maybe.

The next problem to solve was:

how do you broadcast any CA changes sufficiently quickly?

They opted for Blocking Queries in Consul.

Those are essentially long-poll HTTP calls, similar to how Kafka consumers have a max fetch wait time. The GET API responds as soon as the key you’re querying changes, or after a minute passes.

This gives them a fast notification mechanism once the CA is changed.
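The blocking-query pattern itself is simple to sketch. Below, `fetch` stands in for Consul's `GET /v1/kv/<key>?index=<n>&wait=1m` call; the fake transport and key values are made up for illustration:

```python
def watch(fetch, on_change, max_polls):
    """Long-poll loop in the style of a Consul blocking query.

    `fetch(index)` blocks server-side until the watched key changes or a
    timeout (~1 minute) passes, then returns (value, new_index).
    """
    index, last_value = 0, None
    for _ in range(max_polls):
        value, index = fetch(index)
        if value != last_value:  # a timed-out poll returns the same value
            on_change(value)
            last_value = value

# Fake transport: the CA key starts as "ca-A", one poll times out with no
# change, then the key flips to "ca-B".
responses = iter([("ca-A", 1), ("ca-A", 1), ("ca-B", 2)])
seen = []
watch(lambda index: next(responses), seen.append, max_polls=3)
print(seen)  # ['ca-A', 'ca-B']
```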

Certificate Rotation

Now for the actual rotation!

How do you do it without downtime in the applications?

It’s not a simple switch - you have to be more methodical.

Their PKI Auth Manager and TLS Monitor work in tandem to rotate the certificate.

Changing the CA from root A to root B without downtime involves 3 important steps:

0. A is the CA. (no action)

  1. Add B as a secondary CA.
  2. Swap A and B, so that B is now the primary CA and A the secondary.
  3. Remove A as the secondary CA.

In each step, a new certificate is generated and used.
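Why three steps? Because during a rollout, peers are temporarily one step apart, and each side must still trust the other's issuer. A small simulation of the trust sets (a sketch, not ZenDesk's code):

```python
# Each step: (trusted root CAs, the CA that issues new certificates).
STEPS = [
    ({"A"}, "A"),       # 0. A is the only CA (no action)
    ({"A", "B"}, "A"),  # 1. B added as secondary; certs still issued by A
    ({"B", "A"}, "B"),  # 2. B promoted to primary; new certs issued by B
    ({"B"}, "B"),       # 3. A removed; only B remains
]

def can_connect(peer_a, peer_b):
    """An mTLS handshake succeeds only if each side trusts the other's issuer."""
    trust_a, issuer_a = peer_a
    trust_b, issuer_b = peer_b
    return issuer_b in trust_a and issuer_a in trust_b

# During a rollout peers may be one step apart - every adjacent pair works:
for earlier, later in zip(STEPS, STEPS[1:]):
    assert can_connect(earlier, later)

# Skipping a step (straight from "only A" to "B is primary") breaks mTLS:
print(can_connect(STEPS[0], STEPS[2]))  # False - issuer B isn't trusted yet
```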

Changing the CA source of truth


It follows the usual path:

  1. Consul value changes.
  2. PKI Auth Manager is notified.
  3. A new certificate is generated and stored on the local volume.
  4. The certificate is applied by the TLS Monitor.

A total of 3 certificate swaps later, and you are done!

This is how ZenDesk does it.


Hey. Did you like this type of content?

I have two simple requests. They take 4 seconds to do; writing this takes me 4 hours.

  1. Share it so that your network learns too!
  2. Follow me here for more - Stanislav Kozlovski


Tran Quang Minh

Engineering Manager II at Grab

7 months

When digging into mTLS with Kafka, I found this useful article. Thank you Stanislav Kozlovski. To add on, do you know how Uber (or other companies) mitigate/overcome the performance penalty that comes with mTLS? Thank you

Shay E.

Professional Pessimist & Eng @ CFLT Cloud's Kora

1 year

(Ty for playing “Shay imagines he’s a security engineer”)

Shay E.

Professional Pessimist & Eng @ CFLT Cloud's Kora

1 year

Could even be better. Here’s a scheme where all the components actually exist today, but none are integrated. Works not just for Kafka, of course, but for every system that uses public key cryptography to validate hosts belong to the organization.

  1. Generate the key on the host, preferably in a secure element, such that the private key never leaves it.
  2. (That’s the novel step) the host creates an attestation report, verifying both that it is the host you think it is, and the boot measurements.
  3. Send the pub key and report to the CA. The CA verifies the report and signs the key.
  4. The host uses the signed key to sign a key with very short expiry (minutes); that key is the one actually used.

Now you only need to invalidate keys when the CA key leaks. But everything below the signer in step 3 above is explicitly tied to a specific invocation, hence leak proof.

So where’s the catch?

  1. You need to tie all the components together. I’m certain someone did, but I don’t know of any off-the-shelf, public or private solution.
  2. If you’re running in your own DC, you need attestation support from your systems provider. I’m told that implementations leave a lot to be desired.
  3. The hyperscalers offer SEV-SNP and equivalent solutions, but charge extra for it.

Shay E.

Professional Pessimist & Eng @ CFLT Cloud's Kora

1 year

Am I reading correctly they generate the private keys remotely and send them to the broker? As they already built sidecars, they could do better by generating the cert locally, and then sending just the public part for the CA to sign, while the private part never leaves the host. The auth issue is the same as now (you need to have some mutual secret on the host to auth to the CA; you need to have some mutual secret on the host to auth to vault).
