ZenDesk's Kafka mTLS Setup
Why mTLS?
It’s simply a very appealing way of both authenticating and encrypting your connections.
mTLS is a well-established technology to:
The problem?
It requires extra infrastructure around it.
In particular - it requires extra infrastructure around Kafka in order to rotate certificates properly.
You NEED to be able to revoke certificates efficiently. This is important in cases where:
Revocation
There are two ways to do this: Certificate Revocation Lists (CRLs) and the Online Certificate Status Protocol (OCSP).
The problem is that these two ways aren’t consistently implemented by Certificate Authorities (CAs) and clients.
Certificate Authority - an organization you trust that tells you X certificate belongs to Y
Revocation Problems
Let's reason about the problems with revoking certificates from a web perspective - your browser and the public web.
In a closed internal system like Kafka, these aren't necessarily as big of a problem. But their presence determines how widely adopted the protocols and standards are - and that is the problem. In any case, that doesn't mean you don't have a ton of gotchas.
Various problems exist with both, e.g.:
CRLs Size
CRLs are basically just lists of all of the certificates that a given Certificate Authority (CA) has issued that have been revoked.
This means that they’re often very large – easily the size of a whole movie.
This then slows down the initial connection due to the required download.
OCSP Problems
OCSP is like:
"What if there were a separate CRL for every single certificate?"
Your browser can query the status for that particular certificate by contacting your CA's OCSP service. But then we get to three problems:
Perhaps for these reasons, Google Chrome (63% market share) is the only browser that has NOT supported OCSP since 2012, citing privacy and latency concerns. They implemented their own mechanism.
Soft Failing
And it's not an easy problem to solve. For the uptime problem - a naive approach would be to have browsers hard-fail during CA downtime. (i.e., if you can't verify the certificate's status, treat it as revoked)
That immediately gives you three problems:
Unfortunately - there is no out-of-the-box way to do revocation.
It seems like there still isn’t an industry-wide consensus on the proper way to handle certificate revocation.
And because of that, there is a lack of proper tooling and library support to revoke certificates for private mTLS setups.
The result?
Everybody implements mTLS in their own way.
Let’s see how ZenDesk did it!
Stan: There are a lot of interesting problems here. We will surely be posting more about it. Follow me at Stanislav Kozlovski and https://2minutestreaming.com/ to not miss it!
ZenDesk
For a CA, they chose Vault. It offers a PKI (public key infrastructure) backend which generates dynamic X.509 certificates for you.
For the source of truth regarding WHO the CA is, they store it using a globally-replicated key in Consul.
They then introduce two components with separate responsibilities:
As they run in Kubernetes, they basically have a PKI Auth Manager side-car on the Kafka pod that talks to Consul + Vault to generate a certificate. It then stores it in a local Secrets Volume.
This component also regenerates certificates before they expire, and handles other miscellaneous things like converting the PEM-formatted certificate (taken from Vault) to the JVM-expected keystore format and emitting audit logs for observability.
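To make "regenerate before expiry" concrete, here is a minimal sketch of the timing decision such a side-car has to make. The function names and the 70% threshold are assumptions for illustration, not something ZenDesk has described.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime,
                 fraction: float = 0.7) -> datetime:
    """Return the moment at which the certificate should be regenerated.

    `fraction` is a hypothetical tuning knob: 0.7 means "renew once 70% of
    the certificate's lifetime has elapsed".
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    lifetime = not_after - not_before
    return not_before + lifetime * fraction


def should_renew(not_before: datetime, not_after: datetime,
                 now: datetime, fraction: float = 0.7) -> bool:
    """True once `now` has passed the renewal point for this certificate."""
    return now >= renewal_time(not_before, not_after, fraction)


if __name__ == "__main__":
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=10)
    print(should_renew(issued, expires, issued + timedelta(days=6)))
    print(should_renew(issued, expires, issued + timedelta(days=8)))
```

Renewing well before expiry leaves room for retries if the CA is briefly unreachable.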
They then have a TLS monitor that watches the Secrets Volume and is responsible for asking Kafka to reload the file.
Kafka allows you to reload its certificates without restarting it through the kafka-configs command, so that’s one simple API call.
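A rough sketch of what such a monitor could look like: fingerprint the keystore file, and when it changes, build the kafka-configs invocation that re-sets the listener's keystore location so the broker reloads it. The function names, the listener name, and the paths are hypothetical; the sketch only builds the command rather than running it, and a real setup would likely also need `--command-config` with admin client credentials.

```python
import hashlib
from pathlib import Path
from typing import List, Optional, Tuple


def fingerprint(path: str) -> str:
    """SHA-256 of the keystore file, used to detect that it was replaced."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def reload_command(bootstrap_server: str, broker_id: int,
                   listener: str, keystore_path: str) -> List[str]:
    """Build a kafka-configs.sh call that re-applies the keystore location.

    Altering the per-listener keystore config (even to the same path) is the
    dynamic update that asks the broker to load the keystore again.
    """
    key = f"listener.name.{listener.lower()}.ssl.keystore.location"
    return [
        "kafka-configs.sh",
        "--bootstrap-server", bootstrap_server,
        "--entity-type", "brokers",
        "--entity-name", str(broker_id),
        "--alter",
        "--add-config", f"{key}={keystore_path}",
    ]


def check_once(path: str, last_fingerprint: Optional[str],
               bootstrap_server: str, broker_id: int,
               listener: str) -> Tuple[str, Optional[List[str]]]:
    """Return the current fingerprint and, if the file changed, the command."""
    current = fingerprint(path)
    if current == last_fingerprint:
        return current, None
    return current, reload_command(bootstrap_server, broker_id, listener, path)
```

A monitor would call `check_once` in a loop (or from a file-system notification) and pass any returned command to `subprocess.run`.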
As for the Kafka clients (their applications), they don’t seem to use the TLS monitor - something else reloads their certificates on the client side.
Regardless, the big question for them was:
how do we revoke certificates?
They did not want to rely on revocation checking because the JVM didn’t have “strong support” for it.
Kafka itself doesn’t support a way to check for revoked certificates via either CRL or OCSP, so their only other option was to enable revocation checking on the whole JVM via CRL.
That’s fine for the brokers - but for their applications, enabling it on the whole JVM means that every public HTTPS call that their app makes will also go through the CRL certificate-checking process. They did not want that, considering that some of those calls use public certificates and others (the Kafka ones) private.
As we discussed, public certificate checking is widely broken - and trust me there are many more gotchas.
In the end, they decided to NOT enable certificate revocation checking and simply revoke certificates by rotating the Root CA.
That way, when Kafka loads the new certificate from a different CA, the old one is automatically invalidated. (doesn't matter if it's not expired, the CA is different!)
Given this is a private network and they have two Root CAs that they trust, this is a viable solution.
Easy to implement?
Maybe.
The next problem to solve was:
how do you broadcast any CA changes sufficiently quickly?
They opted for Blocking Queries in Consul.
Those are essentially long-poll HTTP calls, similar to how Kafka consumers have a max fetch wait time. The GET API responds as soon as the key you’re querying changes, or after a minute passes.
This gives them a fast notification mechanism once the CA is changed.
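Here is a minimal sketch of that long-poll loop. The `index` and `wait` query parameters and the `/v1/kv/` path come from Consul's HTTP API. The `fetch` callable, the key name, and the host are assumptions: `fetch` stands in for an HTTP GET that returns the `X-Consul-Index` response header together with the key's value.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import quote, urlencode

# Hypothetical transport: takes a URL, returns (X-Consul-Index, value).
Fetch = Callable[[str], Tuple[int, Optional[bytes]]]


def blocking_url(base_url: str, key: str, index: int, wait: str = "60s") -> str:
    """Build a Consul KV blocking-query URL for `key`."""
    query = urlencode({"index": index, "wait": wait})
    return f"{base_url.rstrip('/')}/v1/kv/{quote(key)}?{query}"


def watch(fetch: Fetch, base_url: str, key: str,
          on_change: Callable[[Optional[bytes]], None], rounds: int) -> None:
    """Long-poll `key` for `rounds` requests, calling on_change on new values."""
    index = 0            # index=0 makes the first request return immediately
    last_value = None
    for _ in range(rounds):
        new_index, value = fetch(blocking_url(base_url, key, index))
        if new_index < index:
            index = 0    # the index went backwards: start over
            continue
        if new_index > index and value != last_value:
            on_change(value)
            last_value = value
        # Same index means the wait simply timed out with no change.
        index = new_index
```

In the worst case a watcher learns of a CA change as soon as the pending request returns, without hammering Consul with short polls.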
Certificate Rotation
Now for the actual rotation!
How do you do it without downtime in the applications?
It’s not a simple switch - you have to be more methodical.
Their PKI Auth Manager and TLS Monitor work in tandem to rotate the certificate.
Changing the CA from root A to root B without downtime involves 3 important steps:
0. A is the CA. (no action)
In each step, a new certificate is generated and used.
It follows the usual path:
A total of 3 certificate swaps later, and you are done!
This is how ZenDesk does it.
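To see why a single switch from A to B cannot work mid-rollout, here is a small simulation of one common three-swap pattern: first trust both roots, then issue from the new root, then drop the old root. The phase table is an assumption for illustration and is not confirmed as ZenDesk's exact sequence.

```python
from itertools import product

# Hypothetical phases: which roots a node trusts and which root signed its cert.
PHASES = [
    {"trust": {"A"}, "issuer": "A"},        # 0. A is the CA (no action)
    {"trust": {"A", "B"}, "issuer": "A"},   # 1. start trusting B as well
    {"trust": {"A", "B"}, "issuer": "B"},   # 2. start issuing from B
    {"trust": {"B"}, "issuer": "B"},        # 3. stop trusting A
]


def can_handshake(x: dict, y: dict) -> bool:
    """mTLS succeeds only if each side trusts the other's issuer."""
    return x["issuer"] in y["trust"] and y["issuer"] in x["trust"]


def safe_transition(old: dict, new: dict) -> bool:
    """During a rolling change, nodes in `old` and `new` coexist.

    The transition is safe if every pairing of the two states can handshake.
    """
    return all(can_handshake(x, y) for x, y in product([old, new], repeat=2))


if __name__ == "__main__":
    for i in range(len(PHASES) - 1):
        print(i, "->", i + 1, safe_transition(PHASES[i], PHASES[i + 1]))
    print("0 -> 3 directly:", safe_transition(PHASES[0], PHASES[3]))
```

Each adjacent step keeps every node able to talk to every other node, while jumping straight from phase 0 to phase 3 leaves old and new nodes unable to authenticate each other.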
Engineering Manager II at Grab
7 months ago: When digging for mTLS with Kafka, I found this useful article. Thank you, Stanislav Kozlovski. To add on, do you know how Uber (or other companies) can mitigate/overcome the performance penalty that comes with mTLS? Thank you.
Professional Pessimist & Eng @ CFLT Cloud's Kora
1 year ago: (Ty for playing “Shay imagines he’s a security engineer”)
Professional Pessimist & Eng @ CFLT Cloud's Kora
1 year ago: Could even be better. Here’s a scheme where all the components actually exist today, but none are integrated. Works not just for Kafka, ofc, but for every system that uses public key cryptography to validate hosts belong to the organization.
1. Generate key on host, preferably in a secure element, such that the private key never leaves it.
2. (That’s the novel step) host creates attestation report, verifying both it is the host you think it is, and the boot measurements.
3. Send pub key and report to CA. CA verifies the report and signs the key.
4. Host uses signed key to sign a key with very short expiry (minutes), that key is the one actually used.
Now you only need to invalidate keys when the CA key leaks. But everything below the signer in step 3 above is explicitly tied to specific invocation, hence leak proof.
So where’s the catch?
1. You need to tie all the components together. I’m certain someone did, but I don’t know of any off the shelf, public or private solution.
2. If you’re running in your own DC, you need attestation support from your systems provider. I’m told that implementations leave a lot to be desired.
3. The hyperscalers offer SEV-SNP and equivalent solutions, but charge extra for it.
Professional Pessimist & Eng @ CFLT Cloud's Kora
1 year ago: Am I reading correctly they generate the private keys remotely and send them to the broker? As they already built sidecars, they could do better by generating the cert locally, and then sending just the public part for the CA to sign, while the private part never leaves the host. The auth issue is the same as now (you need to have some mutual secret on the host to auth to the CA; you need to have some mutual secret on the host to auth to vault).