Step-by-Step Guide to Kafka Cluster Deployment Using Ansible
Written by: @thenameisvikash

The Mission: 3 Zookeepers, 3 Kafka Brokers, 100% Automation

Setting up a Kafka cluster used to be a real headache. Imagine trying to organize a big party where none of the guests speak the same language, and you'll get an idea of what it was like before Ansible came along.

I remember the first time I tried to set up a Kafka cluster by hand. It was like trying to solve a giant puzzle where the pieces kept changing shape. Dealing with Zookeeper and getting all the Kafka brokers to work together across different servers was no easy task.

But don't worry! This story isn't just about the problems I faced. It's about how Ansible came to the rescue, like a helpful friend, and made everything much easier. Ansible helped bring order to our messy data system and saved us from the nightmare of setting things up manually.

So, whether you've been working with computer systems for years or you're just starting out and feeling a bit lost, stick around. I think you'll find this story useful. You might laugh a bit, you'll probably feel relieved, and I bet you'll see Kafka clusters in a whole new light by the end.

The Task: Building a Better Data System

It all started with a simple request from my boss. He said, "We need a Kafka cluster that can handle all our data without breaking. And please make it easy to use. Can you do that?"

Sounds simple, right? Well, not quite. He was asking me to build a strong, expandable messaging system that could handle our growing data needs and be tough enough to survive mistakes from our team. Oh, and I had to make sure we could set it up the same way every time, because winging it and hoping for the best isn't a deployment strategy. Who would have guessed?

Why Ansible? Because I Value My Sanity (and My Weekends)

Look, I've done the whole "SSH into each server and manually configure things" dance before. Never. Again. Ansible lets me define my infrastructure as code, which means:

  1. I can version control my entire cluster setup. Rollbacks? No sweat.
  2. New team member? Here's the repo, now you know our entire setup. Onboarding time cut in half.
  3. Need to scale up? Just add a few lines to the inventory and rerun the playbook. Done in minutes, not days.

Plus, it's agentless. One less thing to install and maintain on my servers? Yes, please!
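
Point 3 is worth making concrete. Here's a minimal sketch of what a static YAML inventory for this layout could look like. The hostnames and IPs are placeholders I've invented, and the group names (zookeeper, kafka) are simply inferred from the Jinja expressions you'll see later, so treat the real files under inventory/ as the source of truth:

# Hypothetical inventory sketch - not one of the repo's actual files
all:
  children:
    zookeeper:
      hosts:
        zookeeper1: { ansible_host: 192.168.56.11 }
        zookeeper2: { ansible_host: 192.168.56.12 }
        zookeeper3: { ansible_host: 192.168.56.13 }
    kafka:
      hosts:
        kafka1: { ansible_host: 192.168.56.21 }
        kafka2: { ansible_host: 192.168.56.22 }
        kafka3: { ansible_host: 192.168.56.23 }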

The Repo: Where the Magic Happens

Alright, let's dig into the meat of this operation. First, clone the repo:

git clone https://github.com/thenameisvikash/kafkacluseterplaybook.git
cd kafkacluseterplaybook
git checkout master        

Here's what you're getting into:

.
├── group_vars
├── inventory
│   ├── aws.yml
│   ├── onprem_keybased.yml
│   ├── onprem_sudopassword.yml
│   ├── onprem_userpassword.yml
│   └── vagrant.yml
├── roles
├── Vagrantfile
├── ansible.cfg
├── deploy-kafka-cluster.yml
├── destroy-kafka-zookeeper-setup.yml
├── setup-zookeeper.yml
└── vagrant.yml        

Now, let's break down the key players in this Kafka drama.

Zookeeper: The Unsung Hero

First up, Zookeeper. It's like the air traffic controller of our Kafka cluster. Here's a snippet from our Zookeeper configuration:

---
zookeeper:
  env:
    ZOO_MY_ID: "{{ groups['zookeeper'].index(inventory_hostname) + 1 }}"
    ZOO_SERVERS: "{{ zookeeper_servers }}"
    ZOO_CFG_EXTRA: |
      quorumListenOnAllIPs=true
      clientPort={{ zookeeper_client_port }}
    ZOO_TICK_TIME: "{{ zookeeper_tick_time }}"
    ZOO_INIT_LIMIT: "{{ zookeeper_init_limit }}"
    ZOO_SYNC_LIMIT: "{{ zookeeper_sync_limit }}"        

Let's break this down:

  • ZOO_MY_ID: This is crucial. Each Zookeeper instance needs a unique ID. We're using Ansible's magic to automatically assign these based on the inventory order. No more manually setting IDs!
  • ZOO_SERVERS: This is where we define our Zookeeper ensemble. The format is server.id=host:port:port. The first port (2888 in our case) is for follower connections to the leader, and the second (3888) is for leader election.
  • quorumListenOnAllIPs=true: This little gem allows Zookeeper to listen on all IPs for quorum and leader election traffic. Crucial when you're dealing with Docker or complex networking setups.
  • ZOO_TICK_TIME, ZOO_INIT_LIMIT, ZOO_SYNC_LIMIT: These are timing configurations. TICK_TIME is the basic time unit in milliseconds, INIT_LIMIT is the number of ticks a follower gets to connect to the leader and finish its initial sync, and SYNC_LIMIT is the number of ticks a follower may lag behind the leader before it's dropped from the ensemble.

Why these values? Well, TICK_TIME at 2000ms (2 seconds) is a good default. The INIT_LIMIT of 10 means we allow 20 seconds for initial syncing, which is usually plenty. SYNC_LIMIT at 5 gives us 10 seconds for syncing, which balances between allowing enough time for syncs and detecting failures quickly.
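
For reference, the zookeeper_* variables referenced above live in group_vars. Here's a rough sketch of plausible values, consistent with the numbers discussed in this section but not necessarily identical to the repo's actual file:

---
zookeeper_client_port: 2181
zookeeper_tick_time: 2000   # basic time unit, in ms
zookeeper_init_limit: 10    # 10 ticks = 20 seconds for the initial follower sync
zookeeper_sync_limit: 5     # 5 ticks = 10 seconds before a lagging follower is dropped
# One server.<id>=<host>:2888:3888 entry per ensemble member (hostnames are placeholders)
zookeeper_servers: "server.1=zookeeper1:2888:3888 server.2=zookeeper2:2888:3888 server.3=zookeeper3:2888:3888"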

Kafka: The Main Event

Now, onto Kafka itself. Here's where things get really interesting:

---
kafka:
  env:
    KAFKA_BROKER_ID: "{{ groups['kafka'].index(inventory_hostname) + 1 }}"
    KAFKA_ZOOKEEPER_CONNECT: "{{ kafka_zookeeper_connect }}"
    KAFKA_ADVERTISED_LISTENERS: "PLAINTEXT://{{ ansible_host }}:{{ kafka_port }}"
    KAFKA_LISTENERS: "PLAINTEXT://0.0.0.0:{{ kafka_port }}"
    KAFKA_LOG_SEGMENT_BYTES: "{{ kafka_log_segment_bytes }}"
    KAFKA_LOG_RETENTION_BYTES: "{{ kafka_log_retention_bytes }}"
    KAFKA_LOG_RETENTION_MS: "{{ kafka_log_retention_ms }}"
    KAFKA_OPTS: "-javaagent:/usr/app/jmx_prometheus_javaagent-0.16.1.jar=9582:/usr/app/prometheus-config.yml"        

Let's dissect this:

  • KAFKA_BROKER_ID: Like Zookeeper, each Kafka broker needs a unique ID. Again, we're letting Ansible handle this automatically.
  • KAFKA_ZOOKEEPER_CONNECT: This is how our Kafka brokers find Zookeeper. We're listing all three Zookeeper instances for fault tolerance.
  • KAFKA_ADVERTISED_LISTENERS: This is the address that Kafka advertises to clients. Using {{ ansible_host }} means we're using the actual IP of the server, not a Docker internal IP.
  • KAFKA_LISTENERS: This tells Kafka to listen on all network interfaces inside the container.
  • KAFKA_LOG_SEGMENT_BYTES: We're setting this to 1GB (1073741824 bytes). This is the maximum size of a single log file. Why 1GB? It's a balance between not having too many small files (which can hurt performance) and not having files so large they're unwieldy.
  • KAFKA_LOG_RETENTION_BYTES and KAFKA_LOG_RETENTION_MS: These control how long Kafka keeps data. We're setting bytes to -1 (unlimited) and MS to 604800000 (1 week). This means we'll keep a week's worth of data regardless of size. Adjust these based on your data volume and retention needs!
  • KAFKA_OPTS: This sets up Prometheus monitoring for Kafka. Because let's face it, a Kafka cluster without monitoring is like driving blindfolded.
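
To tie those settings back to concrete numbers, the matching kafka_* variables in group_vars might look roughly like this (again an illustrative sketch, with a placeholder Zookeeper connect string, rather than the repo's exact values):

---
kafka_port: 9092
kafka_zookeeper_connect: "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181"
kafka_log_segment_bytes: 1073741824   # 1 GB per log segment
kafka_log_retention_bytes: -1         # no size-based limit
kafka_log_retention_ms: 604800000     # 7 days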

Monitoring: Keeping an Eye on the Beast

Speaking of monitoring, let's dive a bit deeper into our Prometheus setup. We're using the JMX Exporter to expose Kafka metrics to Prometheus. Here's a basic Prometheus configuration to get you started:

---
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka1:9582', 'kafka2:9582', 'kafka3:9582']

  - job_name: 'zookeeper'
    static_configs:
      - targets: ['zookeeper1:9582', 'zookeeper2:9582', 'zookeeper3:9582']        

This configuration tells Prometheus to scrape metrics from our Kafka and Zookeeper instances every 15 seconds. You'll want to set up alerts for things like:

  • Under-replicated partitions
  • Offline partitions
  • Consumer group lag
  • Broker CPU and memory usage

Remember, a monitored Kafka cluster is a happy Kafka cluster!
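
To give you a flavour, here's a minimal alerting rule for under-replicated partitions. The metric name depends entirely on the JMX exporter rules you ship in prometheus-config.yml, so treat kafka_server_replicamanager_underreplicatedpartitions as an assumption and check what your exporter actually exposes:

---
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        # Metric name assumes typical JMX exporter rules; adjust to your setup
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has under-replicated partitions"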

The "Aha!" Moment: Dynamic Heap Size

Here's something I'm particularly proud of:

- name: Prompt for Kafka heap size if needed
  pause:
    prompt: "Enter Kafka heap size (e.g., 2G, 4G): "
  register: user_heap_size
  when: configure_heap | lower == 'yes'

- name: Set Kafka heap size
  set_fact:
    kafka_heap_opts: "-Xmx{{ user_heap_size.user_input }} -Xms{{ user_heap_size.user_input }}"
  when: configure_heap | lower == 'yes'        

This lets us set Kafka's heap size dynamically during deployment. Why? Because every environment is different. A dev laptop doesn't need the same heap as a beefy prod server. This flexibility has saved my bacon more than once.
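
For completeness: the role presumably feeds that fact into the broker's environment. In the Docker-image style used above, that wiring would look something like the line below. KAFKA_HEAP_OPTS is Kafka's standard heap variable, but exactly where it's set inside the role is an assumption on my part:

# Hypothetical excerpt from the Kafka role's env block
KAFKA_HEAP_OPTS: "{{ kafka_heap_opts | default('-Xmx1G -Xms1G') }}"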

Running This Beast

Now, you've got options. Because I'm a firm believer in "test locally, deploy globally," I've included a Vagrant setup. Here's how you use it:

  1. Spin up the Vagrant environment.
  2. Once everything's up and running, deploy the cluster.
  3. If you want to tear it all down (maybe you broke something, I won't judge), run the destroy playbook.
  4. To completely obliterate the Vagrant environment, destroy the VMs.
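
Roughly, those four steps translate to the commands below. The playbook and inventory file names come straight from the repo tree above, but the exact invocation is my best guess, so double-check against the repo's README before running anything:

# 1. Spin up the local VMs
vagrant up

# 2. Deploy Zookeeper, then the Kafka brokers, against the Vagrant inventory
ansible-playbook -i inventory/vagrant.yml setup-zookeeper.yml
ansible-playbook -i inventory/vagrant.yml deploy-kafka-cluster.yml

# 3. Tear the Kafka/Zookeeper setup back down
ansible-playbook -i inventory/vagrant.yml destroy-kafka-zookeeper-setup.yml

# 4. Obliterate the Vagrant environment entirely
vagrant destroy -f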

But wait, there's more! If you're feeling brave and want to deploy to actual servers, check out the inventory directory. You've got options for AWS, on-prem with key-based auth, and on-prem with password auth (both sudo and user). Choose your fighter and update the inventory file accordingly.

Security Measures: Locking Down Fort Kafka

Now, let's talk security. Our playbook sets up a basic cluster, but in production, you'll want to add some serious locks and alarms. Here's a quick rundown:

  1. Enable SSL/TLS: Encrypt all communication between clients and brokers by updating server.properties.
  2. Set up SASL: Implement authentication, again via server.properties.
  3. Configure ACLs: Control who can do what using the kafka-acls.sh script.
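
Here are hedged examples of all three. They use standard Kafka configuration keys and tooling rather than anything specific to this playbook, and the paths, passwords, principals and topic names are placeholders:

# 1. SSL/TLS in server.properties (keystore/truststore paths and passwords are placeholders)
listeners=SSL://0.0.0.0:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/var/private/ssl/kafka.server.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/var/private/ssl/kafka.server.truststore.jks
ssl.truststore.password=changeit

# 2. SASL authentication in server.properties (SCRAM chosen as an example mechanism)
listeners=SASL_SSL://0.0.0.0:9093
security.inter.broker.protocol=SASL_SSL
sasl.enabled.mechanisms=SCRAM-SHA-512
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512

# 3. ACLs: allow one principal to read and write a single topic
#    (requires authorizer.class.name=kafka.security.authorizer.AclAuthorizer in server.properties)
kafka-acls.sh --authorizer-properties zookeeper.connect=zookeeper1:2181 \
  --add --allow-principal User:orders-service \
  --operation Read --operation Write \
  --topic orders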

Remember, security is not a one-time setup. It's an ongoing process. Keep your systems updated, rotate credentials regularly, and always follow the principle of least privilege.

Lessons Learned (The Hard Way)

  1. Start small, scale up: Begin with modest LOG_RETENTION settings. You can always increase them, but decreasing them means potentially losing data.
  2. Monitor, monitor, monitor: Set up those Prometheus alerts. Trust me, you want to know about under-replicated partitions before your users do.
  3. Security isn't optional: As we discussed, add those security measures. Your future self (and your company's legal team) will thank you.
  4. Backup Zookeeper: Seriously. A corrupted Zookeeper ensemble can ruin your whole week. Set up regular backups and test your restore process.
  5. Document everything: Your 3 AM self will thank you when trying to debug an issue in production.

Wrapping Up

There you have it - a production-grade Kafka cluster, deployed with Ansible, explained by someone who's been in the trenches. Is it perfect? Probably not. But it's a solid start, and more importantly, it's automated and reproducible.

Remember, DevOps is a journey, not a destination. Start with this setup, watch it like a hawk, and iterate. Your perfect Kafka cluster is out there, and this playbook is your first step towards it.

Now, if you'll excuse me, I have some logs to check and a strong drink to pour. Happy clustering, and may your topics always be perfectly partitioned!

Got war stories of your own? Improvements to suggest? Found a bug that made you question your life choices? Drop an issue in the repo. After all, misery... I mean, DevOps, loves company!
