How Slack Runs Cron Scripts Reliably At Scale

Cron jobs are scheduled tasks that run automatically on a server at specific times or intervals. They're often used to perform routine maintenance or automate repetitive tasks. For example, cron jobs can back up a database every night, generate reports at the end of the day, or clear out temporary files once a week.
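
To make these examples concrete, here is a minimal sketch of such schedules expressed as standard five-field cron expressions (minute, hour, day of month, month, day of week) and registered with a Go cron library. The use of robfig/cron and the specific times are assumptions for illustration only.

```go
package main

import (
	"log"

	"github.com/robfig/cron/v3" // widely used third-party cron library, assumed here for illustration
)

func main() {
	c := cron.New() // accepts standard 5-field cron expressions

	// Back up a database every night at 02:00.
	c.AddFunc("0 2 * * *", func() { log.Println("running nightly database backup") })

	// Generate reports at the end of each day, at 18:00.
	c.AddFunc("0 18 * * *", func() { log.Println("generating end-of-day report") })

	// Clear out temporary files once a week, on Sunday at 03:00.
	c.AddFunc("0 3 * * 0", func() { log.Println("clearing temporary files") })

	c.Start()
	select {} // block forever so the scheduler keeps running
}
```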

Slack uses cron scripts to handle tasks like sending reminders, delivering emails, and cleaning up databases. As the number of scripts and the amount of data they processed grew, the scripts became less reliable and harder to manage. This made Slack realize it needed a better system to run these scripts reliably and at a larger scale.


How it was before

When Slack first started running cron scripts, the setup was straightforward: a single server held all the scripts, and one crontab file managed their schedules. That server ran each script on its schedule. As the number of scripts and the amount of data they processed increased, Slack kept upgrading to bigger servers with more CPU and RAM to keep things working.

However, this setup wasn't very reliable. If anything went wrong with that one server, it could stop some key Slack functions from working. After repeatedly patching up the system, they decided it was time to build a new, more reliable, and scalable cron execution service.


The New System

The new service has three main parts:

  • A new Golang service called the “Scheduled Job Conductor,” which runs on Bedrock, Slack’s system built on Kubernetes.
  • Slack’s “Job Queue,” an asynchronous platform that handles a large volume of work quickly and efficiently.
  • A Vitess database table for tracking and monitoring jobs, making it easy to see when jobs run and whether they fail.

Core components of the new system

Scheduled Job Conductor

Slack uses a Golang cron library together with Bedrock, its Kubernetes-based platform, to run multiple pods. Kubernetes Leader Election designates one pod as the scheduler while the others remain on standby. To keep transitions smooth, Slack avoids shutting down the active pod right at the start of a new minute: cron jobs are typically scheduled to fire at the top of the minute, and a leadership change at that moment could disrupt their timing.
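
Below is a minimal sketch of that pattern, assuming the client-go leaderelection package and a Lease lock: every pod runs the same binary, but only the current lease holder starts its scheduler loop. The lease name, namespace, and timing values are illustrative, not Slack’s actual configuration.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// In-cluster config: the pod talks to the Kubernetes API it runs under.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each pod identifies itself by hostname when competing for the lease.
	id, _ := os.Hostname()

	// A Lease object acts as the lock; only the pod holding it schedules jobs.
	// The lease name and namespace are placeholders, not Slack's real values.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "scheduled-job-conductor", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the leader runs the cron scheduler (for example, a cron
				// library instance loaded with all script schedules) and
				// enqueues jobs for execution.
				log.Println("became leader: starting the cron scheduler")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Stop scheduling immediately; a standby pod takes over the lease.
				log.Println("lost leadership: stopping the cron scheduler")
			},
		},
	})
}
```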

Only the leader sends jobs for execution

It might seem better to have every pod handle jobs, since that would avoid a single point of failure and spread out the workload. However, Slack decided that keeping the pods synchronized would be too complicated, and that relying on a single active scheduler is acceptable for two reasons:

  1. Pods can switch leaders quickly, so downtime is unlikely.
  2. Most of the heavy lifting of running scripts is handled by Slack’s Job Queue, with the pods mainly focusing on scheduling.


Job Queue

Slack’s Job Queue is an asynchronous system that processes about 9 billion jobs each day. Jobs pass through a series of “queues”: first into Kafka, which provides durable storage in case of failures or backlogs; then into Redis, which holds the job briefly along with metadata such as which worker is running it; and finally to a “job worker”, a node that actually executes the job. In this setup, each job is a single cron script, and even though the system is asynchronous, each job is processed quickly once a worker picks it up.
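
As a rough illustration of the first hop in that pipeline, the sketch below publishes one scheduled script run as a Kafka message using the segmentio/kafka-go client. The broker address, topic name, and message shape are assumptions made for this example, not details of Slack’s actual Job Queue.

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go" // a common Go Kafka client, assumed for illustration
)

// cronJobMessage is a hypothetical payload: one message per scheduled script run.
type cronJobMessage struct {
	ScriptName  string    `json:"script_name"`
	ScheduledAt time.Time `json:"scheduled_at"`
}

func main() {
	// Writer pointed at placeholder broker and topic names.
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "cron-jobs",
	}
	defer w.Close()

	payload, err := json.Marshal(cronJobMessage{
		ScriptName:  "cleanup_temp_files",
		ScheduledAt: time.Now(),
	})
	if err != nil {
		log.Fatal(err)
	}

	// The scheduler only enqueues; Kafka's durable log protects the job if
	// anything downstream (Redis or the workers) is failing or backed up.
	if err := w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte("cleanup_temp_files"), Value: payload},
	); err != nil {
		log.Fatal(err)
	}
}
```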

Inside the Job Queue

Vitess Database Table

Completing the new service, a Vitess table handles duplicate-job prevention and job-status tracking. The old cron system used the Linux utility flock to ensure that only one copy of a script ran at a time. This usually worked, but when a script ran longer than its scheduled interval, two copies could end up running at once. In the new system, each job run is recorded in the table and its status is updated as it progresses, so before starting a new run the service can check the table to see whether that job is already running. An index on script names keeps these checks fast.
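
To make that check concrete, here is a minimal sketch in Go using database/sql with a MySQL driver (Vitess speaks the MySQL protocol). The table name job_runs, its columns, and the connection string are illustrative assumptions rather than Slack’s actual schema, and a production version would also need a unique key or row lock to make the check and insert atomic.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL-protocol driver; Vitess is MySQL-compatible
)

// tryStartJob records a new run of the given script unless one is already in progress.
func tryStartJob(db *sql.DB, scriptName string) (bool, error) {
	tx, err := db.Begin()
	if err != nil {
		return false, err
	}
	defer tx.Rollback() // no-op if the transaction was committed

	// The index on script_name keeps this lookup cheap even with many rows.
	var running int
	err = tx.QueryRow(
		`SELECT COUNT(*) FROM job_runs WHERE script_name = ? AND status = 'running'`,
		scriptName,
	).Scan(&running)
	if err != nil {
		return false, err
	}
	if running > 0 {
		return false, nil // a previous run is still going; skip this one
	}

	// Record the new run so later status updates (and other checks) can see it.
	_, err = tx.Exec(
		`INSERT INTO job_runs (script_name, status, started_at) VALUES (?, 'running', NOW())`,
		scriptName,
	)
	if err != nil {
		return false, err
	}
	return true, tx.Commit()
}

func main() {
	// The DSN is a placeholder; a real deployment would point at a Vitess VTGate.
	db, err := sql.Open("mysql", "user:password@tcp(vtgate:3306)/jobs")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ok, err := tryStartJob(db, "cleanup_temp_files")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("started new run:", ok)
}
```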


In summary, Slack’s new system for running cron scripts is more reliable, handles more work, and is easier to use. Although the old crontab on a single server worked for a while, it began to cause problems and couldn’t keep up with Slack’s growth. The new system gives Slack the flexibility to scale both now and in the future.

