Kafka Schema Registry
Saral Saxena
11K+ Followers | LinkedIn Top Voice | Associate Director | 15+ Years in Java, Microservices, Kafka, Spring Boot, Cloud Technologies (AWS, GCP) | Agile, K8s, DevOps & CI/CD Expert
Think of Kafka as a postman who only delivers letters but doesn't have a clue about what's written inside. Kafka transfers data solely in a byte format, completely unaware of the type of data it's handling. On the other hand, our producers and consumers need to know the type of information they're dealing with to process it accurately. It's kind of like knowing whether the letter is a bill, a greeting card, or a love letter!
Now, when a producer sends a message to a Kafka cluster, it can choose the format of the data. It's like choosing the style of writing, the language, or even the color of the ink! But here's the catch - whenever you decide to change the style (or the schema of messages), everyone involved - every consumer and producer - needs to be in the loop.
This is where schemas and the schema registry come into play.
In this article, we will explore the role of the Schema Registry in the Kafka world: a separate component that stands apart from the Kafka broker and ensures the consistency of messages exchanged between Kafka producers and consumers. We will start with the fundamentals, data serialization and deserialization, along with different data formats. Next, we will discuss the importance of schemas and how schema registries assist developers in sharing schemas between producers and consumers.
Data serialization is the process of converting complex data structures, such as objects or data arrays, into a format that can be easily stored, transmitted, or reconstructed later. Typically, this format is a stream of bytes that can be written to a file or transmitted over a network.
The serialized data needs to be deserialized, or reconstructed back into its original form before it can be used again. Deserialization involves the reconstruction of the original data structure from the byte stream.
A serializer is a software component responsible for serialization, while a deserializer handles deserialization.
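To make this concrete, here is a minimal sketch of serialization and deserialization in Java using Jackson (2.12+ for record support); the Order record and its fields are hypothetical, chosen only for illustration.

import com.fasterxml.jackson.databind.ObjectMapper;

public class SerdeSketch {

    // A hypothetical data structure we want to store or transmit.
    public record Order(int orderId, int quantity) {}

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Serialization: object -> byte stream that can be written to a file
        // or sent over a network.
        byte[] bytes = mapper.writeValueAsBytes(new Order(42, 3));

        // Deserialization: byte stream -> the original data structure.
        Order restored = mapper.readValue(bytes, Order.class);
        System.out.println(restored); // Order[orderId=42, quantity=3]
    }
}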
Schemas - a mutual understanding between applications
We utilize various data formats to represent data, such as JSON, XML, Avro, Google Protocol Buffers (Protobuf), and YAML. Having a well-defined format is like having a common language that all developers can understand and use, facilitating collaboration and communication.
However, formats only define how data should be structured, not what the data should contain.
This is where schemas come into play.
A schema is like a blueprint of how the data should be constructed. It defines the type of data (e.g., integer, string, date), the order of data, and whether the data is mandatory or optional. By defining these aspects, schemas provide a much more robust and detailed definition of the data than formats alone.
In Kafka, a message is composed of a key and a value, and different serializers and deserializers (SerDes) can be specified for each. These SerDes are part of the language-specific SDK and support data formats including Apache Avro, JSON Schema, and Google's Protobuf. For example, a producer declares its key and value serializers through configuration, as in the sketch below.
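A minimal sketch of such a configuration, assuming Confluent's Avro serializer and a broker and registry running locally (both addresses are hypothetical details of a local setup):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AvroProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Keys are plain strings; values are serialized with Avro.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // The Avro serializer looks up and registers schemas here.
        props.put("schema.registry.url", "http://localhost:8081");

        KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
        // ... build and send records, then close the producer.
        producer.close();
    }
}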
Schema registry for Kafka
A schema registry is a central repository where schemas are stored. It provides producers and consumers with APIs to register, discover, and retrieve schemas during data serialization and deserialization.
In a typical Kafka deployment, the schema registry is an independent application component. You need to deploy and manage it separately from the broker runtime. Kafka producer and consumer applications communicate with the schema registry using APIs exposed by the registry. The registry typically operates through HTTP(S) and listens on port 8081 by default for RESTful API calls.
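To illustrate, the registry can be called directly over HTTP. The sketch below registers a trivial schema and retrieves it by ID with Java's built-in HTTP client; the endpoints follow Confluent Schema Registry's REST API, while the subject name, schema, and ID are hypothetical:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegistryRestSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Register a schema under the subject "customer_orders-value".
        // The schema itself travels as an escaped string inside a JSON envelope.
        String body = "{\"schema\": \"{\\\"type\\\": \\\"string\\\"}\"}";
        HttpRequest register = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/subjects/customer_orders-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // The response carries the assigned schema ID, e.g. {"id":1}.
        System.out.println(client.send(register, HttpResponse.BodyHandlers.ofString()).body());

        // Retrieve a schema by its ID.
        HttpRequest fetch = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/schemas/ids/1"))
                .build();
        System.out.println(client.send(fetch, HttpResponse.BodyHandlers.ofString()).body());
    }
}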
Why do you need a schema registry?
When a producer sends a message, it serializes the message using a provided schema. The consumer requires this schema to deserialize the message. But how can the producer share the schema with the consumer?
The schema registry helps both producers and consumers by offering a common location to share schemas. Having a common repository for schemas eliminates the need to embed schemas in each message or share schemas manually, both of which can be inefficient or chaotic.
Schema registry information hierarchy
A schema registry maintains a hierarchy of information for keeping track of subjects, schemas, and their versions.
When a new schema is registered with the schema registry, it’s always associated with a subject, representing a unique namespace within the registry. Multiple versions of the same schema can be registered under the same subject, and a unique schema ID identifies each version. The subject name is used to organize schemas and ensure a unique identifier for each schema within the registry.
To put this into perspective, consider the following example:
Let's assume we have a Kafka topic called customer_orders, storing customer orders. The initial version of the schema (version 1) for the messages in this topic might look like this:
{
  "type": "record",
  "name": "CustomerOrder",
  "fields": [
    {"name": "order_id", "type": "int"},
    {"name": "customer_id", "type": "int"},
    {"name": "product_id", "type": "int"},
    {"name": "quantity", "type": "int"}
  ]
}
This schema is registered under the subject customer_orders-value, indicating that it describes the value portion of messages in the customer_orders topic. (Under the default subject-naming strategy, subjects are named after the topic with a -key or -value suffix.)
Now, suppose we want to add a timestamp field to our messages to track when each order was placed. We would create a new version of the schema (version 2), like this:
{
  "type": "record",
  "name": "CustomerOrder",
  "fields": [
    {"name": "order_id", "type": "int"},
    {"name": "customer_id", "type": "int"},
    {"name": "product_id", "type": "int"},
    {"name": "quantity", "type": "int"},
    {"name": "timestamp", "type": "long"}
  ]
}
This new schema would also be registered under the subject customer_orders-value. Producers would start using the new schema to serialize messages, and consumers would use it to deserialize messages.
Meanwhile, the old schema (version 1) would also remain registered under the same subject, allowing consumers to continue processing older messages that were serialized with the old schema.
This is how a subject can have multiple versions of a schema over time as the schema evolves.
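The versions under a subject can also be inspected programmatically. The sketch below assumes Confluent's kafka-schema-registry-client library, a registry at localhost:8081, and the hypothetical subject from our example:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;

public class SubjectVersionsSketch {
    public static void main(String[] args) throws Exception {
        // A client that caches schemas locally (capacity 100).
        CachedSchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // All versions registered under the subject, e.g. [1, 2].
        System.out.println(client.getAllVersions("customer_orders-value"));

        // Metadata for the latest version: version number and schema text.
        SchemaMetadata latest = client.getLatestSchemaMetadata("customer_orders-value");
        System.out.println(latest.getVersion() + " -> " + latest.getSchema());
    }
}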
How does the schema registry work?
When a Kafka producer wants to send a message, it passes the message to the appropriate key/value serializer. The serializer then determines which schema version to use for serialization.
To do this, the serializer first checks its local schema cache for the schema ID of the given subject. If the schema ID isn't in the cache, the serializer registers the schema with the schema registry and receives the resulting schema ID in the response.
In either case, the serializer now has the schema ID, and it prepends a small header to the beginning of the message, containing:
- the magic byte (currently always 0), which identifies the wire format version
- the 4-byte schema ID assigned by the registry
This applies equally to the key and value of the message.
Finally, with the schema in hand, the serializer serializes the message and returns the byte sequence to the producer. The producer then publishes this byte sequence to the broker.
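A sketch of that framing step, assuming Confluent's wire format (one magic byte, then a 4-byte big-endian schema ID, then the serialized payload):

import java.nio.ByteBuffer;

public class WireFormatSketch {
    // Prefix a serialized payload with the magic byte and schema ID,
    // mirroring how the serializer frames each key or value.
    static byte[] frame(int schemaId, byte[] serializedPayload) {
        return ByteBuffer.allocate(1 + 4 + serializedPayload.length)
                .put((byte) 0)           // magic byte: wire format version 0
                .putInt(schemaId)        // 4-byte schema ID (big-endian)
                .put(serializedPayload)  // the Avro/JSON/Protobuf-encoded data
                .array();
    }

    public static void main(String[] args) {
        byte[] framed = frame(1, new byte[] {42});
        System.out.println(framed.length); // 6 = 1 magic + 4 ID + 1 payload
    }
}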
On the other hand, when a consumer receives a message, the message is passed to the deserializer. The deserializer first checks for the magic byte and rejects the message if it is absent.
The deserializer then reads the schema ID and checks whether the corresponding schema exists in its local cache. If it does, deserialization proceeds with that schema; otherwise, the deserializer retrieves the schema from the registry by its schema ID. Once the schema is in place, the deserializer completes the deserialization.
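On the consumer side, the mirror-image configuration might look like this sketch, again assuming Confluent's Avro deserializer and a local broker and registry; the group ID and topic are the hypothetical ones from our example:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AvroConsumerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");

        // Keys are plain strings; values are deserialized with Avro,
        // using the schema ID embedded in each message.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaConsumer<String, Object> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer_orders"));
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(record -> System.out.println(record.value()));
        }
    }
}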
The schema registry thus enables the serializer to register a schema and embed its schema ID into each message; the deserializer then uses that schema ID to retrieve the schema from the registry during deserialization. This approach eliminates the need to embed full schemas in each message or share schemas manually, both of which can be inefficient or chaotic when schemas evolve.
In addition, a schema registry supports schema evolution, allowing multiple versions of a schema to exist simultaneously. This is crucial in situations where the data structure changes over time but older versions of the data still need to be processed correctly.
Furthermore, a schema registry enhances data quality and consistency across different applications by ensuring that all producers and consumers adhere to the same schema. This minimizes the risk of data loss or corruption due to schema mismatch.
In conclusion, the schema registry is an indispensable tool in the Kafka ecosystem, ensuring data consistency, supporting schema evolution, and improving overall data quality.