Kafka Schema Registry
Saral Saxena
11K+ Followers | LinkedIn Top Voice | Associate Director | 15+ Years in Java, Microservices, Kafka, Spring Boot, Cloud Technologies (AWS, GCP) | Agile, K8s, DevOps & CI/CD Expert
Think of Kafka as a postman who only delivers letters but doesn't have a clue about what's written inside. Kafka transfers data solely in a byte format, completely unaware of the type of data it's handling. On the other hand, our producers and consumers need to know the type of information they're dealing with to process it accurately. It's kind of like knowing whether the letter is a bill, a greeting card, or a love letter!
Now, when a producer sends a message to a Kafka cluster, it can choose the format of the data. It's like choosing the style of writing, the language, or even the color of the ink! But here's the catch - whenever you decide to change the style (or the schema of messages), everyone involved - every consumer and producer - needs to be in the loop.
This is where schemas and the schema registry come into play.
In this article, we will explore the role of the Schema Registry in the Kafka world: a separate component that stands apart from the Kafka broker and ensures the consistency of messages exchanged between Kafka producers and consumers. We will start with the fundamentals, data serialization and deserialization, along with different data formats. Next, we will discuss the importance of schemas and how schema registries assist developers in sharing schemas between producers and consumers.
Data serialization is the process of converting complex data structures, such as objects or data arrays, into a format that can be easily stored, transmitted, or reconstructed later. Typically, this format is a stream of bytes that can be written to a file or transmitted over a network.
The serialized data needs to be deserialized, or reconstructed back into its original form before it can be used again. Deserialization involves the reconstruction of the original data structure from the byte stream.
A serializer is a software component responsible for serialization, while a deserializer handles deserialization.
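To make this concrete, here is a minimal sketch of serialization and deserialization in Java using Jackson (2.12+ for record support); the Order record and its fields are hypothetical, chosen only for illustration.

import com.fasterxml.jackson.databind.ObjectMapper;

public class SerdeSketch {

    // A hypothetical data structure we want to store or transmit.
    public record Order(int orderId, int quantity) {}

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Serialization: object -> byte stream that can be written to a file
        // or sent over a network.
        byte[] bytes = mapper.writeValueAsBytes(new Order(42, 3));

        // Deserialization: byte stream -> the original data structure.
        Order restored = mapper.readValue(bytes, Order.class);
        System.out.println(restored); // Order[orderId=42, quantity=3]
    }
}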
Schemas - a mutual understanding between applications
We utilize various data formats to represent data, such as JSON, XML, Avro, Google Protocol Buffers (Protobuf), and YAML. Having a well-defined format is like having a common language that all developers can understand and use, facilitating collaboration and communication.
However, formats only define how data should be structured, not what the data should contain.
This is where schemas come into play.
A schema is like a blueprint of how the data should be constructed. It defines the type of data (e.g., integer, string, date), the order of data, and whether the data is mandatory or optional. By defining these aspects, schemas provide a much more robust and detailed definition of the data than formats alone.
In Kafka, a message is composed of a key and a value, and different serializers and deserializers (SerDes) can be specified for each. These SerDes are part of the language-specific SDK and support data formats including Apache Avro, JSON Schema, and Google's Protobuf. For example, a producer declares its key and value serializers through configuration, as in the sketch below.
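A minimal sketch of such a configuration, assuming Confluent's Avro serializer and a broker and registry running locally (both addresses are hypothetical details of a local setup):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class AvroProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        // Keys are plain strings; values are serialized with Avro.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // The Avro serializer looks up and registers schemas here.
        props.put("schema.registry.url", "http://localhost:8081");

        KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
        // ... build and send records, then close the producer.
        producer.close();
    }
}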
Schema registry for Kafka
A schema registry is a central repository where schemas are stored. It provides producers and consumers with APIs to register, discover, and retrieve schemas during data serialization and deserialization.
In a typical Kafka deployment, the schema registry is an independent application component. You need to deploy and manage it separately from the broker runtime. Kafka producer and consumer applications communicate with the schema registry using APIs exposed by the registry. The registry typically operates through HTTP(S) and listens on port 8081 by default for RESTful API calls.
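To illustrate, the registry can be called directly over HTTP. The sketch below registers a trivial schema and retrieves it by ID with Java's built-in HTTP client; the endpoints follow Confluent Schema Registry's REST API, while the subject name, schema, and ID are hypothetical:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegistryRestSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Register a schema under the subject "customer_orders-value".
        // The schema itself travels as an escaped string inside a JSON envelope.
        String body = "{\"schema\": \"{\\\"type\\\": \\\"string\\\"}\"}";
        HttpRequest register = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/subjects/customer_orders-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // The response carries the assigned schema ID, e.g. {"id":1}.
        System.out.println(client.send(register, HttpResponse.BodyHandlers.ofString()).body());

        // Retrieve a schema by its ID.
        HttpRequest fetch = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/schemas/ids/1"))
                .build();
        System.out.println(client.send(fetch, HttpResponse.BodyHandlers.ofString()).body());
    }
}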
Why do you need a schema registry?
When a producer sends a message, it serializes the message using a provided schema. The consumer requires this schema to deserialize the message. But how can the producer share the schema with the consumer?
The schema registry helps both producers and consumers by offering a common location to share schemas. Having a common repository for schemas eliminates the need to embed schemas in each message or share schemas manually, both of which can be inefficient or chaotic.
Schema registry information hierarchy
A schema registry maintains a hierarchy of information for keeping track of subjects, schemas, and their versions.
When a new schema is registered with the schema registry, it’s always associated with a subject, representing a unique namespace within the registry. Multiple versions of the same schema can be registered under the same subject, and a unique schema ID identifies each version. The subject name is used to organize schemas and ensure a unique identifier for each schema within the registry.
To put this into perspective, consider the following example:
Let's assume we have a Kafka topic called customer_orders, storing customer orders. The initial version of the schema (version 1) for the messages in this topic might look like this:
{
  "type": "record",
  "name": "CustomerOrder",
  "fields": [
    {"name": "order_id", "type": "int"},
    {"name": "customer_id", "type": "int"},
    {"name": "product_id", "type": "int"},
    {"name": "quantity", "type": "int"}
  ]
}
This schema is registered under the subject customer_orders-value, indicating that it describes the value portion of messages in the customer_orders topic. (Under the default subject-naming strategy, subjects are named after the topic with a -key or -value suffix.)
Now, suppose we want to add a timestamp field to our messages to track when each order was placed. We would create a new version of the schema (version 2), like this:
{
  "type": "record",
  "name": "CustomerOrder",
  "fields": [
    {"name": "order_id", "type": "int"},
    {"name": "customer_id", "type": "int"},
    {"name": "product_id", "type": "int"},
    {"name": "quantity", "type": "int"},
    {"name": "timestamp", "type": "long"}
  ]
}
This new schema would also be registered under the subject customer_orders-value. Producers would start using the new schema to serialize messages, and consumers would use it to deserialize messages.
Meanwhile, the old schema (version 1) would also remain registered under the same subject, allowing consumers to continue processing older messages that were serialized with the old schema.
This is how a subject can have multiple versions of a schema over time as the schema evolves.
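The versions under a subject can also be inspected programmatically. The sketch below assumes Confluent's kafka-schema-registry-client library, a registry at localhost:8081, and the hypothetical subject from our example:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;

public class SubjectVersionsSketch {
    public static void main(String[] args) throws Exception {
        // A client that caches schemas locally (capacity 100).
        CachedSchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // All versions registered under the subject, e.g. [1, 2].
        System.out.println(client.getAllVersions("customer_orders-value"));

        // Metadata for the latest version: version number and schema text.
        SchemaMetadata latest = client.getLatestSchemaMetadata("customer_orders-value");
        System.out.println(latest.getVersion() + " -> " + latest.getSchema());
    }
}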
How does the schema registry work?
When a Kafka producer wants to send a message, it passes the message to the appropriate key/value serializer. The serializer then determines which schema version to use for serialization.
To do this, the serializer first checks its local schema cache for the schema ID of the given subject. If the schema ID isn't in the cache, the serializer registers the schema with the schema registry and receives the resulting schema ID in the response.
In either case, the serializer now has the schema ID, and it prepends a small header to the beginning of the message, containing:
- the magic byte (currently always 0), which identifies the wire format version
- the 4-byte schema ID assigned by the registry
This applies equally to the key and value of the message.
Finally, with the schema in hand, the serializer serializes the message and returns the byte sequence to the producer. The producer then publishes this byte sequence to the broker.
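A sketch of that framing step, assuming Confluent's wire format (one magic byte, then a 4-byte big-endian schema ID, then the serialized payload):

import java.nio.ByteBuffer;

public class WireFormatSketch {
    // Prefix a serialized payload with the magic byte and schema ID,
    // mirroring how the serializer frames each key or value.
    static byte[] frame(int schemaId, byte[] serializedPayload) {
        return ByteBuffer.allocate(1 + 4 + serializedPayload.length)
                .put((byte) 0)           // magic byte: wire format version 0
                .putInt(schemaId)        // 4-byte schema ID (big-endian)
                .put(serializedPayload)  // the Avro/JSON/Protobuf-encoded data
                .array();
    }

    public static void main(String[] args) {
        byte[] framed = frame(1, new byte[] {42});
        System.out.println(framed.length); // 6 = 1 magic + 4 ID + 1 payload
    }
}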
On the other hand, when a consumer receives a message, the message is passed to the deserializer. The deserializer first checks for the magic byte and rejects the message if it is absent.
The deserializer then reads the schema ID and checks whether the corresponding schema exists in its local cache. If it does, deserialization proceeds with that schema; otherwise, the deserializer retrieves the schema from the registry by its schema ID. Once the schema is in place, the deserializer completes the deserialization.
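On the consumer side, the mirror-image configuration might look like this sketch, again assuming Confluent's Avro deserializer and a local broker and registry; the group ID and topic are the hypothetical ones from our example:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AvroConsumerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");

        // Keys are plain strings; values are deserialized with Avro,
        // using the schema ID embedded in each message.
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaConsumer<String, Object> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer_orders"));
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(record -> System.out.println(record.value()));
        }
    }
}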
The schema registry thus enables the serializer to register a schema and embed its schema ID into each message; the deserializer then uses that schema ID to retrieve the schema from the registry during deserialization. This approach eliminates the need to embed full schemas in each message or share schemas manually, both of which can be inefficient or chaotic when schemas evolve.
In addition, a schema registry supports schema evolution, allowing multiple versions of a schema to exist simultaneously. This is crucial in situations where the data structure changes over time but older versions of the data still need to be processed correctly.
Furthermore, a schema registry enhances data quality and consistency across different applications by ensuring that all producers and consumers adhere to the same schema. This minimizes the risk of data loss or corruption due to schema mismatch.
In conclusion, the schema registry is an indispensable tool in the Kafka ecosystem, ensuring data consistency, supporting schema evolution, and improving overall data quality.