Leveraging Schema Registry to Ensure Data Compatibility in Kafka

Introduction

Schema Registry plays a critical role in data serialization and deserialization within distributed systems like Apache Kafka. In environments where structured data formats such as Avro, JSON, or Protobuf are used, the Schema Registry helps manage and enforce data structure (schema) consistency across producers and consumers in Kafka topics. This article covers the key aspects of Schema Registry, how it works, its benefits, and practical use cases.

What is Schema Registry?

A Schema Registry is a centralized service that maintains a versioned history of schemas used by Kafka producers and consumers. Its primary role is to ensure that the data exchanged between systems adheres to a predefined format, preventing data compatibility issues that might arise due to schema evolution.

In the context of Kafka, Schema Registry stores schemas for message serialization formats like Avro, JSON Schema, or Protobuf. Each schema is associated with a unique subject, typically tied to a Kafka topic, ensuring that both producers and consumers use the correct schema version.

Key Features of Schema Registry

  • Schema Versioning: Maintains a history of schema changes, allowing systems to evolve schemas over time.
  • Compatibility Checking: Ensures that schema changes are backward or forward-compatible to prevent system breakage.
  • Centralized Storage: Provides a central repository for schema management, ensuring consistency across distributed systems.
  • Integration with Kafka: Tight integration with Kafka topics to streamline schema management for producers and consumers.

Why Use Schema Registry?

At its core, Kafka handles data as raw bytes and performs no data verification at the cluster level. Kafka itself is unaware of the nature of the data being transmitted, whether it is a string, an integer, or any other type.

Kafka Cluster

Because Kafka is designed with a decoupled architecture, producers and consumers do not interact directly; instead, data transfer occurs through Kafka topics. However, consumers still need to understand the type of data being produced in order to properly deserialize it. If a producer starts sending corrupted data or if the data type changes unexpectedly, downstream consumers could fail. This highlights the need for a common, agreed-upon data format.

This is where Schema Registry becomes essential. It runs as a separate service, independent of the Kafka cluster, and distributes schemas to producers and consumers, which keep a local copy of each schema they use in their caches to avoid repeated lookups.

Schema Registry

With the Schema Registry in place, the producer first checks with the registry before sending data to Kafka to ensure the required schema is available. If the schema isn't registered yet, the producer registers it and the registry stores and caches it. Once the schema is resolved, the producer serializes the data using that schema and sends it to Kafka in binary format, along with a unique schema ID. When the consumer processes the message, it uses the schema ID to retrieve the same schema from the Schema Registry and deserialize the data. If there is a schema mismatch, an error is raised, informing the producer that the schema agreement has been violated.
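Concretely, in Confluent's implementation the serializers prefix each binary payload with a small header that carries the schema ID (the so-called wire format). The sketch below is not the library code itself, just a simplified illustration of how that framing can be read back:

// Messages produced with Schema Registry-aware serializers are framed as:
//   byte 0       : magic byte (always 0)
//   bytes 1-4    : schema ID as a big-endian 32-bit integer
//   bytes 5..end : the serialized payload itself
static int ReadSchemaId(byte[] messageValue)
{
    if (messageValue.Length < 5 || messageValue[0] != 0)
        throw new InvalidOperationException("Not a Schema Registry framed message.");

    var idBytes = new byte[4];
    Array.Copy(messageValue, 1, idBytes, 0, 4);
    if (BitConverter.IsLittleEndian)
        Array.Reverse(idBytes); // the ID is stored big-endian on the wire
    return BitConverter.ToInt32(idBytes, 0);
}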

In distributed systems, data structures often change over time. Without a system to manage schema evolution, these changes can lead to significant compatibility problems. Here are some key reasons to use the Schema Registry:

  1. Data Integrity: Guarantees that the data produced by applications conforms to the defined schema, minimizing the risk of data corruption.
  2. Backward and Forward Compatibility: Facilitates schema evolution by enforcing compatibility rules, enabling consumers that use a newer schema to read data produced with older schemas (backward compatibility) and allowing consumers still on an older schema to process data produced with newer schemas (forward compatibility).
  3. Reduced Overhead: By leveraging a centralized registry, producers don't need to include the entire schema with each message, resulting in smaller message sizes.
  4. Consistency Across Consumers: Ensures that multiple consumers reading from the same topic are using the correct schema to decode the messages.

Schema Registry Components

  • Schemas: These are defined data structures that specify the format of messages. Avro is the most widely used schema format in Kafka, favored for its efficient binary encoding.
  • Subjects: A subject is a category under which a schema is registered in the Schema Registry. Subjects typically map to Kafka topics, such as topic-name-key or topic-name-value.
  • Versioning: When a schema undergoes changes, a new version is created. The Schema Registry manages the different schema versions for each subject and ensures they remain compatible.
  • Compatibility Modes (a sketch of changing a subject's mode follows this list):
    • Backward: The new schema can read data produced with the old schema.
    • Forward: The old schema can read data produced with the new schema.
    • Full: The new schema is both backward and forward compatible with the previous schema.
    • None: No compatibility checks are applied.
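The compatibility mode can be set globally or per subject through the registry's REST API. A minimal sketch in C#, assuming a registry at localhost:8081 and an illustrative subject named vehicle-topic-value:

// PUT /config/{subject} changes the compatibility mode for that subject only;
// PUT /config (without a subject) changes the global default.
using var http = new HttpClient();
var body = new StringContent(
    "{ \"compatibility\": \"FULL\" }",
    Encoding.UTF8,
    "application/vnd.schemaregistry.v1+json");

var response = await http.PutAsync("http://localhost:8081/config/vehicle-topic-value", body);
response.EnsureSuccessStatusCode();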

Data Serialization Formats

Now that we understand how the Schema Registry works, it's important to consider which data serialization format is best suited for use with it. When selecting a serialization format, there are a few key factors to keep in mind:

  • Whether the serialization format is binary.
  • The ability to use schemas to enforce strict data structures.

Comparison of data formats: Avro, JSON, and Protobuf


Avro

Avro, developed by Apache, is a widely used data format in Kafka environments, offering several advantages:

  • It enables the definition of precise schemas for data, ensuring thorough validation and strong backward and forward compatibility.
  • Its compact and efficient structure is well-suited for environments with limited bandwidth.
  • Avro supports both primitive types (such as int, boolean, string, float, etc.) and complex types (including enums, arrays, maps, unions, etc.).

However, there are some drawbacks. Avro can be more difficult to implement and use, especially for developers who are not familiar with its format.

Here’s an example of an Avro schema file:

{
  "type" : "record",
  "name" : "User",
  "namespace" : "com.example.models.avro",
  "fields" : [ 
   {"name" : "userID", "type" : "string", "doc" : "User ID of a web app"}, 
   {"name" : "customerName", "type" : "string", "doc" : "Customer Name", "default": "Test User"} 
  ]
}

JSON

JSON is a popular format for web and mobile applications due to its simplicity and readability. Some of its key advantages include:

  • It is highly readable, making it ideal for situations where human interpretation is important.
  • It enjoys broad support across a variety of tools and systems.
  • Its simplicity and lightweight structure make it a good choice for applications where ease of use is a priority.

However, JSON has some limitations. It is less efficient for transmitting large volumes of data, as it uses more bandwidth compared to more compact formats. Additionally, JSON lacks native support for complex data types, which can limit its suitability for applications that require intricate data structures.

Here’s an example of a JSON message (a plain JSON document rather than a formal JSON Schema definition):

{
   "id": "invoice_01072022_0982098",
   "description": "Accommodation invoice",
   "url": "https://acmecompany/invoices/invoice_01072022_0982098.pdf",
   "customer": "Franz Kafka",
   "company": "The Castle",
   "total": 120.99
}

Protobuf

Protobuf, developed by Google, is commonly used in distributed systems and offers several advantages:

  • Its compact and efficient format makes it well-suited for scenarios with limited bandwidth.
  • Protobuf supports complex data types, which is beneficial for applications requiring sophisticated data structures.
  • It is relatively easy to implement across different programming languages.

However, Protobuf has some drawbacks. It can be more challenging to read and understand compared to other formats, especially for developers who are new to it. Additionally, Protobuf does not natively support schema validation, meaning that tools like a schema registry may be needed to ensure data consistency.

Each data format comes with its own strengths and trade-offs:

  • Avro is best suited for scenarios where complex data structures and strong backward and forward compatibility are essential.
  • JSON excels in situations that prioritize simplicity, ease of use, and human readability.
  • Protobuf is ideal for cases where bandwidth efficiency and support for complex data structures are critical.

Schema Evolution

As we add new fields or update existing ones, it's important that our evolving schemas allow downstream consumers to process messages smoothly, without triggering production alerts at inconvenient times. The Schema Registry is designed to handle data evolution by versioning each schema change.

When a schema is first created, it is assigned a unique schema ID and version number. Over time, as the schema evolves and new changes are made, a new schema ID is generated, and the version number is incremented, provided the changes are compatible. To check compatibility, we can use a Maven plugin (in Java) or simply make a REST call to compare the local schema with the one stored in the Schema Registry.
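From .NET, a comparable check can be made with the Schema Registry client before deploying a new version. A minimal sketch, assuming the Confluent.SchemaRegistry package, a registry at localhost:8081, and an updated Vehicle.avsc on disk (subject and file names are illustrative):

var schemaRegistry = new CachedSchemaRegistryClient(
    new SchemaRegistryConfig { Url = "http://localhost:8081" });

// Compare the local candidate schema against the latest registered version of the subject.
var candidate = new Schema(File.ReadAllText("Vehicle.avsc"), SchemaType.Avro);
bool compatible = await schemaRegistry.IsCompatibleAsync("vehicle-topic-value", candidate);

Console.WriteLine(compatible
    ? "Safe to register the new schema version."
    : "The change violates the subject's compatibility mode.");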

There are several patterns for schema evolution:

Forward Compatibility: Update the producer to use version 2 (V2) of the schema, and gradually update the consumers to also use V2.

Message queue

Backward Compatibility: Update all consumers to the V2 version of the schema first, and then upgrade the producer to use the V2 version.

Backward Compatibility

Full Compatibility: The schemas are compatible in both directions, meaning they are both forward and backward compatible, so producers and consumers can be upgraded in either order.
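For example, adding a new field with a default value to the earlier User schema is a fully compatible change: consumers on the old schema simply ignore the new field, while consumers on the new schema fall back to the default when reading old data. A sketch of such a version 2 (the country field is illustrative):

{
  "type" : "record",
  "name" : "User",
  "namespace" : "com.example.models.avro",
  "fields" : [
   {"name" : "userID", "type" : "string", "doc" : "User ID of a web app"},
   {"name" : "customerName", "type" : "string", "doc" : "Customer Name", "default": "Test User"},
   {"name" : "country", "type" : "string", "doc" : "Customer country", "default": "unknown"}
  ]
}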

How Schema Registry Works

Producer Workflow

  1. Before sending a message, the producer first submits the schema to the Schema Registry.
  2. The registry validates the schema and assigns it a unique schema ID.
  3. Instead of including the full schema, the producer appends the schema ID to each Kafka message, optimizing bandwidth usage.

Consumer Workflow

  1. When a consumer reads a message from a Kafka topic, it extracts the schema ID from the message.
  2. The consumer then queries the Schema Registry to retrieve the corresponding schema.
  3. Using the schema, the consumer deserializes the message and processes the data.

By centralizing schema management in the Schema Registry, producers and consumers only need to work with schema IDs, relying on the registry to handle the schema details.
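The same round trip can be seen directly with the .NET client. A minimal sketch, assuming the Confluent.SchemaRegistry package, a registry at localhost:8081, and illustrative subject and file names:

using var schemaRegistry = new CachedSchemaRegistryClient(
    new SchemaRegistryConfig { Url = "http://localhost:8081" });

// Producer side: register the schema (or look it up if it already exists) and get its ID.
var schema = new Schema(File.ReadAllText("Vehicle.avsc"), SchemaType.Avro);
int schemaId = await schemaRegistry.RegisterSchemaAsync("vehicle-topic-value", schema);

// Consumer side: resolve the ID found in the message back into the full schema.
Schema resolved = await schemaRegistry.GetSchemaAsync(schemaId);
Console.WriteLine($"Schema {schemaId}: {resolved.SchemaString}");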

Producer

How can it be implemented with C# and .NET?

JSON

For this, you need to add the following NuGet package:

dotnet add package Confluent.SchemaRegistry.Serdes.Json --version 2.1.1

Additionally, we need to implement both a Kafka consumer and producer. For the producer, we can set it up as follows:

var producer = new ProducerBuilder<string, T>(producerConfig)
                   .SetValueSerializer(new JsonSerializer<T>(schemaRegistry, 
                                                              jsonSerializerConfig))
                   .Build();
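The producerConfig and jsonSerializerConfig objects referenced above aren't shown in the snippet. A minimal sketch of what they might look like, together with a produce call (the broker address, topic name, and a hypothetical Vehicle type standing in for T are assumptions):

var producerConfig = new ProducerConfig { BootstrapServers = "localhost:9092" };
var jsonSerializerConfig = new JsonSerializerConfig { AutoRegisterSchemas = true };

// Publish a message; the serializer registers/resolves the schema and embeds its ID.
await producer.ProduceAsync("vehicle-topic",
    new Message<string, Vehicle> { Key = "ABC-123", Value = vehicle });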

The following code would correspond to the Kafka consumer:

var consumer =  new ConsumerBuilder<string, T>(_consumerConfig)
                .SetKeyDeserializer(Deserializers.Utf8)
                .SetValueDeserializer(new JsonDeserializer<T>()
                                          .AsSyncOverAsync())
                .SetErrorHandler((_, e) 
                        => Console.WriteLine($"Error: {e.Reason}"))
                .Build();
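To actually read messages, the consumer subscribes to the topic and polls in a loop. A bare-bones sketch (the topic name is an assumption):

consumer.Subscribe("vehicle-topic");

while (true)
{
    // Consume blocks until a message arrives or the timeout expires (returns null on timeout).
    var result = consumer.Consume(TimeSpan.FromSeconds(1));
    if (result == null) continue;

    Console.WriteLine($"Key: {result.Message.Key}, Value: {result.Message.Value}");
}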

In both cases, we must remember to register the Schema Registry URL.

var schemaRegistry = new CachedSchemaRegistryClient(schemaRegistryConfig);
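The schemaRegistryConfig passed to the client is just the registry's address (the URL below is an assumption for a local setup):

var schemaRegistryConfig = new SchemaRegistryConfig
{
    Url = "http://localhost:8081"
};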

When we execute the code and send some messages, we will be able to see the schema generated for the configured topic in the Confluent Control Center.

Version1- JSON Schema

Protobuf

When using Protobuf, the process is slightly more complex. First, we need to add the required NuGet packages: 

dotnet add package Confluent.SchemaRegistry.Serdes.Protobuf --version 2.1.1
dotnet add package Grpc.Tools --version 2.54.0

Adding the Grpc.Tools package provides the tooling needed to work with .proto files and compile them into C# classes at build time, so we can use them from our code:

syntax = "proto3";

message vehicle
{
    string Registration = 1;
    int32 Speed = 2;
    string Coordinates = 3;
}

To enable the library to convert our .proto file into a .cs file, we need to add the following line in the .csproj file:

<ItemGroup>   
    <Protobuf Include="proto\vehicle.proto" />
</ItemGroup>

When the project is compiled, the tooling generates the .cs file in the project's obj folder, as shown in the following image.

File Structure

Next, let's take a look at how the code changes for the producer:

var producer = new ProducerBuilder<string, T>(producerConfig)
                   .SetValueSerializer(new ProtobufSerializer<T>(schemaRegistry))
                   .Build();

Similarly, here’s how the code changes for the consumer:

var consumer = new ConsumerBuilder<string, T>(consumerConfig)
                     .SetValueDeserializer(new ProtobufDeserializer<T>()
                                               .AsSyncOverAsync())
                     .SetErrorHandler((_, e) 
                          => Console.WriteLine($"Error: {e.Reason}"))
                     .Build();
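With the Vehicle class generated from vehicle.proto, producing a message might look like the following sketch (the topic name and field values are illustrative):

var vehicle = new Vehicle
{
    Registration = "ABC-123",
    Speed = 88,
    Coordinates = "40.4168,-3.7038"
};

await producer.ProduceAsync("vehicle-topic",
    new Message<string, Vehicle> { Key = vehicle.Registration, Value = vehicle });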

If we return to the Control Center, we can see that the schema is registered with a Protobuf format.


Avro

To work with Avro, we need to include the following NuGet package:

dotnet add package Confluent.SchemaRegistry.Serdes.Avro --version 2.1.1

We need to add an Avro schema, along with a generator, to work with it from C#.

{
    "namespace": "SchemaRegistryExamples.Avro",
    "name": "Vehicle",
    "type": "record",
    "fields": [
        {
            "name": "registration",
            "type": "string"
        },
        {
            "name": "speed",
            "type": "int"
        },
        {
            "name": "coordinates",
            "type": "string"
        }        
    ]
}

In our case, we need to install the avrogen tool if it isn't already installed, and then execute the following commands:

# To install the tool
dotnet tool install --global Apache.Avro.Tools

# To convert the Avro schema to a C# class
avrogen -s Vehicle.avsc . --namespace "SchemaRegistryExamples.Avro:AvroConsole.Entity"

For the producer, the setup is very similar to the others:

var producer =   new ProducerBuilder<string, T>(producerConfig)
                    .SetValueSerializer(new AvroSerializer<T>(schemaRegistry))
                    .Build();
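The Avro serializer also accepts an optional AvroSerializerConfig, which is useful when you want explicit control over schema registration (the values below are illustrative assumptions):

var avroSerializerConfig = new AvroSerializerConfig
{
    // Fail fast if the schema has not been registered already,
    // instead of letting the producer register it automatically.
    AutoRegisterSchemas = false,
    BufferBytes = 100
};

var producer = new ProducerBuilder<string, T>(producerConfig)
                   .SetValueSerializer(new AvroSerializer<T>(schemaRegistry, avroSerializerConfig))
                   .Build();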

Similarly, for the consumer:

var consumer = new ConsumerBuilder<string, T>(consumerConfig)
                        .SetValueDeserializer(new AvroDeserializer<T>(schemaRegistry)
                                                  .AsSyncOverAsync())
                        .SetErrorHandler((_, e) 
                                => Console.WriteLine($"Error: {e.Reason}"))
                        .Build();

As with the other formats, we must remember to configure the Schema Registry client with the registry URL.

Finally, we return to the Control Center and observe how the schema is registered in our topic using the Avro format.

Schema ID

Conclusion

In general, the choice of the appropriate data format depends on the specific use case and system requirements. Avro is ideal for systems that require both forward and backward compatibility, JSON is suitable for cases where simplicity and human readability are prioritized, and Protobuf is best for systems that need to handle large volumes of data efficiently.

While Schema Registry is a simple concept, it plays a critical role in enforcing data governance within your Kafka architecture. Schemas live in the registry rather than traveling with your data; each Kafka message carries only a schema ID, which makes the registry a vital component of your infrastructure. If the Schema Registry becomes unavailable, it can disrupt both producers and consumers, so ensuring high availability of your Schema Registry is a best practice.

