This article shows an automated way of reading Avro data in .NET applications: reading the Avro schema, generating C# models, and deserializing the data. It also contains practical examples of usage: a JSON vs. Avro benchmark and an Azure Blob Storage client.
What Avro is
Avro is a data format highly popular in the Big Data world, with growing popularity in the .NET area as well. It has two key features:
- Compression - the data is efficiently encoded and serialized to a compact byte representation. Applying an additional compression algorithm can reduce the size even further.
- Clear model - the model of the data is delivered alongside the data itself, but in a different format: the schema is represented as well-known, human-readable JSON. This enables backward and forward compatibility, which is unique for this level of data compression.
The benefit of compression is most valuable when working with large collections of data. For examples and numbers, take a look at the benchmark section at the end of this article.
Read schema from Avro file
Moving to the main topic: our goal is to handle unknown Avro files that we are going to process in the near future. The first step is to read the schema (model) of the file.
We have multiple options. The easiest is to open the file in a text editor, copy the header, and extract the schema from it manually. But I would like to show you how to do this in an automated way.
The library that helps with handling Avro files is called AvroConvert (link). Its interface is very similar to the Newtonsoft.Json library, which makes it very easy to use.
var avroBytes = File.ReadAllBytes("sample.avro");
var schema = AvroConvert.GetSchema(avroBytes);
That's it. The extracted schema looks as follows:
{
    "type": "record",
    "name": "User",
    "namespace": "GrandeBenchmark",
    "fields": [
        { "name": "Id", "type": "int" },
        { "name": "IsActive", "type": "boolean" },
        { "name": "Name", "type": "string" },
        { "name": "Age", "type": "int" },
        {
            "name": "Contact",
            "type": {
                "type": "record",
                "name": "Contact",
                "namespace": "GrandeBenchmark",
                "fields": [
                    { "name": "Id", "type": "int" },
                    { "name": "Address", "type": "string" },
                    { "name": "HouseNumber", "type": "long" },
                    { "name": "City", "type": "string" },
                    { "name": "PostCode", "type": "string" }
                ]
            }
        },
        {
            "name": "Offerings",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Offering",
                    "namespace": "GrandeBenchmark",
                    "fields": [
                        { "name": "Id", "type": "int" },
                        { "name": "ProductNumber", "type": { "type": "string", "logicalType": "uuid" } },
                        { "name": "Price", "type": "int" },
                        { "name": "Currency", "type": "string" },
                        { "name": "Discount", "type": "boolean" }
                    ]
                }
            }
        }
    ]
}
Create C# model
The schema above comes from the User class used in the benchmark sample. What if it were more complex and contained a significant number of properties and fields? Again, creating the C# model can be done in various ways. The simplest is to write the classes manually. Another, more convenient way is to use an automated tool again. AvroConvert provides a GenerateModel feature, which is also exposed online.
The website https://avroconvertonline.azurewebsites.net/ offers Avro model generation on the fly.
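For a fully automated pipeline, the same can be done in code. Below is a minimal sketch; it assumes the GenerateModel overload that accepts the raw Avro file bytes and returns the generated C# classes as a string (the SolTechnology.Avro namespace and the exact overloads should be verified against the library's README):
using System.IO;
using SolTechnology.Avro;

// Read the Avro file and let AvroConvert emit matching C# classes as text
var avroBytes = File.ReadAllBytes("sample.avro");
string csharpModel = AvroConvert.GenerateModel(avroBytes);

// Save the generated classes so they can be added to the project
File.WriteAllText("User.generated.cs", csharpModel);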
Read data
Once we know the structure of the data and have it modeled in our code, we are very close to the finish. The final step is to read the file itself. It's as simple as this:
var avroBytes = File.ReadAllBytes("sample.avro");
var result = AvroConvert.Deserialize<List<User>>(avroBytes);
The result is a list of Users ready to process.
Real-life example - Azure Blob Storage client
Our task is done, but let's take a look at a real-life example. Why do we even bother implementing Avro serialization in our services? One possible scenario is that we are charged for the amount of data in our storage. This is especially true for external hosts like Microsoft Azure and its document databases. Let's minimize the amount of data (and cost) of Blob Storage.
The example below shows an Azure blob client that reduces response time, data size, and the cost of the project. The writer and reader are implemented as BlobContainerClient extension methods:
Initialize blob container
BlobContainerClient blobContainerClient = new BlobContainerClient("yourConnectionString", "containerName");
blobContainerClient.CreateIfNotExists();
Blob writer
public static void WriteItemToBlob(this BlobContainerClient client, string blobName, object content)
{
    var blob = client.GetBlobClient(blobName);

    // Serialize the object to compact Avro bytes before uploading
    var serializedContent = AvroConvert.Serialize(content);

    // Replace any existing blob with the fresh content
    blob.DeleteIfExists(DeleteSnapshotsOption.IncludeSnapshots);
    blob.Upload(new BinaryData(serializedContent));
}
Blob reader
public static T ReadItemFromBlob<T>(this BlobContainerClient client, string blobName)
{
    var blob = client.GetBlobClient(blobName);

    // Download the Avro bytes and deserialize them back into the requested type
    var content = blob.DownloadContent();
    var result = AvroConvert.Deserialize<T>(content.Value.Content.ToArray());

    return result;
}
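A short usage sketch of the two extensions, assuming the container client initialized above and the User model defined later in the benchmark section (the blob name is illustrative):
var user = new User { Id = 1, Name = "John", IsActive = true, Age = 30 };

// Write the object as compact Avro and read it back
blobContainerClient.WriteItemToBlob("users/1.avro", user);
var restored = blobContainerClient.ReadItemFromBlob<User>("users/1.avro");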
Bonus: Avro and JSON compression benchmark
Benchmark model
public class User
{
    public int Id { get; set; }
    public bool IsActive { get; set; }
    public string Name { get; set; }
    public int Age { get; set; }
    public Contact Contact { get; set; }
    public List<Offering> Offerings { get; set; }
}

public class Offering
{
    public int Id { get; set; }
    public Guid ProductNumber { get; set; }
    public int Price { get; set; }
    public string Currency { get; set; }
    public bool Discount { get; set; }
}

public class Contact
{
    public int Id { get; set; }
    public string Address { get; set; }
    public long HouseNumber { get; set; }
    public string City { get; set; }
    public string PostCode { get; set; }
}
Example
// Test data generated with AutoFixture
var data = new Fixture().CreateMany<User>(N).ToList();
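The article does not reproduce the full benchmark code, so the block below is only a sketch of how the measured variants could be set up with BenchmarkDotNet. It assumes Newtonsoft.Json for the JSON variants, System.IO.Compression.BrotliStream for JSON Brotli, and the CodecType argument of AvroConvert for Avro Brotli; only the serialization path is shown, while the timings reported later cover serialization and deserialization:
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using AutoFixture;
using BenchmarkDotNet.Attributes;
using Newtonsoft.Json;
using SolTechnology.Avro;

public class SerializationBenchmark
{
    private List<User> data;

    [GlobalSetup]
    public void Setup()
    {
        // 1000 randomly populated User records, as in the results table below
        data = new Fixture().CreateMany<User>(1000).ToList();
    }

    [Benchmark]
    public byte[] Json_Default() =>
        Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(data));

    [Benchmark]
    public byte[] Json_Brotli()
    {
        // Compress the JSON payload with Brotli after serialization
        var json = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(data));
        using var output = new MemoryStream();
        using (var brotli = new BrotliStream(output, CompressionLevel.Optimal))
        {
            brotli.Write(json, 0, json.Length);
        }
        return output.ToArray();
    }

    [Benchmark]
    public byte[] Avro_Default() => AvroConvert.Serialize(data);

    [Benchmark]
    public byte[] Avro_Brotli() => AvroConvert.Serialize(data, CodecType.Brotli);
}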
Size of the data by number of records and serialization method [kB]:
| Records | JSON | JSON Gzip | JSON Brotli | Avro | Avro Gzip | Avro Brotli |
|---------|------|-----------|-------------|------|-----------|-------------|
| 1 | 0.74 | 0.44 | 0.41 | 1.26 | 1.15 | 1.13 |
| 10 | 7.40 | 2.89 | 2.58 | 5.13 | 3.39 | 3.29 |
| 100 | 75.53 | 28.84 | 25.32 | 44.58 | 27.12 | 25.20 |
| 1000 | 759.94 | 286.56 | 262.59 | 440.76 | 261.87 | 245.29 |
| 10000 | 7920.17 | 3081.28 | 2654.37 | 4567.44 | 2800.21 | 2609.99 |
| 100000 | 80591.47 | 31417.43 | 27344.50 | 46130.41 | 28821.31 | 26625.07 |
| 1000000 | 807294.01 | 314172.52 | 274262.06 | 461301.09 | 289041.60 | 266187.34 |
The results are easier to interpret on a chart presenting the data from the table relative to the plain JSON size.
Conclusion
- Avro does not bring benefits when the dataset is a single item or a small collection (<10 records).
- The more items in the collection, the bigger the benefit Avro serialization brings.
- Applying an additional compression algorithm brings even more benefits; this is true for both GZip and Brotli. On the other hand, the GZip and Brotli results are very similar for JSON and Avro. So, what's the benefit? Avro's built-in compression mechanism.
Results of BenchmarkDotNet for 1000 User records:
| Method | Average time | File size |
|--------|--------------|-----------|
| Json_Default | 21.51 ms | 760 kB |
| Json_Brotli | 65.96 ms | 262 kB |
| Avro_Default | 14.15 ms | 440 kB |
| Avro_Brotli | 45.14 ms | 245 kB |
The table shows serialization and deserialization time. Avro serialization is faster than JSON in general, and applying a compression algorithm emphasizes this result: Avro Brotli is about 32% faster than JSON Brotli in this case. Avro data can be serialized with one of several compression codecs just by passing an additional argument (see the sketch after the list below). Supported algorithms:
- Deflate
- Snappy
- GZip
- Brotli
References
- http://avro.apache.org/
- https://cwiki.apache.org/confluence/display/AVRO/Index
- https://github.com/AdrianStrugala/AvroConvert
- https://www.c-sharpcorner.com/Blogs/avro-rest-api-as-the-evolution-of-json-based-communication-between-mic