I can't find a way to deserialize an Apache Avro file with C#. The Avro file is a file generated by the Archive feature in Microsoft Azure Event Hubs.
With Java I can use Avro Tools from Apache to convert the file to JSON:
java -jar avro-tools-1.8.1.jar tojson --pretty inputfile > output.json
Using NuGet package Microsoft.Hadoop.Avro I am able to extract SequenceNumber
, Offset
and EnqueuedTimeUtc
, but since I don't know what type to use for Body
an exception is thrown. I've tried with Dictionary<string, object>
and other types.
static void Main(string[] args)
{
var fileName = "...";
using (Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
{
using (var reader = AvroContainer.CreateReader<EventData>(stream))
{
using (var streamReader = new SequentialReader<EventData>(reader))
{
var record = streamReader.Objects.FirstOrDefault();
}
}
}
}
[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
public class EventData
{
[DataMember(Name = "SequenceNumber")]
public long SequenceNumber { get; set; }
[DataMember(Name = "Offset")]
public string Offset { get; set; }
[DataMember(Name = "EnqueuedTimeUtc")]
public string EnqueuedTimeUtc { get; set; }
[DataMember(Name = "Body")]
public foo Body { get; set; }
// More properties...
}
The schema looks like this:
{
"type": "record",
"name": "EventData",
"namespace": "Microsoft.ServiceBus.Messaging",
"fields": [
{
"name": "SequenceNumber",
"type": "long"
},
{
"name": "Offset",
"type": "string"
},
{
"name": "EnqueuedTimeUtc",
"type": "string"
},
{
"name": "SystemProperties",
"type": {
"type": "map",
"values": [ "long", "double", "string", "bytes" ]
}
},
{
"name": "Properties",
"type": {
"type": "map",
"values": [ "long", "double", "string", "bytes" ]
}
},
{
"name": "Body",
"type": [ "null", "bytes" ]
}
]
}
I was able to get full data access working using dynamic
. Here's the code for accessing the raw body
data, which is stored as an array of bytes. In my case, those bytes contain UTF8-encoded JSON, but of course it depends on how you initially created your EventData
instances that you published to the Event Hub:
using (var reader = AvroContainer.CreateGenericReader(stream))
{
while (reader.MoveNext())
{
foreach (dynamic record in reader.Current.Objects)
{
var sequenceNumber = record.SequenceNumber;
var bodyText = Encoding.UTF8.GetString(record.Body);
Console.WriteLine($"{sequenceNumber}: {bodyText}");
}
}
}
If someone can post a statically-typed solution, I'll upvote it, but given that the bigger latency in any system will almost certainly be the connection to the Event Hub Archive blobs, I wouldn't worry about parsing performance. :)
This Gist shows how to deserialize an event hub capture with C# using Microsoft.Hadoop.Avro2, which has the advantage of being both .NET Framework 4.5 and .NET Standard 1.6 compliant:
var connectionString = "<Azure event hub capture storage account connection string>";
var containerName = "<Azure event hub capture container name>";
var blobName = "<Azure event hub capture BLOB name (ends in .avro)>";
var storageAccount = CloudStorageAccount.Parse(connectionString);
var blobClient = storageAccount.CreateCloudBlobClient();
var container = blobClient.GetContainerReference(containerName);
var blob = container.GetBlockBlobReference(blobName);
using (var stream = blob.OpenRead())
using (var reader = AvroContainer.CreateGenericReader(stream))
while (reader.MoveNext())
foreach (dynamic result in reader.Current.Objects)
{
var record = new AvroEventData(result);
record.Dump();
}
public struct AvroEventData
{
public AvroEventData(dynamic record)
{
SequenceNumber = (long) record.SequenceNumber;
Offset = (string) record.Offset;
DateTime.TryParse((string) record.EnqueuedTimeUtc, out var enqueuedTimeUtc);
EnqueuedTimeUtc = enqueuedTimeUtc;
SystemProperties = (Dictionary<string, object>) record.SystemProperties;
Properties = (Dictionary<string, object>) record.Properties;
Body = (byte[]) record.Body;
}
public long SequenceNumber { get; set; }
public string Offset { get; set; }
public DateTime EnqueuedTimeUtc { get; set; }
public Dictionary<string, object> SystemProperties { get; set; }
public Dictionary<string, object> Properties { get; set; }
public byte[] Body { get; set; }
}
NuGet references:
- Microsoft.Hadoop.Avro2 (1.2.1 works)
- WindowsAzure.Storage (8.3.0 works)
Namespaces:
- Microsoft.Hadoop.Avro.Container
- Microsoft.WindowsAzure.Storage
I was finally able to get this to work with the Apache C# library / framework.
I was stuck for a while because the Capture feature of the Azure Event Hubs sometimes outputs a file without any message content.
I may have also had an issue with how the messages were originally serialized into the EventData object.
The code below was for a file saved to disk from a capture blob container.
var dataFileReader = DataFileReader<EventData>.OpenReader(file);
foreach (var record in dataFileReader.NextEntries)
{
// Do work on EventData object
}
This also works using the GenericRecord object.
var dataFileReader = DataFileReader<GenericRecord>.OpenReader(file);
This took some effort to figure out. However I now agree this Azure Event Hubs Capture feature is a great feature to backup all events. I still feel they should make the format optional like they did with Stream Analytic job output but maybe I will get used to Avro.
Your remaining types, I suspect should be defined as:
[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
[KnownType(typeof(Dictionary<string, object>))]
public class EventData
{
[DataMember]
public IDictionary<string, object> SystemProperties { get; set; }
[DataMember]
public IDictionary<string, object> Properties { get; set; }
[DataMember]
public byte[] Body { get; set; }
}
Even though Body
is a union of null
and bytes
, this maps to a nullable
byte[]
.
In C#, arrays are always reference types so can be null
and the contract fulfilled.
You can also use NullableSchema
attribute to mark the Body as union of bytes and null. This will allow you to use the strongly typed interface.
[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
public class EventData
{
[DataMember(Name = "SequenceNumber")]
public long SequenceNumber { get; set; }
[DataMember(Name = "Offset")]
public string Offset { get; set; }
[DataMember(Name = "EnqueuedTimeUtc")]
public string EnqueuedTimeUtc { get; set; }
[DataMember(Name = "Body")]
[NullableSchema]
public foo Body { get; set; }
}
For the people having troubles in serialization/deserialization Apache Avro data in C# I've created a small library which is an interface for Microsoft.Hadoop.Avro:
https://github.com/AdrianStrugala/AvroConvert
https://www.nuget.org/packages/AvroConvert
The usage is as simple as:
byte[] avroFileContent = File.ReadAllBytes(fileName);
Dictionary<string, object> result = AvroConvert.Deserialize(avroFileContent);
//or if u know the model of the data
MyModel result = AvroConvert.Deserialize<MyModel>(avroFileContent);
来源:https://stackoverflow.com/questions/39846833/deserialize-an-avro-file-with-c-sharp