问题
I can't find a way to deserialize an Apache Avro file with C#. The Avro file is a file generated by the Archive feature in Microsoft Azure Event Hubs.
With Java I can use Avro Tools from Apache to convert the file to JSON:
java -jar avro-tools-1.8.1.jar tojson --pretty inputfile > output.json
Using NuGet package Microsoft.Hadoop.Avro I am able to extract SequenceNumber
, Offset
and EnqueuedTimeUtc
, but since I don't know what type to use for Body
an exception is thrown. I've tried with Dictionary<string, object>
and other types.
static void Main(string[] args)
{
var fileName = "...";
using (Stream stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
{
using (var reader = AvroContainer.CreateReader<EventData>(stream))
{
using (var streamReader = new SequentialReader<EventData>(reader))
{
var record = streamReader.Objects.FirstOrDefault();
}
}
}
}
[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
public class EventData
{
[DataMember(Name = "SequenceNumber")]
public long SequenceNumber { get; set; }
[DataMember(Name = "Offset")]
public string Offset { get; set; }
[DataMember(Name = "EnqueuedTimeUtc")]
public string EnqueuedTimeUtc { get; set; }
[DataMember(Name = "Body")]
public foo Body { get; set; }
// More properties...
}
The schema looks like this:
{
"type": "record",
"name": "EventData",
"namespace": "Microsoft.ServiceBus.Messaging",
"fields": [
{
"name": "SequenceNumber",
"type": "long"
},
{
"name": "Offset",
"type": "string"
},
{
"name": "EnqueuedTimeUtc",
"type": "string"
},
{
"name": "SystemProperties",
"type": {
"type": "map",
"values": [ "long", "double", "string", "bytes" ]
}
},
{
"name": "Properties",
"type": {
"type": "map",
"values": [ "long", "double", "string", "bytes" ]
}
},
{
"name": "Body",
"type": [ "null", "bytes" ]
}
]
}
回答1:
I was able to get full data access working using dynamic
. Here's the code for accessing the raw body
data, which is stored as an array of bytes. In my case, those bytes contain UTF8-encoded JSON, but of course it depends on how you initially created your EventData
instances that you published to the Event Hub:
using (var reader = AvroContainer.CreateGenericReader(stream))
{
while (reader.MoveNext())
{
foreach (dynamic record in reader.Current.Objects)
{
var sequenceNumber = record.SequenceNumber;
var bodyText = Encoding.UTF8.GetString(record.Body);
Console.WriteLine($"{sequenceNumber}: {bodyText}");
}
}
}
If someone can post a statically-typed solution, I'll upvote it, but given that the bigger latency in any system will almost certainly be the connection to the Event Hub Archive blobs, I wouldn't worry about parsing performance. :)
回答2:
This Gist shows how to deserialize an event hub capture with C# using Microsoft.Hadoop.Avro2, which has the advantage of being both .NET Framework 4.5 and .NET Standard 1.6 compliant:
var connectionString = "<Azure event hub capture storage account connection string>";
var containerName = "<Azure event hub capture container name>";
var blobName = "<Azure event hub capture BLOB name (ends in .avro)>";
var storageAccount = CloudStorageAccount.Parse(connectionString);
var blobClient = storageAccount.CreateCloudBlobClient();
var container = blobClient.GetContainerReference(containerName);
var blob = container.GetBlockBlobReference(blobName);
using (var stream = blob.OpenRead())
using (var reader = AvroContainer.CreateGenericReader(stream))
while (reader.MoveNext())
foreach (dynamic result in reader.Current.Objects)
{
var record = new AvroEventData(result);
record.Dump();
}
public struct AvroEventData
{
public AvroEventData(dynamic record)
{
SequenceNumber = (long) record.SequenceNumber;
Offset = (string) record.Offset;
DateTime.TryParse((string) record.EnqueuedTimeUtc, out var enqueuedTimeUtc);
EnqueuedTimeUtc = enqueuedTimeUtc;
SystemProperties = (Dictionary<string, object>) record.SystemProperties;
Properties = (Dictionary<string, object>) record.Properties;
Body = (byte[]) record.Body;
}
public long SequenceNumber { get; set; }
public string Offset { get; set; }
public DateTime EnqueuedTimeUtc { get; set; }
public Dictionary<string, object> SystemProperties { get; set; }
public Dictionary<string, object> Properties { get; set; }
public byte[] Body { get; set; }
}
NuGet references:
- Microsoft.Hadoop.Avro2 (1.2.1 works)
- WindowsAzure.Storage (8.3.0 works)
Namespaces:
- Microsoft.Hadoop.Avro.Container
- Microsoft.WindowsAzure.Storage
回答3:
I was finally able to get this to work with the Apache C# library / framework.
I was stuck for a while because the Capture feature of the Azure Event Hubs sometimes outputs a file without any message content.
I may have also had an issue with how the messages were originally serialized into the EventData object.
The code below was for a file saved to disk from a capture blob container.
var dataFileReader = DataFileReader<EventData>.OpenReader(file);
foreach (var record in dataFileReader.NextEntries)
{
// Do work on EventData object
}
This also works using the GenericRecord object.
var dataFileReader = DataFileReader<GenericRecord>.OpenReader(file);
This took some effort to figure out. However I now agree this Azure Event Hubs Capture feature is a great feature to backup all events. I still feel they should make the format optional like they did with Stream Analytic job output but maybe I will get used to Avro.
回答4:
Your remaining types, I suspect should be defined as:
[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
[KnownType(typeof(Dictionary<string, object>))]
public class EventData
{
[DataMember]
public IDictionary<string, object> SystemProperties { get; set; }
[DataMember]
public IDictionary<string, object> Properties { get; set; }
[DataMember]
public byte[] Body { get; set; }
}
Even though Body
is a union of null
and bytes
, this maps to a nullable
byte[]
.
In C#, arrays are always reference types so can be null
and the contract fulfilled.
回答5:
You can also use NullableSchema
attribute to mark the Body as union of bytes and null. This will allow you to use the strongly typed interface.
[DataContract(Namespace = "Microsoft.ServiceBus.Messaging")]
public class EventData
{
[DataMember(Name = "SequenceNumber")]
public long SequenceNumber { get; set; }
[DataMember(Name = "Offset")]
public string Offset { get; set; }
[DataMember(Name = "EnqueuedTimeUtc")]
public string EnqueuedTimeUtc { get; set; }
[DataMember(Name = "Body")]
[NullableSchema]
public foo Body { get; set; }
}
回答6:
For the people having troubles in serialization/deserialization Apache Avro data in C# I've created a small library which is an interface for Microsoft.Hadoop.Avro:
https://github.com/AdrianStrugala/AvroConvert
https://www.nuget.org/packages/AvroConvert
The usage is as simple as:
byte[] avroFileContent = File.ReadAllBytes(fileName);
Dictionary<string, object> result = AvroConvert.Deserialize(avroFileContent);
//or if u know the model of the data
MyModel result = AvroConvert.Deserialize<MyModel>(avroFileContent);
来源:https://stackoverflow.com/questions/39846833/deserialize-an-avro-file-with-c-sharp