How to get the start and end of each Avro record in a compressed Avro file?

Question

I have a 2 GB Snappy-compressed Avro file, stored on HDFS, that contains about 1000 Avro records. I know I can write code to open this file and print out each record. My question: is there a way in Java to open this Avro file, iterate through its records, and write to a text file the start position and end position of each record within the file? The goal is a Java function such as readRecord(startPosition, endPosition) that reads one specific record quickly, without iterating through the whole file.

Answer 1:

You could compress each record individually. The compression ratio won't be as good, but you get random access.

I suggest using a ZIP or JAR format.

Give each record a notional file name; it could be just a number.
Write the serialized data to the JAR as the contents of that entry.

When you want random access:

Open the JAR.
Look up the entry by name.
Read it and deserialize.

Each entry is compressed on its own, as efficiently as the format allows for that entry.
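The steps above can be sketched with java.util.zip alone. This is a minimal sketch, not a definitive implementation: the class name RecordArchive and the choice of the record index as the entry name are assumptions made for illustration, and the records are assumed to be already serialized to byte arrays (for example with Avro's binary encoder). Note that ZipFile needs a local file, so a file kept on HDFS would have to be copied locally or read through a different ZIP reader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: stores each serialized record as its own ZIP entry. */
final class RecordArchive {

    /** Writes the records to a ZIP file; entry names are the record indices "0", "1", ... */
    static void write(Path zipPath, List<byte[]> serializedRecords) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            for (int i = 0; i < serializedRecords.size(); i++) {
                out.putNextEntry(new ZipEntry(Integer.toString(i)));
                out.write(serializedRecords.get(i));
                out.closeEntry();
            }
        }
    }

    /** Reads one record by index; only that entry is decompressed. */
    static byte[] read(Path zipPath, int index) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(Integer.toString(index));
            if (entry == null) {
                throw new IOException("No record with index " + index);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes();
            }
        }
    }
}
```

Calling RecordArchive.read(path, 42) would then return the bytes of record 42, ready to be deserialized. The ZIP central directory plays the role of the start/end position table asked for in the question.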

Answer 2:

I don't have time to provide an off-the-shelf implementation, but I think I can give you some hints.

Let's start with the Avro Specification: Object Container Files

Basically, an Avro file is a sequence of self-contained blocks, each containing one or more records (you can configure the block size, and a record is never split across two blocks). Each block consists of:

A long indicating the count of objects in this block.
A long indicating the size in bytes of the serialized objects in the current block, after any codec is applied.
The serialized objects. If a codec is specified, this is compressed by that codec.
The file's 16-byte sync marker.

The documentation explicitly states: "Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents. The combination of block size, object counts, and sync markers enable detection of corrupt blocks and help ensure data integrity."
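To illustrate how blocks can be skipped without deserializing them, here is a minimal sketch that walks a local container file and records where each block starts, using only the JDK. The class AvroBlockIndex and its Block type are hypothetical names introduced for this example. It assumes the header layout from the same specification (the magic bytes "Obj" followed by 1, a metadata map, then the sync marker) and Avro's variable-length zig-zag encoding for longs; it does not verify sync markers or decompress anything.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that lists the blocks of an Avro object container file. */
final class AvroBlockIndex {
    static final int SYNC_SIZE = 16;

    /** One block: where its header starts, how many records it holds, payload size in bytes. */
    static final class Block {
        final long offset;
        final long recordCount;
        final long byteSize;

        Block(long offset, long recordCount, long byteSize) {
            this.offset = offset;
            this.recordCount = recordCount;
            this.byteSize = byteSize;
        }
    }

    /** Decodes one Avro long (variable-length, zig-zag encoded). */
    static long readLong(RandomAccessFile in) throws IOException {
        long raw = 0;
        int shift = 0;
        int b;
        do {
            if (shift > 63) {
                throw new IOException("Invalid variable-length long");
            }
            b = in.readUnsignedByte();
            raw |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (raw >>> 1) ^ -(raw & 1);
    }

    /** Moves the file pointer forward by n bytes. */
    static void skip(RandomAccessFile in, long n) throws IOException {
        if (n < 0) {
            throw new IOException("Negative length, file is probably corrupt");
        }
        in.seek(in.getFilePointer() + n);
    }

    /** Skips the file header: magic bytes, metadata map, and the sync marker. */
    static void skipHeader(RandomAccessFile in) throws IOException {
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (magic[0] != 'O' || magic[1] != 'b' || magic[2] != 'j' || magic[3] != 1) {
            throw new IOException("Not an Avro object container file");
        }
        // The metadata is an Avro map: a series of blocks ended by a zero count.
        for (long count = readLong(in); count != 0; count = readLong(in)) {
            if (count < 0) {
                // A negative count is followed by the map block's size in bytes.
                skip(in, readLong(in));
                continue;
            }
            for (long i = 0; i < count; i++) {
                skip(in, readLong(in)); // key: length-prefixed string
                skip(in, readLong(in)); // value: length-prefixed bytes
            }
        }
        skip(in, SYNC_SIZE);
    }

    /** Returns every data block in file order, reading only the two longs at its start. */
    static List<Block> index(RandomAccessFile in) throws IOException {
        in.seek(0);
        skipHeader(in);
        List<Block> blocks = new ArrayList<>();
        long length = in.length();
        while (in.getFilePointer() < length) {
            long offset = in.getFilePointer();
            long count = readLong(in);
            long size = readLong(in);
            skip(in, size);      // compressed payload, left untouched
            skip(in, SYNC_SIZE); // sync marker that closes the block
            blocks.add(new Block(offset, count, size));
        }
        return blocks;
    }
}
```

With such a list, locating record k means summing recordCount until the running total passes k, then reading only that block. Each offset is the position just after the previous sync marker, which should be the kind of position the Avro reader's seek method expects, though that is worth verifying against the Avro version in use.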

You cannot seek directly to a specific record, but you can seek to a given block and then iterate over its objects. It is not exactly what you need, but it seems close enough. I believe you won't be able to do much better than that with Avro containers. You can still tune the block size to bound the maximum number of iterations within a block. When compression is used, it is applied at the block level, so it won't be an issue.

I believe that such a reader can be implemented using only the public Avro API (DataFileReader provides seek and sync methods, among others).

Source: https://stackoverflow.com/questions/32528644/how-to-get-start-end-and-end-of-each-avro-record-in-a-compressed-avro-file

Tags

java

avro