avro

Does binary encoding of AVRO compress data?

Submitted by 跟風遠走 on 2019-12-04 00:44:36
In one of our projects we are using Kafka with Avro to transfer data across applications. Data is added to an Avro object and the object is binary encoded before being written to Kafka. We use binary encoding because it is generally described as a minimal representation compared to other formats. The data is usually a JSON string, and when it is saved to a file it uses up to 10 MB of disk. However, when the file is compressed (.zip), it takes only a few KB. We are concerned about storing such data in Kafka, so we are trying to compress it before writing to a Kafka topic. When the length of the binary encoded message (i.e. the length of the byte …
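
Avro's binary encoding is a compact wire format, not a compression step, so repetitive JSON-like payloads still shrink dramatically once a codec is applied. A minimal sketch of one common approach, assuming a plain Java Kafka producer (the broker address and topic name below are placeholders), is to enable producer-side batch compression:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedAvroProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        // Compress whole record batches on the producer; Avro binary encoding alone does not compress.
        props.put("compression.type", "gzip");                          // "snappy" or "lz4" also work

        byte[] avroBytes = new byte[0]; // stand-in for your binary-encoded Avro payload
        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", avroBytes));  // "events" is a placeholder topic
        }
    }
}
```

With the broker's default compression.type of "producer", the batches are stored as the producer compressed them, which usually addresses the storage concern without changing the Avro encoding itself.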

Hive create table with inputs from nested sub-directories

Submitted by 邮差的信 on 2019-12-04 00:42:32
I have data in Avro format in HDFS, in file paths like /data/logs/[foldername]/[filename].avro . I want to create a Hive table over all of these log files, i.e. all files of the form /data/logs/*/* . (They are all based on the same Avro schema.) I am running the query below with the flag mapred.input.dir.recursive=true : CREATE EXTERNAL TABLE default.testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro …
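
For reference, a hedged sketch of the usual recipe: alongside mapred.input.dir.recursive, Hive typically also needs hive.mapred.supports.subdirectories=true before the table scan will descend into per-folder subdirectories. The snippet below issues the statements over Hive JDBC from Java; the HiveServer2 URL and the avro.schema.url value are placeholders, and the DDL mirrors the (truncated) statement in the question.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateRecursiveAvroTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 URL; adjust host, port and database for your cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default");
             Statement stmt = conn.createStatement()) {
            // Both flags are commonly required for Hive to read files in nested subdirectories.
            stmt.execute("SET mapred.input.dir.recursive=true");
            stmt.execute("SET hive.mapred.supports.subdirectories=true");
            stmt.execute("CREATE EXTERNAL TABLE default.testtable "
                    + "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' "
                    + "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' "
                    + "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' "
                    + "LOCATION '/data/logs/' "
                    + "TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/log.avsc')"); // placeholder schema location
        }
    }
}
```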

Use schema to convert AVRO messages with Spark to DataFrame

Submitted by 你离开我真会死。 on 2019-12-03 17:24:52
Is there a way to use a schema to convert Avro messages from Kafka with Spark into a DataFrame? The schema file for user records: { "fields": [ { "name": "firstName", "type": "string" }, { "name": "lastName", "type": "string" } ], "name": "user", "type": "record" } And code snippets from the SqlNetworkWordCount example and "Kafka, Spark and Avro - Part 3, Producing and consuming Avro messages" to read in the messages: object Injection { val parser = new Schema.Parser() val schema = parser.parse(getClass.getResourceAsStream("/user_schema.json")) val injection: Injection[GenericRecord, Array[Byte]] = …
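
A minimal sketch of the decode-then-build-a-DataFrame step, written in Java rather than the question's Scala so the Avro calls are explicit; it assumes the two-field user schema above and fakes a single Kafka message as a byte array. In a real streaming job the decoding would run inside a map over the records pulled from Kafka.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class AvroBytesToDataFrame {
    private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"user\",\"fields\":["
            + "{\"name\":\"firstName\",\"type\":\"string\"},"
            + "{\"name\":\"lastName\",\"type\":\"string\"}]}");

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("avro-to-df").master("local[*]").getOrCreate();

        byte[] avroBytes = sampleMessage(); // stand-in for one value consumed from Kafka

        // Decode the Avro-encoded bytes back into a GenericRecord using the same schema.
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(SCHEMA);
        GenericRecord record = reader.read(null, DecoderFactory.get().binaryDecoder(avroBytes, null));

        // Map the record's fields onto a Spark Row and build a DataFrame with a matching StructType.
        StructType sparkSchema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("firstName", DataTypes.StringType, false),
                DataTypes.createStructField("lastName", DataTypes.StringType, false)));
        List<Row> rows = Arrays.asList(
                RowFactory.create(record.get("firstName").toString(), record.get("lastName").toString()));
        Dataset<Row> df = spark.createDataFrame(rows, sparkSchema);
        df.show();
        spark.stop();
    }

    // Encodes one sample record so the example is self-contained.
    private static byte[] sampleMessage() throws Exception {
        GenericRecord user = new GenericData.Record(SCHEMA);
        user.put("firstName", "Jane");
        user.put("lastName", "Doe");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(user, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```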

Reading a simple Avro file from HDFS

Submitted by 拈花ヽ惹草 on 2019-12-03 14:37:49
I am trying to do a simple read of an Avro file stored in HDFS. I found out how to read it when it is on the local file system: FileReader<GenericRecord> reader = DataFileReader.openReader(new File(filename), new GenericDatumReader<GenericRecord>()); for (GenericRecord datum : reader) { String value = datum.get(1).toString(); System.out.println("value = " + value); } reader.close(); My file is in HDFS, however, and I cannot give openReader a Path or an FSDataInputStream. How can I simply read an Avro file in HDFS? EDIT: I got this to work by creating a custom class (SeekableHadoopInput) that implements SeekableInput.
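
Apart from a hand-rolled SeekableInput, Avro ships org.apache.avro.mapred.FsInput, which wraps a Hadoop FileSystem stream and can be handed straight to DataFileReader. A minimal sketch (the HDFS path below is a placeholder):

```java
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ReadAvroFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
        Path path = new Path("hdfs:///tmp/example.avro"); // placeholder path
        // FsInput implements SeekableInput over a Hadoop FileSystem, which DataFileReader accepts.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(new FsInput(path, conf), new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord datum : reader) {
                System.out.println("value = " + datum.get(1));
            }
        }
    }
}
```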

How to avro binary encode my json string to a byte array?

Submitted by ≡放荡痞女 on 2019-12-03 13:22:38
I have an actual JSON string which I need to Avro binary encode to a byte array. After going through the Apache Avro specification, I came up with the code below. I am not sure whether this is the right way to do it or not. Can anyone take a look at whether the way I am trying to Avro binary encode my JSON string is correct? I am using Apache Avro version 1.7.7. public class AvroTest { private static final String json = "{" + "\"name\":\"Frank\"," + "\"age\":47" + "}"; private static final String schema = "{ \"type\":\"record\", \"namespace\":\"foo\", \"name\":\"Person\", \"fields\":[ { \ …
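
One route that stays entirely inside Avro 1.7.x is to let the JSON decoder map the text onto a GenericRecord and then re-serialize it with the binary encoder. A sketch under the assumption that the Person schema has exactly the two fields shown (name: string, age: int):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonDecoder;

public class JsonToAvroBytes {
    public static void main(String[] args) throws Exception {
        String json = "{\"name\":\"Frank\",\"age\":47}";
        // Assumed schema matching the record described in the question.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"namespace\":\"foo\",\"name\":\"Person\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // 1. Parse the JSON text into a GenericRecord using Avro's JSON decoder.
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        JsonDecoder jsonDecoder = DecoderFactory.get().jsonDecoder(schema, json);
        GenericRecord record = reader.read(null, jsonDecoder);

        // 2. Re-encode the record with the binary encoder.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();

        byte[] avroBytes = out.toByteArray();
        System.out.println("encoded length = " + avroBytes.length);
    }
}
```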

Generate Avro Schema from certain Java Object

Submitted by 戏子无情 on 2019-12-03 10:56:28
Apache Avro provides a compact, fast binary data format and rich data structures for serialization. However, it requires the user to define a schema (in JSON) for each object that needs to be serialized. In some cases this is not possible (e.g. the class of that Java object has some members whose types are external Java classes in external libraries). Hence, I wonder whether there is a tool that can get the information from an object's .class file and generate the Avro schema for that object (like Gson uses an object's …
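
Avro's reflection support covers part of this: org.apache.avro.reflect.ReflectData can derive a schema from a class at runtime, although it can still fail on member types it cannot map. A minimal sketch with a hypothetical POJO:

```java
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class SchemaFromClass {
    // Hypothetical POJO standing in for "a certain Java object".
    public static class Person {
        public String name;
        public int age;
    }

    public static void main(String[] args) {
        // Builds a schema from the class's fields via reflection, similar in spirit to how Gson inspects objects.
        Schema schema = ReflectData.get().getSchema(Person.class);
        System.out.println(schema.toString(true));
    }
}
```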

Polymorphism and inheritance in Avro schemas

Submitted by 久未见 on 2019-12-03 06:50:07
Is it possible to write an Avro schema/IDL that will generate a Java class that either extends a base class or implements an interface? It seems like the generated Java class extends org.apache.avro.specific.SpecificRecordBase, so implementing an interface might be the way to go, but I don't know if this is possible. I have seen examples with suggestions to define an explicit "type" field in each specific schema, with more of an association than inheritance semantics. I use my base class heavily …
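
The explicit "type" field workaround mentioned above usually looks roughly like the following hypothetical schema sketch (not generated from the question's classes); the generated class still extends SpecificRecordBase, so the field only records which concrete kind a record is rather than giving you Java inheritance:

```json
{
  "type": "record",
  "name": "Animal",
  "namespace": "example",
  "fields": [
    {"name": "type", "type": {"type": "enum", "name": "AnimalType", "symbols": ["DOG", "CAT"]}},
    {"name": "name", "type": "string"}
  ]
}
```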

Avro with Java 8 dates as logical type

Submitted by 社会主义新天地 on 2019-12-03 05:17:36
The latest Avro compiler (1.8.2) generates Java sources for date logical types with Joda-Time based implementations. How can I configure the Avro compiler to produce sources that use the Java 8 date-time API? Currently (Avro 1.8.2) this is not possible; it is hardcoded to generate Joda date/time classes. The current master branch has switched to Java 8, and there is an open issue (with a pull request) to add the ability to generate classes with java.time.* types. I have no idea about any kind of release schedule for whatever is currently in master, unfortunately. If you feel adventurous, you can apply the patch …
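
Until the java.time generation lands in a release, one workaround (a sketch, not an official Avro 1.8.2 option) is to read records with the generic API and convert the raw logical-type values yourself: the date logical type is stored as an int of days since the epoch, and timestamp-millis as a long of epoch milliseconds. The field names below are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDate;

import org.apache.avro.generic.GenericRecord;

public class LogicalTypesToJavaTime {
    // Assumes the record was read with GenericDatumReader and no logical-type conversions registered,
    // so the logical types arrive as their raw int/long encodings.
    static LocalDate birthDate(GenericRecord record) {
        return LocalDate.ofEpochDay((Integer) record.get("birthDate")); // "date": days since 1970-01-01
    }

    static Instant createdAt(GenericRecord record) {
        return Instant.ofEpochMilli((Long) record.get("createdAt"));    // "timestamp-millis": epoch milliseconds
    }
}
```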
