avro

What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?

Submitted by 我们两清 on 2019-11-28 15:06:36
All of these provide binary serialization, RPC frameworks and an IDL. I'm interested in the key differences between them and their characteristics (performance, ease of use, programming-language support). If you know of any other similar technologies, please mention them in an answer. Answer (JUST MY correct OPINION): ASN.1 is an ISO/IEC standard. It has a very readable source language and a variety of back-ends, both binary and human-readable. Being an international standard (and an old one at that!), the source language is a bit kitchen-sinkish (in about the same way that the Atlantic Ocean is a bit wet), but it is…

Deserialize an Avro file with C#

Submitted by 浪子不回头ぞ on 2019-11-28 09:56:05
I can't find a way to deserialize an Apache Avro file with C#. The Avro file is a file generated by the Archive feature in Microsoft Azure Event Hubs. With Java I can use Avro Tools from Apache to convert the file to JSON: java -jar avro-tools-1.8.1.jar tojson --pretty inputfile > output.json. Using the NuGet package Microsoft.Hadoop.Avro I am able to extract SequenceNumber, Offset and EnqueuedTimeUtc, but since I don't know what type to use for Body, an exception is thrown. I've tried Dictionary<string, object> and other types. static void Main(string[] args) { var fileName = "..."; using…
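For cross-checking the file's layout, the same capture file can be read with the Java Avro library that avro-tools wraps. A minimal Scala sketch (not the C# solution the question is after; the field names SequenceNumber and Body come from the question, and the bytes-typed Body is an assumption about the Event Hubs capture schema):

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

object InspectCapture {
  def main(args: Array[String]): Unit = {
    // Placeholder path to an Event Hubs capture file.
    val reader = new DataFileReader[GenericRecord](new File("capture.avro"), new GenericDatumReader[GenericRecord]())
    try {
      while (reader.hasNext) {
        val record = reader.next()
        // Assuming Body is declared as Avro "bytes", it comes back as a ByteBuffer.
        val body = record.get("Body").asInstanceOf[java.nio.ByteBuffer]
        val bytes = new Array[Byte](body.remaining())
        body.get(bytes)
        println(s"${record.get("SequenceNumber")} -> ${new String(bytes, StandardCharsets.UTF_8)}")
      }
    } finally {
      reader.close()
    }
  }
}
```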

Create Hive table to read parquet files from parquet/avro schema

Submitted by 大城市里の小女人 on 2019-11-28 09:30:21
We are looking for a way to create an external Hive table that reads data from parquet files according to a parquet/avro schema. Put another way, how do we generate a Hive table from a parquet/avro schema? Thanks :) Answer (Ram Manohar): Try the following, using the Avro schema: CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc'); CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath'; The same question is asked in Dynamically create Hive external table with…

Schema evolution in parquet format

Submitted by 倖福魔咒の on 2019-11-28 06:49:57
Currently we are using the Avro data format in production. Among the several good points of using Avro, we know that it is good at schema evolution. Now we are evaluating the Parquet format because of its efficiency when reading random columns. So before moving forward, our concern is still schema evolution. Does anyone know whether schema evolution is possible in Parquet? If yes, how is it possible; if no, why not? Some resources claim that it is possible, but that it can only add columns at the end. What does this mean? Answer: Schema evolution can be (very) expensive. In order to figure out the schema, you basically have to…
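For context on what "figuring out the schema" costs in practice, here is a minimal Spark sketch of Parquet schema merging (the paths and column names are invented for illustration): with mergeSchema enabled, Spark reads the footer of every Parquet file and unions the schemas, which is exactly the expensive step the answer refers to.

```scala
import org.apache.spark.sql.SparkSession

object ParquetEvolutionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-evolution").master("local[*]").getOrCreate()
    import spark.implicits._

    // Two generations of the same dataset; the second adds a column at the end.
    Seq((1, "a")).toDF("id", "value").write.parquet("/tmp/evo/v1")
    Seq((2, "b", 3.14)).toDF("id", "value", "score").write.parquet("/tmp/evo/v2")

    // mergeSchema makes Spark read every file footer and union the schemas.
    val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/evo/v1", "/tmp/evo/v2")
    merged.printSchema() // id, value, score; score is null for rows written before it existed
    spark.stop()
  }
}
```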

How to convert RDD[GenericRecord] to dataframe in scala?

Submitted by 試著忘記壹切 on 2019-11-28 06:19:49
Question: I get tweets from a Kafka topic with Avro (serializer and deserializer). Then I create a Spark consumer which extracts tweets into a DStream of RDD[GenericRecord]. Now I want to convert each RDD to a DataFrame to analyse these tweets via SQL. Any solution to convert RDD[GenericRecord] to a DataFrame, please? Answer 1: I spent some time trying to make this work (especially how to deserialize the data properly, but it looks like you already have that covered) ... UPDATED //Define function to convert from GenericRecord…
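One common pattern (a rough sketch, not the answer's exact code; the tweet field names id and text here are invented) is to define a StructType that mirrors the Avro schema and map every GenericRecord to a Row before calling createDataFrame:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object GenericRecordToDF {
  // Hypothetical tweet layout: adjust to the real Avro schema.
  val tweetStruct: StructType = StructType(Seq(
    StructField("id", LongType, nullable = true),
    StructField("text", StringType, nullable = true)
  ))

  def toDataFrame(spark: SparkSession, rdd: RDD[GenericRecord]): DataFrame = {
    // Avro strings arrive as org.apache.avro.util.Utf8, so call toString on them.
    val rows = rdd.map { rec =>
      Row(rec.get("id").asInstanceOf[Long], Option(rec.get("text")).map(_.toString).orNull)
    }
    spark.createDataFrame(rows, tweetStruct)
  }
}
```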

How to nest records in an Avro schema?

Submitted by 笑着哭i on 2019-11-28 05:50:26
I'm trying to get Python to parse Avro schemas such as the following... from avro import schema mySchema = """ { "name": "person", "type": "record", "fields": [ {"name": "firstname", "type": "string"}, {"name": "lastname", "type": "string"}, { "name": "address", "type": "record", "fields": [ {"name": "streetaddress", "type": "string"}, {"name": "city", "type": "string"} ] } ] }""" parsedSchema = schema.parse(mySchema) ...and I get the following exception: avro.schema.SchemaParseException: Type property "record" not a valid Avro schema: Could not make an Avro Schema object from record. What am
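The usual cause of this exception is that a nested record's "type" must itself be a complete schema object (with its own "type": "record", "name", and "fields"), not the bare string "record" with a sibling "fields" key. A sketch of the corrected schema, validated here with the JVM Avro parser (the question uses Python's avro.schema.parse, but the schema fix itself is language-independent; the nested type name AddressRecord is an arbitrary choice):

```scala
import org.apache.avro.Schema

object NestedSchemaCheck {
  // Corrected schema: the "address" field's type is a full inline record schema.
  val personSchema: String =
    """{
      |  "name": "person",
      |  "type": "record",
      |  "fields": [
      |    {"name": "firstname", "type": "string"},
      |    {"name": "lastname",  "type": "string"},
      |    {"name": "address", "type": {
      |        "type": "record",
      |        "name": "AddressRecord",
      |        "fields": [
      |          {"name": "streetaddress", "type": "string"},
      |          {"name": "city",          "type": "string"}
      |        ]
      |    }}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    // Throws SchemaParseException if the schema is malformed; pretty-prints it otherwise.
    val parsed = new Schema.Parser().parse(personSchema)
    println(parsed.toString(true))
  }
}
```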

How to convert nested avro GenericRecord to Row

Submitted by 对着背影说爱祢 on 2019-11-28 04:26:53
Question: I have code to convert my avro record to a Row using the function avroToRowConverter(): directKafkaStream.foreachRDD(rdd -> { JavaRDD<Row> newRDD = rdd.map(x -> { Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(SchemaRegstryClient.getLatestSchema("poc2")); return avroToRowConverter(recordInjection.invert(x._2).get()); }); This function is not working for nested schemas (TYPE = UNION). private static Row avroToRowConverter(GenericRecord avroRecord) { if (null == avroRecord…
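One way to handle the UNION case (a sketch of the general pattern, not the question's avroToRowConverter) is to resolve the non-null branch of a nullable union and recurse, so that nested records become nested Rows:

```scala
import scala.collection.JavaConverters._
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row

object AvroRowSketch {
  /** Convert one Avro value into something Row-friendly, recursing into nested records. */
  def convertValue(value: Any, schema: Schema): Any = schema.getType match {
    case Schema.Type.UNION =>
      // Typical nullable union ["null", X]: pick the non-null branch and recurse.
      val branch = schema.getTypes.asScala
        .find(_.getType != Schema.Type.NULL)
        .getOrElse(schema.getTypes.get(0))
      if (value == null) null else convertValue(value, branch)
    case Schema.Type.RECORD =>
      convertRecord(value.asInstanceOf[GenericRecord])
    case Schema.Type.STRING =>
      if (value == null) null else value.toString // Avro Utf8 -> java String
    case _ =>
      value
  }

  def convertRecord(record: GenericRecord): Row = {
    val fields = record.getSchema.getFields.asScala
    Row.fromSeq(fields.map(f => convertValue(record.get(f.name()), f.schema())))
  }
}
```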

Avro vs. Parquet

Submitted by a 夏天 on 2019-11-28 03:18:04
I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all of the columns' data! Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain them to me in simple terms? Answer: If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g. job.setOutputFormatClass…
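The answer's job.setOutputFormatClass remark is about classic MapReduce; in Spark the equivalent swap is just the output format name, assuming the spark-avro module is on the classpath. A minimal sketch (paths are placeholders):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object FormatSwapSketch {
  // Same DataFrame, two container formats: only the format string changes.
  def writeBoth(df: DataFrame, basePath: String): Unit = {
    // Requires the spark-avro module (built in as "avro" since Spark 2.4;
    // "com.databricks.spark.avro" in older releases).
    df.write.mode(SaveMode.Overwrite).format("avro").save(s"$basePath/avro")
    df.write.mode(SaveMode.Overwrite).parquet(s"$basePath/parquet")
  }
}
```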

What are the pros and cons of parquet format compared to other formats?

Submitted by 好久不见. on 2019-11-28 02:39:21
Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, Sequence Files, RC File etc., I want an overview of the formats. I have already read How Impala Works with Hadoop File Formats; it gives some insights into the formats, but I would like to know how the access to data and the storage of data is done in each of these formats. How does Parquet have an advantage over the others? Answer: I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to -- text files,…
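To make the column-oriented point concrete, here is a small Spark sketch (the path and column names are invented): selecting two columns from a wide Parquet table only reads those columns' data, whereas a record-oriented format would have to scan whole rows.

```scala
import org.apache.spark.sql.SparkSession

object ColumnPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pruning").master("local[*]").getOrCreate()

    // With a columnar format, selecting a few columns out of many only reads
    // those columns' pages from disk; row-oriented formats read entire records.
    val df = spark.read.parquet("/data/events") // hypothetical path
    df.select("user_id", "event_time")
      .explain() // the physical plan's ReadSchema is limited to the selected columns
    spark.stop()
  }
}
```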

Avro Schema to spark StructType

Submitted by 梦想的初衷 on 2019-11-28 01:45:29
Question: This is effectively the same as my previous question, but using Avro rather than JSON as the data format. I'm working with a Spark DataFrame which could be loading data from one of a few different schema versions: // Version One {"namespace": "com.example.avro", "type": "record", "name": "MeObject", "fields": [ {"name": "A", "type": ["null", "int"], "default": null} ] } // Version Two {"namespace": "com.example.avro", "type": "record", "name": "MeObject", "fields": [ {"name": "A", "type": […
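A common way to get a Spark StructType from an Avro schema string like the ones above is the SchemaConverters helper shipped with spark-avro. A minimal sketch (the package is org.apache.spark.sql.avro in Spark 2.4+, com.databricks.spark.avro in the older Databricks module):

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters // com.databricks.spark.avro in older releases
import org.apache.spark.sql.types.StructType

object AvroToStructType {
  def toStructType(avroSchemaJson: String): StructType = {
    val avroSchema = new Schema.Parser().parse(avroSchemaJson)
    // toSqlType returns a SchemaType whose dataType is a StructType for record schemas.
    SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
  }
}
```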