avro

Write POJOs to a Parquet file using reflection

Submitted by 假如想象 on 2019-12-18 17:14:56
Question: Hi, I'm looking for APIs to write Parquet files from POJOs that I have. I was able to generate an Avro schema using reflection and then create a Parquet schema using AvroSchemaConverter. However, I am not able to find a way to convert the POJOs to GenericRecords (Avro); otherwise I could have used AvroParquetWriter to write the POJOs out to Parquet files. Any suggestions? Answer 1: If you want to go through Avro, you have two options: 1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs

Spark DataFrame write to Kafka topic in Avro format?

Submitted by 懵懂的女人 on 2019-12-18 13:54:15
Question: I have a DataFrame in Spark, eventDF, that looks like:

    Sno|UserID|TypeExp
    1|JAS123|MOVIE
    2|ASP123|GAMES
    3|JAS123|CLOTHING
    4|DPS123|MOVIE
    5|DPS123|CLOTHING
    6|ASP123|MEDICAL
    7|JAS123|OTH
    8|POQ133|MEDICAL
    .......
    10000|DPS123|OTH

I need to write it to a Kafka topic in Avro format. Currently I am able to write to Kafka as JSON using the following code:

    val kafkaUserDF: DataFrame = eventDF.select(to_json(struct(eventDF.columns.map(column):_*)).alias("value"))
    kafkaUserDF.selectExpr("CAST(value AS STRING)")
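The rest of the post is cut off above; the following is a minimal sketch of one way to write the same DataFrame to Kafka as Avro, using the Python API of Spark's avro module (it assumes Spark 3.x with the org.apache.spark:spark-avro package on the classpath; the broker address and topic name are placeholders):

    # Sketch: serialize all columns as one Avro-encoded "value" column,
    # then write to Kafka. Assumes spark-avro is available, e.g.
    #   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 ...
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import struct
    from pyspark.sql.avro.functions import to_avro

    spark = SparkSession.builder.appName("df-to-kafka-avro").getOrCreate()

    eventDF = spark.createDataFrame(
        [(1, "JAS123", "MOVIE"), (2, "ASP123", "GAMES")],
        ["Sno", "UserID", "TypeExp"],
    )

    # Pack every column into a single struct and Avro-encode it
    avroDF = eventDF.select(to_avro(struct(*eventDF.columns)).alias("value"))

    (avroDF.write
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
        .option("topic", "user-events")                       # placeholder topic
        .save())

Note that to_avro emits plain Avro binary; consumers expecting the Confluent Schema Registry wire format additionally need the magic-byte/schema-id framing on top.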

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .json file)?

Submitted by 蹲街弑〆低调 on 2019-12-18 13:15:11
Question: Is it possible to have an optional field in an Avro schema (i.e. a field that does not appear at all in the .json file)? In my Avro schema, I have two fields:

    {"name": "author", "type": ["null", "string"], "default": null},
    {"name": "importance", "type": ["null", "string"], "default": null},

In my JSON files those two fields may or may not be present. However, when they are absent, I receive an error (e.g. when I test such a JSON file using the avro-tools command-line client): Expected field name not
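The answer is cut off above, but two facts help frame it: Avro "default" values apply during schema resolution, not when decoding data, and Avro's JSON encoding requires every field to be present (with union values wrapped, e.g. {"string": "..."}), which is why avro-tools rejects the file. On the Python side, however, fastavro falls back to the schema default for keys missing from a record dict, so such fields become effectively optional when you produce the data yourself. A minimal sketch, assuming the fastavro package:

    # Sketch using fastavro: when writing plain dicts, fastavro falls
    # back to the field's schema default for any missing key, so these
    # union-with-null fields behave as optional on the producer side.
    import io

    import fastavro

    schema = fastavro.parse_schema({
        "type": "record",
        "name": "Document",
        "fields": [
            {"name": "author", "type": ["null", "string"], "default": None},
            {"name": "importance", "type": ["null", "string"], "default": None},
        ],
    })

    records = [
        {"author": "alice", "importance": "high"},
        {},  # both fields absent -> the null defaults are written
    ]

    buf = io.BytesIO()
    fastavro.writer(buf, schema, records)

    buf.seek(0)
    for rec in fastavro.reader(buf):
        print(rec)  # second record: {'author': None, 'importance': None}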

How to extract the schema of an Avro file in Python

Submitted by 混江龙づ霸主 on 2019-12-18 13:07:55
Question: I am trying to use the Python Avro library (https://pypi.python.org/pypi/avro) to read an Avro file generated by Java. Since the schema is already embedded in the Avro file, why do I need to specify a schema file? Is there a way to extract it automatically? I found that another package called fastavro (https://pypi.python.org/pypi/fastavro) can extract the Avro schema. Is having to manually specify a schema file in the Python avro package by design? Thank you very much. Answer 1: I use Python 3.4 and Avro package 1.7.7
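The answer is truncated, but the embedded (writer) schema can indeed be pulled out of the container-file metadata with either library, so a separate schema file is only needed for schema resolution, not for plain reading. A minimal sketch, assuming an Avro container file at the placeholder path events.avro:

    # Sketch reading the embedded writer schema two ways; "events.avro"
    # is a placeholder path for any Avro container file.
    import json

    # Option 1: the reference avro package -- the schema travels in the
    # file metadata under the "avro.schema" key.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    reader = DataFileReader(open("events.avro", "rb"), DatumReader())
    print(json.loads(reader.meta["avro.schema"]))
    reader.close()

    # Option 2: fastavro exposes the same schema as writer_schema.
    import fastavro

    with open("events.avro", "rb") as fo:
        print(fastavro.reader(fo).writer_schema)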

MRUnit with Avro NullPointerException in Serialization

Submitted by 别来无恙 on 2019-12-18 08:47:16
Question: I'm trying to test a Hadoop .mapreduce Avro job using MRUnit. I am receiving a NullPointerException, as seen below. I've attached a portion of the POM and the source code. Any assistance would be appreciated. Thanks. The error I'm getting is:

    java.lang.NullPointerException
        at org.apache.hadoop.mrunit.internal.io.Serialization.copy(Serialization.java:73)
        at org.apache.hadoop.mrunit.internal.io.Serialization.copy(Serialization.java:91)
        at org.apache.hadoop.mrunit.internal.io.Serialization

How to Generate Parquet File Using Pure Java (Including Date & Decimal Types) And Upload to S3 [Windows] (No HDFS)

Submitted by 时光毁灭记忆、已成空白 on 2019-12-18 04:23:24
Question: I recently had a requirement to generate Parquet files that could be read by Apache Spark using only Java (with no additional software installations such as Apache Drill, Hive, Spark, etc.). The files needed to be saved to S3, so I will be sharing details on how to do both. There were no simple-to-follow guides on how to do this. I'm also not a Java programmer, so the concepts of using Maven, Hadoop, etc. were all foreign to me. So it took me nearly two weeks to get this

Unable to correctly load Twitter Avro data into a Hive table

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-17 20:27:35
Question: Need your help! I am trying a trivial exercise of getting data from Twitter and then loading it into Hive for analysis. Though I am able to get data into HDFS using Flume (with the Twitter 1% firehose source) and am also able to load the data into a Hive table, I cannot see all the columns I expected to be in the Twitter data, such as user_location, user_description, user_friends_count, and user_statuses_count. The schema derived from Avro only contains two columns

Encode an object with Avro to a byte array in Python

Submitted by *爱你&永不变心* on 2019-12-17 18:52:46
Question: In Python 2.7, using Avro, I'd like to encode an object to a byte array. All the examples I've found write to a file. I've tried using io.BytesIO(), but this gives:

    AttributeError: '_io.BytesIO' object has no attribute 'write_long'

Sample using io.BytesIO:

    def avro_encode(raw, schema):
        writer = DatumWriter(schema)
        avro_buffer = io.BytesIO()
        writer.write(raw, avro_buffer)
        return avro_buffer.getvalue()

Answer 1: Your question helped me figure things out, so thanks. Here's a simple Python example based on
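The answer is truncated, but it is pointing at the usual fix: DatumWriter.write expects an avro.io.BinaryEncoder wrapped around the buffer, not the raw BytesIO, which is why the call dies looking for write_long. A minimal sketch using the same Python 2 avro API as the question (the Event schema is a made-up example):

    # Sketch of the usual fix: wrap the buffer in avro.io.BinaryEncoder;
    # DatumWriter.write drives encoder methods such as write_long, which
    # a plain BytesIO does not provide. "Event" is a made-up schema.
    import io

    import avro.schema
    from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

    SCHEMA = avro.schema.parse(
        '{"type": "record", "name": "Event",'
        ' "fields": [{"name": "id", "type": "long"}]}'
    )

    def avro_encode(raw, schema):
        buf = io.BytesIO()
        DatumWriter(schema).write(raw, BinaryEncoder(buf))
        return buf.getvalue()

    def avro_decode(data, schema):
        return DatumReader(schema).read(BinaryDecoder(io.BytesIO(data)))

    payload = avro_encode({"id": 42}, SCHEMA)
    print(avro_decode(payload, SCHEMA))  # {'id': 42}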

How to nest records in an Avro schema?

Submitted by 半城伤御伤魂 on 2019-12-17 17:56:13
Question: I'm trying to get Python to parse Avro schemas such as the following...

    from avro import schema

    mySchema = """
    {
        "name": "person",
        "type": "record",
        "fields": [
            {"name": "firstname", "type": "string"},
            {"name": "lastname", "type": "string"},
            {
                "name": "address",
                "type": "record",
                "fields": [
                    {"name": "streetaddress", "type": "string"},
                    {"name": "city", "type": "string"}
                ]
            }
        ]
    }"""

    parsedSchema = schema.parse(mySchema)

...and I get the following exception: avro.schema.SchemaParseException:
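The exception text is cut off, but the usual cause is that "type": "record" cannot sit directly on a field; the field's "type" must itself be a complete record schema with its own name. A sketch of the corrected schema (AddressRecord is an arbitrary name chosen here for the nested type):

    # Sketch of the fix: the nested record is a full schema object placed
    # in the field's "type" position, with its own distinct "name".
    from avro import schema

    mySchema = """
    {
        "name": "person",
        "type": "record",
        "fields": [
            {"name": "firstname", "type": "string"},
            {"name": "lastname", "type": "string"},
            {"name": "address", "type": {
                "type": "record",
                "name": "AddressRecord",
                "fields": [
                    {"name": "streetaddress", "type": "string"},
                    {"name": "city", "type": "string"}
                ]
            }}
        ]
    }"""

    parsedSchema = schema.parse(mySchema)
    print(parsedSchema)  # parses cleanly and echoes the schema JSON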

Schema evolution in Parquet format

Submitted by 落花浮王杯 on 2019-12-17 17:49:06
Question: Currently we are using the Avro data format in production. Among the several good points of using Avro, we know that it is good at schema evolution. Now we are evaluating the Parquet format because of its efficiency when reading random columns. Before moving forward, our concern is still schema evolution. Does anyone know whether schema evolution is possible in Parquet? If yes, how; if not, why not? Some resources claim that it is possible, but that columns can only be added at the end. What does this
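The question is truncated, but the short version of the usual answer: Parquet files are immutable, so evolution happens at read time by merging the footers of files written with compatible schemas, and adding nullable columns is the safe case (Spark matches Parquet columns by name, so "only at the end" is not the constraint there). In Spark this is exposed through the mergeSchema option, as in this PySpark sketch (the path is a placeholder):

    # Sketch of read-time schema merging in Spark; "/data/events" is a
    # placeholder path. A later writer adds a nullable column, and
    # mergeSchema=true reconciles the per-file schemas when reading.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-evolution").getOrCreate()

    spark.createDataFrame([(1, "MOVIE")], ["id", "type"]) \
        .write.mode("append").parquet("/data/events")

    spark.createDataFrame([(2, "GAMES", "ASP123")], ["id", "type", "user"]) \
        .write.mode("append").parquet("/data/events")

    merged = spark.read.option("mergeSchema", "true").parquet("/data/events")
    merged.printSchema()  # rows from the first write surface "user" as null

Note that mergeSchema only reconciles compatible additions; renaming a column or changing its type across files is not handled this way.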