avro

Spark Python Avro Kafka Deserialiser

蓝咒 submitted on 2019-11-29 08:18:24
I have created a Kafka stream in a Python Spark app and can parse any text that comes through it:

    kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})

I want to change this to be able to parse Avro messages from a Kafka topic. When parsing Avro messages from a file, I do it like:

    reader = DataFileReader(open("customer.avro", "r"), DatumReader())

I'm new to Python and Spark: how do I change the stream to be able to parse the Avro messages? And how can I specify a schema to use when reading the Avro messages from Kafka? I've done all this in Java before.
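One common approach is to pass a custom valueDecoder to createStream and decode each payload with the avro package. A minimal sketch, assuming the messages are plain Avro binary (no Confluent wire format) and that the writer's schema sits in a local customer.avsc file (a hypothetical name):

    import io
    import avro.schema
    import avro.io
    from pyspark.streaming.kafka import KafkaUtils

    # Load the writer's schema from a local file (hypothetical path).
    schema = avro.schema.parse(open("customer.avsc").read())

    def avro_decoder(raw_bytes):
        # Turn one Avro-encoded Kafka payload into a Python dict.
        decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
        return avro.io.DatumReader(schema).read(decoder)

    kafkaStream = KafkaUtils.createStream(
        ssc, zkQuorum, "spark-streaming-consumer", {topic: 1},
        valueDecoder=avro_decoder)

This also answers the schema question: DatumReader takes the writer's schema (and optionally a reader's schema for resolution), so whichever schema you hand it is the one used for decoding.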

Avro Schema to Spark StructType

浪子不回头ぞ submitted on 2019-11-29 07:55:45
This is effectively the same as my previous question, but using Avro rather than JSON as the data format. I'm working with a Spark DataFrame which could be loading data from one of a few different schema versions:

    // Version One
    {"namespace": "com.example.avro",
     "type": "record",
     "name": "MeObject",
     "fields": [
         {"name": "A", "type": ["null", "int"], "default": null}
     ]
    }

    // Version Two
    {"namespace": "com.example.avro",
     "type": "record",
     "name": "MeObject",
     "fields": [
         {"name": "A", "type": ["null", "int"], "default": null},
         {"name": "B", "type": ["null", "int"], "default": null}
     ]
    }

I'm using …
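For reference, the usual way to turn an Avro schema into a Spark StructType is the SchemaConverters helper that ships with the spark-avro package. A sketch, assuming the schema above is saved as MeObject.avsc (a hypothetical file name):

    import org.apache.avro.Schema
    import com.databricks.spark.avro.SchemaConverters
    import org.apache.spark.sql.types.StructType

    // Parse the .avsc file into an Avro Schema object.
    val avroSchema = new Schema.Parser().parse(new java.io.File("MeObject.avsc"))

    // Convert it; for a record schema the resulting dataType is a StructType.
    val structType = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

Because both fields are ["null", "int"] unions with defaults, the two schema versions resolve cleanly against each other, which is what makes loading mixed versions into one DataFrame workable.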

How to Generate Parquet File Using Pure Java (Including Date & Decimal Types) And Upload to S3 [Windows] (No HDFS)

强颜欢笑 submitted on 2019-11-29 05:18:02
I recently had a requirement where I needed to generate Parquet files that could be read by Apache Spark using only Java (with no additional software installations such as Apache Drill, Hive, Spark, etc.). The files needed to be saved to S3, so I will be sharing details on how to do both. There were no simple-to-follow guides on how to do this, and I'm also not a Java programmer, so the concepts of using Maven, Hadoop, etc. were all foreign to me. It took me nearly two weeks to get this working, and I'd like to share my personal guide below on how I achieved it. Disclaimer: the code samples below …
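The core of the technique is the parquet-avro module, which can write Parquet to the local filesystem without a running Hadoop/HDFS installation (on Windows the Hadoop client libraries still look for winutils.exe). A minimal sketch with simple column types only; date and decimal columns additionally need Avro logical-type annotations, which are omitted here:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical two-column schema.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"int\"},"
              + "{\"name\":\"name\",\"type\":\"string\"}]}");

            GenericRecord row = new GenericData.Record(schema);
            row.put("id", 1);
            row.put("name", "example");

            // Writes a Spark-readable Parquet file to the local filesystem.
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("data.parquet"))
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .build()) {
                writer.write(row);
            }
        }
    }

The S3 upload is then an ordinary AWS SDK call (for example AmazonS3#putObject with the generated file), independent of the Parquet step.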

Encode an object with Avro to a byte array in Python

冷暖自知 submitted on 2019-11-29 01:43:40
In Python 2.7, using Avro, I'd like to encode an object to a byte array. All the examples I've found write to a file. I've tried using io.BytesIO(), but this gives:

    AttributeError: '_io.BytesIO' object has no attribute 'write_long'

Sample using io.BytesIO:

    def avro_encode(raw, schema):
        writer = DatumWriter(schema)
        avro_buffer = io.BytesIO()
        writer.write(raw, avro_buffer)
        return avro_buffer.getvalue()

Your question helped me figure things out, so thanks. Here's a simple Python example based on the Python example in the docs:

    import io
    import avro.schema
    import avro.io

    test_schema = '''{"namespace": …
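The error in the snippet above comes from handing DatumWriter.write the raw BytesIO object: it expects an Avro encoder, not a file-like object. Wrapping the buffer in avro.io.BinaryEncoder is enough; a minimal sketch:

    import io
    import avro.io

    def avro_encode(raw, schema):
        # DatumWriter.write takes (datum, encoder), not (datum, file object).
        writer = avro.io.DatumWriter(schema)
        buffer = io.BytesIO()
        encoder = avro.io.BinaryEncoder(buffer)
        writer.write(raw, encoder)
        return buffer.getvalue()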

Using Apache Avro reflect

蓝咒 submitted on 2019-11-29 01:18:56
Avro serialization is popular with Hadoop users, but examples are hard to find. Can anyone help me with this sample code? I'm mostly interested in using the Reflect API to read/write into files and in using the Union and Null annotations.

    public class Reflect {

        public class Packet {
            int cost;
            @Nullable TimeStamp stamp;
            public Packet(int cost, TimeStamp stamp) {
                this.cost = cost;
                this.stamp = stamp;
            }
        }

        public class TimeStamp {
            int hour = 0;
            int second = 0;
            public TimeStamp(int hour, int second) {
                this.hour = hour;
                this.second = second;
            }
        }

        public static void main(String[] args) throws …
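A sketch of the usual reflect round trip for classes like these, with two assumptions worth flagging: the inner classes are made static (reflection cannot instantiate non-static inner classes from main), and each keeps a no-arg constructor so the reader can create instances:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.avro.reflect.ReflectDatumReader;
    import org.apache.avro.reflect.ReflectDatumWriter;

    public class ReflectDemo {
        public static void main(String[] args) throws Exception {
            // Schema is derived from the class itself;
            // @Nullable fields become ["null", ...] unions.
            Schema schema = ReflectData.get().getSchema(Packet.class);

            File file = new File("packets.avro");
            DataFileWriter<Packet> writer =
                new DataFileWriter<>(new ReflectDatumWriter<>(Packet.class));
            writer.create(schema, file);
            writer.append(new Packet(10, new TimeStamp(14, 30)));
            writer.close();

            DataFileReader<Packet> reader =
                new DataFileReader<>(file, new ReflectDatumReader<>(Packet.class));
            while (reader.hasNext()) {
                System.out.println(reader.next().cost);
            }
            reader.close();
        }
    }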

How to define an Avro schema for a complex JSON document?

 ̄綄美尐妖づ submitted on 2019-11-28 23:57:31
I have a JSON document that I would like to convert to Avro, and need a schema to be specified for that purpose. Here is the JSON document for which I would like to define the Avro schema:

    {
        "uid": 29153333,
        "somefield": "somevalue",
        "options": [
            {
                "item1_lvl2": "a",
                "item2_lvl2": [
                    {"item1_lvl3": "x1", "item2_lvl3": "y1"},
                    {"item1_lvl3": "x2", "item2_lvl3": "y2"}
                ]
            }
        ]
    }

I'm able to define the schema for the non-complex types, but not for the complex "options" field:

    {
        "namespace": "my.com.ns",
        "type": "record",
        "fields": [
            {"name": "uid", "type": "int"},
            {"name": "somefield", "type": …
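For reference, "options" can be modeled as an array of records whose item2_lvl2 field is itself an array of records. A sketch completing the schema above (the record names are placeholders, and note that a top-level "name" is required for a record):

    {
        "namespace": "my.com.ns",
        "type": "record",
        "name": "MyRecord",
        "fields": [
            {"name": "uid", "type": "int"},
            {"name": "somefield", "type": "string"},
            {"name": "options", "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "OptionLvl2",
                    "fields": [
                        {"name": "item1_lvl2", "type": "string"},
                        {"name": "item2_lvl2", "type": {
                            "type": "array",
                            "items": {
                                "type": "record",
                                "name": "OptionLvl3",
                                "fields": [
                                    {"name": "item1_lvl3", "type": "string"},
                                    {"name": "item2_lvl3", "type": "string"}
                                ]
                            }
                        }}
                    ]
                }
            }}
        ]
    }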

Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)

↘锁芯ラ submitted on 2019-11-28 23:53:49
I have a Spark 2.0 application that reads messages from Kafka using Spark Streaming (with spark-streaming-kafka-0-10_2.11). Structured streaming looks really cool, so I wanted to try migrating the code, but I can't figure out how to use it. In regular streaming I used KafkaUtils to create the DStream, and in the parameters I passed the value deserializer. In structured streaming the docs say I should deserialize using DataFrame functions, but I can't figure out exactly what that means. I looked at examples such as this example, but my Avro object in Kafka is quite complex and cannot be …
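In structured streaming the Kafka source just hands you a binary "value" column, so on Spark 2.0.x (which predates built-in Avro functions) "deserializing with DataFrame functions" in practice means wrapping your own Avro decoding in a UDF. A rough sketch, assuming generic records and a schema file named message.avsc (hypothetical), pulling one field out as a string:

    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory
    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    val schemaString = new String(java.nio.file.Files.readAllBytes(
      java.nio.file.Paths.get("message.avsc")))

    // Decode one Avro payload; the Schema is parsed inside the closure so
    // the UDF stays serializable.
    val decodeField = udf { (bytes: Array[Byte]) =>
      val schema = new Schema.Parser().parse(schemaString)
      val reader = new GenericDatumReader[GenericRecord](schema)
      val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
      reader.read(null, decoder).get("somefield").toString
    }

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "mytopic")
      .load()
      .select(decodeField($"value").as("somefield"))

For a deeply nested object you would either return several columns from several such UDFs, or map the bytes to a case class and work with a typed Dataset instead.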

How to Avro binary encode a JSON string using Apache Avro?

梦想的初衷 submitted on 2019-11-28 22:05:19
I am trying to Avro binary encode my JSON string. Below is my JSON string, and I have created a simple method which will do the conversion, but I am not sure whether the way I am doing it is correct or not:

    public static void main(String args[]) throws Exception {
        try {
            Schema schema = new Parser().parse(
                TestExample.class.getResourceAsStream("/3233.avsc"));
            String json = "{"
                + " \"location\" : {"
                + " \"devices\":["
                + " {"
                + " \"did\":\"9abd09-439bcd-629a8f\","
                + " \"dt\":\"browser\","
                + " \"usl\":{"
                + " \"pos\":{"
                + " \"source\":\"GPS\","
                + " \"lat\":90.0,"
                + " \"long\":101.0,"
                + " \"acc\":100"
                + " },"
                + " \ …
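The standard recipe for this conversion is Avro's own JsonDecoder: parse the JSON against the schema into a GenericRecord, then re-serialize it with a BinaryEncoder. A sketch (note the JSON has to follow Avro's JSON encoding, where union values are wrapped in a {"type": value} object):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public static byte[] jsonToAvro(String json, Schema schema) throws Exception {
        // Parse the JSON text against the schema...
        GenericRecord record = new GenericDatumReader<GenericRecord>(schema)
            .read(null, DecoderFactory.get().jsonDecoder(schema, json));

        // ...then re-encode the record as Avro binary.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }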

How to iterate over records in Spark Scala?

让人想犯罪 __ submitted on 2019-11-28 20:57:55
I have a variable "myrdd" that is an Avro file with 10 records, loaded through hadoopFile. When I do

    myrdd.first._1.datum.getName()

I can get the name. The problem is that I have 10 records in "myrdd". When I do:

    myrdd.map(x => {println(x._1.datum.getName())})

it does not work, and prints out a weird object a single time. How can I iterate over all records?

Here is a log from a session using spark-shell with a similar scenario. Given

    scala> persons
    res8: org.apache.spark.sql.DataFrame = [name: string, age: int]

    scala> persons.first
    res7: org.apache.spark.sql.Row = [Justin,19]

Your issue looks like

    scala> …
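For what it's worth, the underlying issue is that map is a lazy transformation: the closure isn't executed when the line runs in the shell; it just builds a new RDD, and the "weird object" printed once is that RDD's toString. An action is needed to force evaluation. A sketch:

    // map only describes the computation; nothing runs yet.
    val names = myrdd.map(x => x._1.datum.getName())

    // Action: bring the results to the driver and print them there.
    names.collect().foreach(println)

    // Or print on the executors (output lands in the executors' stdout).
    names.foreach(println)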

How to encode/decode Kafka messages using Avro binary encoder?

妖精的绣舞 submitted on 2019-11-28 16:29:58
Question: I'm trying to use Avro for messages being read from/written to Kafka. Does anyone have an example of using the Avro binary encoder to encode/decode data that will be put on a message queue? I need the Avro part more than the Kafka part. Or perhaps I should look at a different solution? Basically, I'm trying to find a solution that is more space-efficient than JSON; Avro was mentioned since it can be more compact than JSON.

Answer 1: This is a basic example. I have not tried it with …
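A minimal round trip with the binary encoder looks like the following sketch (the one-field schema is made up; the byte array in the middle is what you would hand to the Kafka producer, and the consumer needs the same or a compatible schema to decode):

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class AvroRoundTrip {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Msg\",\"fields\":["
              + "{\"name\":\"body\",\"type\":\"string\"}]}");

            GenericRecord msg = new GenericData.Record(schema);
            msg.put("body", "hello kafka");

            // Encode the record to a byte[] (no file container, just the datum).
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(msg, encoder);
            encoder.flush();
            byte[] bytes = out.toByteArray();

            // Decode it back.
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
            GenericRecord decoded =
                new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
            System.out.println(decoded.get("body"));
        }
    }

On the space question: because the schema travels out of band, the binary encoding carries only field values, which is where the size win over JSON comes from.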