avro

Spark Python Avro Kafka Deserialiser

Submitted by 你说的曾经没有我的故事 on 2019-11-28 01:40:07
Question: I have created a Kafka stream in a Python Spark app and can parse any text that comes through it. kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1}) I want to change this to be able to parse Avro messages from a Kafka topic. When parsing Avro messages from a file, I do it like: reader = DataFileReader(open("customer.avro", "r"), DatumReader()) I'm new to Python and Spark; how do I change the stream to be able to parse the Avro message? Also, how can I …
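
A minimal sketch of one way to do this, assuming each Kafka message value is a single Avro datum in raw binary form (no data-file container header) and that the writer schema is available locally. The file name customer.avsc, and the reuse of ssc, zkQuorum and topic from the snippet above, are assumptions for illustration:

import io

import avro.schema
from avro.io import DatumReader, BinaryDecoder
from pyspark.streaming.kafka import KafkaUtils

# "customer.avsc" is a hypothetical schema file; point this at your actual schema.
# avro-python3 spells this avro.schema.Parse; the older avro package uses parse.
schema = avro.schema.parse(open("customer.avsc").read())
reader = DatumReader(schema)

def avro_decoder(raw_bytes):
    # Kafka hands each message value in as bytes; None can occur for tombstones.
    if raw_bytes is None:
        return None
    return reader.read(BinaryDecoder(io.BytesIO(raw_bytes)))

kafkaStream = KafkaUtils.createStream(
    ssc, zkQuorum, "spark-streaming-consumer", {topic: 1},
    valueDecoder=avro_decoder)
kafkaStream.map(lambda kv: kv[1]).pprint()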

Kafka Avro Consumer with Decoder issues

Submitted by 有些话、适合烂在心里 on 2019-11-28 00:02:31
When I attempted to run a Kafka consumer with Avro over the data with my respective schema, it returns an error of "AvroRuntimeException: Malformed data. Length is negative: -40". I see others have had similar issues converting a byte array to JSON, with Avro write and read, and with the Kafka Avro Binary *coder. I have also referenced this Consumer Group Example, all of which have been helpful, but none has helped with this error so far. It works up until this part of the code (line 73): Decoder decoder = DecoderFactory.get().binaryDecoder(byteArrayInputStream, null); I have tried other decoders and printed out the …
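
One common cause of "Length is negative" (an assumption here, since the excerpt does not show how the messages were produced) is that the bytes handed to the binary decoder are not plain Avro: messages written with Confluent's Avro serializer, for example, carry a five-byte prefix (a zero magic byte followed by a big-endian 4-byte schema id) that must be stripped first. A hypothetical Python sketch of that check, shown only to illustrate the framing:

import io
import struct

from avro.io import DatumReader, BinaryDecoder

def decode_message(raw_bytes, schema):
    # Hypothetical helper, not the code from the question.
    payload = raw_bytes
    if len(raw_bytes) > 5 and raw_bytes[0:1] == b"\x00":
        # Confluent wire format: magic byte 0x00, then a 4-byte schema id that a
        # real consumer would look up in the schema registry.
        schema_id = struct.unpack(">I", raw_bytes[1:5])[0]
        payload = raw_bytes[5:]
    return DatumReader(schema).read(BinaryDecoder(io.BytesIO(payload)))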

Apache Avro: map uses CharSequence as key

Submitted by 寵の児 on 2019-11-27 21:37:00
I am using Apache Avro. My schema has a map type: {"name": "MyData", "type" : {"type": "map", "values":{ "type": "record", "name": "Person", "fields":[ {"name": "name", "type": "string"}, {"name": "age", "type": "int"} ] } } } After compiling the schema, the generated Java class uses CharSequence as the key for the Map MyData. It is very inconvenient to use CharSequence as a Map key; is there a way to generate a String-typed key for the Map in Apache Avro? P.S. The problem is that, for example, dataMap.containsKey("SOME_KEY") returns false even though there is such a key there, just because it is …

What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?

Submitted by ≡放荡痞女 on 2019-11-27 19:41:56
Question: All of these provide binary serialization, RPC frameworks and an IDL. I'm interested in the key differences between them and their characteristics (performance, ease of use, programming-language support). If you know of any other similar technologies, please mention them in an answer. Answer 1: ASN.1 is an ISO/IEC standard. It has a very readable source language and a variety of back-ends, both binary and human-readable. Being an international standard (and an old one at that!) the source language is a bit kitchen …

Using apache avro reflect

Submitted by 不想你离开。 on 2019-11-27 15:49:29
Question: Avro serialization is popular with Hadoop users, but examples are so hard to find. Can anyone help me with this sample code? I'm mostly interested in using the Reflect API to read/write into files and in using the Union and Null annotations. public class Reflect { public class Packet { int cost; @Nullable TimeStamp stamp; public Packet(int cost, TimeStamp stamp){ this.cost = cost; this.stamp = stamp; } } public class TimeStamp { int hour = 0; int second = 0; public TimeStamp(int hour, int second …

How to define avro schema for complex json document?

Submitted by ぃ、小莉子 on 2019-11-27 15:09:53
Question: I have a JSON document that I would like to convert to Avro, and I need a schema to be specified for that purpose. Here is the JSON document for which I would like to define the Avro schema: { "uid": 29153333, "somefield": "somevalue", "options": [ { "item1_lvl2": "a", "item2_lvl2": [ { "item1_lvl3": "x1", "item2_lvl3": "y1" }, { "item1_lvl3": "x2", "item2_lvl3": "y2" } ] } ] } I'm able to define the schema for the non-complex types but not for the complex "options" field: { "namespace" : "my …
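
A sketch of a schema that fits the sample document above: "options" becomes an array of records, and "item2_lvl2" a nested array of records. The namespace and the record names (Document, OptionLvl2, OptionLvl3) are invented for illustration and should be replaced with your own:

{
  "namespace": "my.example",
  "type": "record",
  "name": "Document",
  "fields": [
    {"name": "uid", "type": "int"},
    {"name": "somefield", "type": "string"},
    {"name": "options", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "OptionLvl2",
        "fields": [
          {"name": "item1_lvl2", "type": "string"},
          {"name": "item2_lvl2", "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "OptionLvl3",
              "fields": [
                {"name": "item1_lvl3", "type": "string"},
                {"name": "item2_lvl3", "type": "string"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}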

Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)

Submitted by 冷暖自知 on 2019-11-27 15:07:09
Question: I have a Spark 2.0 application that reads messages from Kafka using Spark Streaming (with spark-streaming-kafka-0-10_2.11). Structured Streaming looks really cool, so I wanted to try to migrate the code, but I can't figure out how to use it. In regular streaming I used KafkaUtils to create a DStream, and among the parameters I passed was the value deserializer. In Structured Streaming the docs say that I should deserialize using DataFrame functions, but I can't figure out exactly what that …
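
One pattern that works here (a sketch, not the asker's code, and written in Python rather than the Scala suggested by the artifact name) is to read the Kafka value column as binary and decode it in a UDF. It assumes each message value is a single Avro datum, that the writer schema is in a local file event.avsc (a made-up name), and that the record contains only JSON-friendly field types:

import io
import json

import avro.schema
from avro.io import DatumReader, BinaryDecoder
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("avro-structured-stream").getOrCreate()

# "event.avsc" is a hypothetical schema file name.
schema = avro.schema.parse(open("event.avsc").read())
reader = DatumReader(schema)

def decode(value):
    if value is None:
        return None
    record = reader.read(BinaryDecoder(io.BytesIO(bytes(value))))
    return json.dumps(record)

decode_udf = udf(decode, StringType())

# Requires the spark-sql-kafka-0-10 package; broker and topic names are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "mytopic")
      .load()
      .select(decode_udf("value").alias("json_value")))

From Spark 2.4 onwards the built-in spark-avro module also offers a from_avro column function, which avoids the round trip through a Python UDF.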

How to read Avro file in PySpark

Submitted by ﹥>﹥吖頭↗ on 2019-11-27 15:01:23
Question: I am writing a Spark job using Python. However, I need to read in a whole bunch of Avro files. This is the closest solution that I have found in Spark's examples folder. However, you need to submit this Python script using spark-submit. On the command line of spark-submit you can specify the driver class, in which case all your AvroKey and AvroValue classes can be located. avro_rdd = sc.newAPIHadoopFile( path, "org.apache.avro.mapreduce.AvroKeyInputFormat", "org.apache.avro.mapred.AvroKey", "org …
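
If pulling in an extra package is acceptable, the spark-avro data source is a simpler route than newAPIHadoopFile and its converter classes. A sketch with made-up paths; the package coordinates must match your Spark and Scala versions:

# Launch with the package on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 my_job.py
# (from Spark 2.4 the reader ships as org.apache.spark:spark-avro_2.11:<spark version>
#  and is addressed simply as format("avro")).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-avro").getOrCreate()

# "/path/to/avro/dir" is a placeholder for the directory holding the Avro files.
df = spark.read.format("com.databricks.spark.avro").load("/path/to/avro/dir")
df.printSchema()
records = df.rdd  # drop back to an RDD if the rest of the job expects one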

How to Avro Binary encode the JSON String using Apache Avro?

Submitted by 佐手、 on 2019-11-27 14:13:10
Question: I am trying to Avro-binary-encode my JSON string. Below is my JSON string, and I have created a simple method which will do the conversion, but I am not sure whether the way I am doing it is correct or not. public static void main(String args[]) throws Exception{ try{ Schema schema = new Parser().parse((TestExample.class.getResourceAsStream("/3233.avsc"))); String json="{"+ " \"location\" : {"+ " \"devices\":["+ " {"+ " \"did\":\"9abd09-439bcd-629a8f\","+ " \"dt\":\"browser\","+ " \"usl\":{"+ " \ …
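
Not a review of the Java method itself, but the same operation expressed in Python makes the two required steps explicit: parse the JSON string into a datum, then write that datum with a binary encoder against the schema. A sketch that assumes the JSON matches the records declared in 3233.avsc; the placeholder string below stands in for the full document from the question:

import io
import json

import avro.schema
from avro.io import DatumWriter, BinaryEncoder

schema = avro.schema.parse(open("3233.avsc").read())

# Placeholder; substitute the full JSON document from the question.
json_string = '{"location": {"devices": []}}'

datum = json.loads(json_string)           # JSON text -> Python dict
buf = io.BytesIO()
DatumWriter(schema).write(datum, BinaryEncoder(buf))  # raises if the datum does not fit the schema
avro_bytes = buf.getvalue()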

How to iterate records spark scala?

Submitted by 柔情痞子 on 2019-11-27 13:20:17
Question: I have a variable "myrdd" that is an Avro file with 10 records loaded through hadoopFile. When I do myrdd.first._1.datum.getName() I can get the name. The problem is, I have 10 records in "myrdd". When I do: myrdd.map(x => {println(x._1.datum.getName())}) it does not work and prints out a weird object a single time. How can I iterate over all records? Answer 1: Here is a log from a session using spark-shell with a similar scenario. Given scala> persons res8: org.apache.spark.sql.DataFrame = [name: …