avro

Spark Python Avro Kafka Deserialiser

Submitted by 你说的曾经没有我的故事 on 2019-11-28 01:40:07
Question: I have created a Kafka stream in a Python Spark app and can parse any text that comes through it. kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1}) I want to change this to be able to parse Avro messages from a Kafka topic. When parsing Avro messages from a file, I do it like: reader = DataFileReader(open("customer.avro", "r"), DatumReader()) I'm new to Python and Spark; how do I change the stream to be able to parse the Avro message? Also, how can I …
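
A minimal sketch of one way to do this, assuming each Kafka message value is a single Avro datum in raw binary form (no data-file container header) and that the writer schema is available locally. The file name customer.avsc, and the reuse of ssc, zkQuorum and topic from the snippet above, are assumptions for illustration:

import io

import avro.schema
from avro.io import DatumReader, BinaryDecoder
from pyspark.streaming.kafka import KafkaUtils

# "customer.avsc" is a hypothetical schema file; point this at your actual schema.
# avro-python3 spells this avro.schema.Parse; the older avro package uses parse.
schema = avro.schema.parse(open("customer.avsc").read())
reader = DatumReader(schema)

def avro_decoder(raw_bytes):
    # Kafka hands each message value in as bytes; None can occur for tombstones.
    if raw_bytes is None:
        return None
    return reader.read(BinaryDecoder(io.BytesIO(raw_bytes)))

kafkaStream = KafkaUtils.createStream(
    ssc, zkQuorum, "spark-streaming-consumer", {topic: 1},
    valueDecoder=avro_decoder)
kafkaStream.map(lambda kv: kv[1]).pprint()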

Kafka Avro Consumer with Decoder issues

Submitted by 有些话、适合烂在心里 on 2019-11-28 00:02:31
When I attempted to run a Kafka consumer with Avro over the data with my respective schema, it returns an error of "AvroRuntimeException: Malformed data. Length is negative: -40". I see others have had similar issues converting a byte array to JSON, with Avro write and read, and with the Kafka Avro Binary *coder. I have also referenced this Consumer Group Example, all of which have been helpful, but none has helped with this error so far. It works up until this part of the code (line 73): Decoder decoder = DecoderFactory.get().binaryDecoder(byteArrayInputStream, null); I have tried other decoders and printed out the …
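
One common cause of "Length is negative" (an assumption here, since the excerpt does not show how the messages were produced) is that the bytes handed to the binary decoder are not plain Avro: messages written with Confluent's Avro serializer, for example, carry a five-byte prefix (a zero magic byte followed by a big-endian 4-byte schema id) that must be stripped first. A hypothetical Python sketch of that check, shown only to illustrate the framing:

import io
import struct

from avro.io import DatumReader, BinaryDecoder

def decode_message(raw_bytes, schema):
    # Hypothetical helper, not the code from the question.
    payload = raw_bytes
    if len(raw_bytes) > 5 and raw_bytes[0:1] == b"\x00":
        # Confluent wire format: magic byte 0x00, then a 4-byte schema id that a
        # real consumer would look up in the schema registry.
        schema_id = struct.unpack(">I", raw_bytes[1:5])[0]
        payload = raw_bytes[5:]
    return DatumReader(schema).read(BinaryDecoder(io.BytesIO(payload)))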

Apache Avro: map uses CharSequence as key

Submitted by 寵の児 on 2019-11-27 21:37:00
I am using Apache Avro. My schema has a map type: {"name": "MyData", "type" : {"type": "map", "values":{ "type": "record", "name": "Person", "fields":[ {"name": "name", "type": "string"}, {"name": "age", "type": "int"} ] } } } After compiling the schema, the generated Java class uses CharSequence as the key for the Map MyData. It is very inconvenient to use CharSequence as a Map key; is there a way to generate a String-typed key for the Map in Apache Avro? P.S. The problem is that, for example, dataMap.containsKey("SOME_KEY") returns false even though there is such a key there, just because it is …

What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?

Submitted by ≡放荡痞女 on 2019-11-27 19:41:56
Question: All of these provide binary serialization, RPC frameworks and an IDL. I'm interested in the key differences between them and their characteristics (performance, ease of use, programming-language support). If you know of any other similar technologies, please mention them in an answer. Answer 1: ASN.1 is an ISO/IEC standard. It has a very readable source language and a variety of back-ends, both binary and human-readable. Being an international standard (and an old one at that!) the source language is a bit kitchen …

Using apache avro reflect

Submitted by 不想你离开。 on 2019-11-27 15:49:29
Question: Avro serialization is popular with Hadoop users, but examples are so hard to find. Can anyone help me with this sample code? I'm mostly interested in using the Reflect API to read/write into files and in using the Union and Null annotations. public class Reflect { public class Packet { int cost; @Nullable TimeStamp stamp; public Packet(int cost, TimeStamp stamp){ this.cost = cost; this.stamp = stamp; } } public class TimeStamp { int hour = 0; int second = 0; public TimeStamp(int hour, int second …

How to define avro schema for complex json document?

Submitted by ぃ、小莉子 on 2019-11-27 15:09:53
Question: I have a JSON document that I would like to convert to Avro, and I need a schema to be specified for that purpose. Here is the JSON document for which I would like to define the Avro schema: { "uid": 29153333, "somefield": "somevalue", "options": [ { "item1_lvl2": "a", "item2_lvl2": [ { "item1_lvl3": "x1", "item2_lvl3": "y1" }, { "item1_lvl3": "x2", "item2_lvl3": "y2" } ] } ] } I'm able to define the schema for the non-complex types but not for the complex "options" field: { "namespace" : "my …
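
A sketch of a schema that fits the sample document above: "options" becomes an array of records, and "item2_lvl2" a nested array of records. The namespace and the record names (Document, OptionLvl2, OptionLvl3) are invented for illustration and should be replaced with your own:

{
  "namespace": "my.example",
  "type": "record",
  "name": "Document",
  "fields": [
    {"name": "uid", "type": "int"},
    {"name": "somefield", "type": "string"},
    {"name": "options", "type": {
      "type": "array",
      "items": {
        "type": "record",
        "name": "OptionLvl2",
        "fields": [
          {"name": "item1_lvl2", "type": "string"},
          {"name": "item2_lvl2", "type": {
            "type": "array",
            "items": {
              "type": "record",
              "name": "OptionLvl3",
              "fields": [
                {"name": "item1_lvl3", "type": "string"},
                {"name": "item2_lvl3", "type": "string"}
              ]
            }
          }}
        ]
      }
    }}
  ]
}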

Reading Avro messages from Kafka with Spark 2.0.2 (structured streaming)

Submitted by 冷暖自知 on 2019-11-27 15:07:09
Question: I have a Spark 2.0 application that reads messages from Kafka using Spark Streaming (with spark-streaming-kafka-0-10_2.11). Structured Streaming looks really cool, so I wanted to try to migrate the code, but I can't figure out how to use it. In regular streaming I used KafkaUtils to create a DStream, and among the parameters I passed was the value deserializer. In Structured Streaming the docs say that I should deserialize using DataFrame functions, but I can't figure out exactly what that …
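
One pattern that works here (a sketch, not the asker's code, and written in Python rather than the Scala suggested by the artifact name) is to read the Kafka value column as binary and decode it in a UDF. It assumes each message value is a single Avro datum, that the writer schema is in a local file event.avsc (a made-up name), and that the record contains only JSON-friendly field types:

import io
import json

import avro.schema
from avro.io import DatumReader, BinaryDecoder
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("avro-structured-stream").getOrCreate()

# "event.avsc" is a hypothetical schema file name.
schema = avro.schema.parse(open("event.avsc").read())
reader = DatumReader(schema)

def decode(value):
    if value is None:
        return None
    record = reader.read(BinaryDecoder(io.BytesIO(bytes(value))))
    return json.dumps(record)

decode_udf = udf(decode, StringType())

# Requires the spark-sql-kafka-0-10 package; broker and topic names are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "mytopic")
      .load()
      .select(decode_udf("value").alias("json_value")))

From Spark 2.4 onwards the built-in spark-avro module also offers a from_avro column function, which avoids the round trip through a Python UDF.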

How to read Avro file in PySpark

Submitted by ﹥>﹥吖頭↗ on 2019-11-27 15:01:23
Question: I am writing a Spark job using Python. However, I need to read in a whole bunch of Avro files. This is the closest solution that I have found in Spark's examples folder. However, you need to submit this Python script using spark-submit. On the command line of spark-submit you can specify the driver class, in which case all your AvroKey and AvroValue classes can be located. avro_rdd = sc.newAPIHadoopFile( path, "org.apache.avro.mapreduce.AvroKeyInputFormat", "org.apache.avro.mapred.AvroKey", "org …
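
If pulling in an extra package is acceptable, the spark-avro data source is a simpler route than newAPIHadoopFile and its converter classes. A sketch with made-up paths; the package coordinates must match your Spark and Scala versions:

# Launch with the package on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 my_job.py
# (from Spark 2.4 the reader ships as org.apache.spark:spark-avro_2.11:<spark version>
#  and is addressed simply as format("avro")).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-avro").getOrCreate()

# "/path/to/avro/dir" is a placeholder for the directory holding the Avro files.
df = spark.read.format("com.databricks.spark.avro").load("/path/to/avro/dir")
df.printSchema()
records = df.rdd  # drop back to an RDD if the rest of the job expects one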

How to Avro Binary encode the JSON String using Apache Avro?

Submitted by 佐手、 on 2019-11-27 14:13:10
Question: I am trying to Avro-binary-encode my JSON string. Below is my JSON string, and I have created a simple method which will do the conversion, but I am not sure whether the way I am doing it is correct or not. public static void main(String args[]) throws Exception{ try{ Schema schema = new Parser().parse((TestExample.class.getResourceAsStream("/3233.avsc"))); String json="{"+ " \"location\" : {"+ " \"devices\":["+ " {"+ " \"did\":\"9abd09-439bcd-629a8f\","+ " \"dt\":\"browser\","+ " \"usl\":{"+ " \ …
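
Not a review of the Java method itself, but the same operation expressed in Python makes the two required steps explicit: parse the JSON string into a datum, then write that datum with a binary encoder against the schema. A sketch that assumes the JSON matches the records declared in 3233.avsc; the placeholder string below stands in for the full document from the question:

import io
import json

import avro.schema
from avro.io import DatumWriter, BinaryEncoder

schema = avro.schema.parse(open("3233.avsc").read())

# Placeholder; substitute the full JSON document from the question.
json_string = '{"location": {"devices": []}}'

datum = json.loads(json_string)           # JSON text -> Python dict
buf = io.BytesIO()
DatumWriter(schema).write(datum, BinaryEncoder(buf))  # raises if the datum does not fit the schema
avro_bytes = buf.getvalue()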

How to iterate records spark scala?

Submitted by 柔情痞子 on 2019-11-27 13:20:17
Question: I have a variable "myrdd" that is an Avro file with 10 records loaded through hadoopFile. When I do myrdd.first._1.datum.getName() I can get the name. The problem is, I have 10 records in "myrdd". When I do: myrdd.map(x => {println(x._1.datum.getName())}) it does not work and prints out a weird object a single time. How can I iterate over all records? Answer 1: Here is a log from a session using spark-shell with a similar scenario. Given scala> persons res8: org.apache.spark.sql.DataFrame = [name: …