avro

Spark Dataframe write to kafka topic in avro format?

橙三吉。 submitted on 2019-11-30 10:47:35
I have a DataFrame in Spark, eventDF, that looks like this:

Sno|UserID|TypeExp
1|JAS123|MOVIE
2|ASP123|GAMES
3|JAS123|CLOTHING
4|DPS123|MOVIE
5|DPS123|CLOTHING
6|ASP123|MEDICAL
7|JAS123|OTH
8|POQ133|MEDICAL
.......
10000|DPS123|OTH

I need to write it to a Kafka topic in Avro format. Currently I am able to write to Kafka as JSON using the following code:

val kafkaUserDF: DataFrame = eventDF.select(to_json(struct(eventDF.columns.map(column):_*)).alias("value"))
kafkaUserDF.selectExpr("CAST(value AS STRING)").write.format("kafka")
  .option("kafka.bootstrap.servers", "Host:port")
  .option("topic", "eventdf")
  .save(
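One way to produce Avro instead of JSON, assuming Spark 2.4 or newer with the external spark-avro package on the classpath (it provides a to_avro column function analogous to to_json), is sketched below; the column handling and Kafka options mirror the JSON version above.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.avro.to_avro          // in Spark 3.x: org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{col, struct}

// Serialize the whole row into a single Avro-encoded binary "value" column.
val kafkaAvroDF: DataFrame = eventDF.select(to_avro(struct(eventDF.columns.map(col): _*)).alias("value"))

kafkaAvroDF.write.format("kafka")
  .option("kafka.bootstrap.servers", "Host:port")
  .option("topic", "eventdf")
  .save()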

How to extract schema for avro file in python

落爺英雄遲暮 submitted on 2019-11-30 08:53:06
I am trying to use the Python Avro library ( https://pypi.python.org/pypi/avro ) to read an Avro file generated by Java. Since the schema is already embedded in the Avro file, why do I need to specify a schema file? Is there a way to extract it automatically? I found another package called fastavro ( https://pypi.python.org/pypi/fastavro ) that can extract the Avro schema. Is having to manually specify the schema file in the Python avro package by design? Thank you very much. I use Python 3.4 and Avro package 1.7.7. For the schema file use:

reader = avro.datafile.DataFileReader(open('file_name.avro',"rb"),avro.io
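For comparison, since the file in question was written by Java, the embedded writer schema can also be pulled out on the JVM side without any external .avsc file; a minimal sketch using the Avro Java API (here from Scala), with the file name taken from the question:

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// The container file carries its writer schema in the header, so the reader needs no schema file.
val reader = new DataFileReader[GenericRecord](new File("file_name.avro"), new GenericDatumReader[GenericRecord]())
val embeddedSchema = reader.getSchema
println(embeddedSchema.toString(true))   // pretty-printed JSON schema
reader.close()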

Avro field default values

旧巷老猫 submitted on 2019-11-30 04:45:21
I am running into some issues setting up default values for Avro fields. I have a simple schema as given below:

data.avsc:

{
  "namespace": "test",
  "type": "record",
  "name": "Data",
  "fields": [
    { "name": "id",    "type": [ "long",   "null" ] },
    { "name": "value", "type": [ "string", "null" ] },
    { "name": "raw",   "type": [ "bytes",  "null" ] }
  ]
}

I am using the avro-maven-plugin v1.7.6 to generate the Java model. When I create an instance of the model using Data data = Data.newBuilder().build(); it fails with an exception:

org.apache.avro.AvroRuntimeException: org.apache.avro.AvroRuntimeException: Field
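This kind of failure is usually caused by the fields declaring no "default" while the unions list "null" second: an Avro default value must match the first branch of the union, so the builder has nothing to fall back to. A minimal sketch of the usual fix, shown with the generic API rather than the generated Data class and with the unions reordered plus explicit null defaults (an assumption about the intended schema):

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecordBuilder

// Reordered unions ("null" first) with explicit null defaults.
val fixedSchema = new Schema.Parser().parse(
  """{"namespace":"test","type":"record","name":"Data","fields":[
    |  {"name":"id",    "type":["null","long"],   "default": null},
    |  {"name":"value", "type":["null","string"], "default": null},
    |  {"name":"raw",   "type":["null","bytes"],  "default": null}
    |]}""".stripMargin)

// Builds successfully; every field falls back to its null default.
val data = new GenericRecordBuilder(fixedSchema).build()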

Getting Started with Avro

为君一笑 submitted on 2019-11-30 02:00:39
I want to get started with using Avro with MapReduce. Can someone suggest a good tutorial or example to get started with? I couldn't find much through an internet search.

I recently did a project that was heavily based on Avro data, and not having used this data format before, I had to start from scratch. You are right in that it is rather hard to get much help from online sources when getting started with Avro. The material that I would recommend to you is: by far, the most helpful source that I found was the Avro section (p103-p116) in Tom White's Hadoop: The Definitive Guide book, as well as

How to encode/decode Kafka messages using Avro binary encoder?

好久不见. submitted on 2019-11-29 20:27:26
I'm trying to use Avro for messages being read from/written to Kafka. Does anyone have an example of using the Avro binary encoder to encode/decode data that will be put on a message queue? I need the Avro part more than the Kafka part. Or perhaps I should look at a different solution? Basically, I'm trying to find a more space-efficient alternative to JSON. Avro was mentioned since it can be more compact than JSON.

ramu: This is a basic example. I have not tried it with multiple partitions/topics.

//Sample producer code
import org.apache.avro.Schema;
import org.apache.avro
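Since the excerpt cuts off, here is a minimal sketch of just the binary encode/decode round trip, written in Scala against the same Avro classes and using a made-up single-field schema; the resulting byte array is what would be sent to and read from the Kafka topic:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Msg","fields":[{"name":"body","type":"string"}]}""")

// Encode a record to compact Avro binary (no field names on the wire, unlike JSON).
val record = new GenericData.Record(schema)
record.put("body", "hello")
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
encoder.flush()
val bytes = out.toByteArray

// Decode the bytes back into a GenericRecord using the same schema.
val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
println(decoded.get("body"))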

How to convert RDD[GenericRecord] to dataframe in scala?

回眸只為那壹抹淺笑 submitted on 2019-11-29 12:50:33
I get tweets from a Kafka topic with Avro (serializer and deserializer). Then I create a Spark consumer which extracts the tweets into a DStream of RDD[GenericRecord]. Now I want to convert each RDD to a DataFrame to analyse these tweets via SQL. Any solution to convert RDD[GenericRecord] to a DataFrame, please?

I spent some time trying to make this work (especially how to deserialize the data properly, but it looks like you already cover this) ...

UPDATED

//Define function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType : SchemaConverters.SchemaType): Row = { val
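A minimal sketch of the overall approach, assuming spark-avro's SchemaConverters is available (it lives in com.databricks.spark.avro in older releases and org.apache.spark.sql.avro in newer ones), that avroSchema, genericRecordRDD (one RDD[GenericRecord] batch, e.g. inside foreachRDD) and spark (the SparkSession) already exist, and that the records hold only flat, simple fields; nested types would need a fuller conversion like the genericRecordToRow function above:

import com.databricks.spark.avro.SchemaConverters
import org.apache.avro.util.Utf8
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Derive the Spark schema from the Avro schema once.
val structType = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

// Map each GenericRecord to a Row field by field, converting Avro Utf8 strings to java.lang.String.
val rowRDD = genericRecordRDD.map { rec =>
  Row.fromSeq(structType.fieldNames.toSeq.map { name =>
    rec.get(name) match {
      case s: Utf8 => s.toString
      case other   => other
    }
  })
}

val df = spark.createDataFrame(rowRDD, structType)
df.createOrReplaceTempView("tweets")   // then analyse with spark.sql(...)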

Handling schema changes in running Spark Streaming application

不想你离开。 submitted on 2019-11-29 12:10:27
I am looking to build a Spark Streaming application using the DataFrames API on Spark 1.6. Before I get too far down the rabbit hole, I was hoping someone could help me understand how the DataFrames API deals with data having a different schema. The idea is that messages will flow into Kafka with an Avro schema. We should be able to evolve the schema in backwards-compatible ways without having to restart the streaming application (the application logic will still work). It appears trivial to deserialize new versions of messages using a schema registry and the schema id embedded in the message using
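The decode step that makes such evolution work can be kept independent of Spark: Avro resolves a payload written with an older writer schema against the application's newer, backwards-compatible reader schema. A minimal sketch (in practice the writer schema would be fetched from the schema registry via the id embedded in the message, which is not shown here):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Resolve a record written with writerSchema into the shape of readerSchema.
def decode(payload: Array[Byte], writerSchema: Schema, readerSchema: Schema): GenericRecord = {
  val datumReader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  datumReader.read(null, DecoderFactory.get().binaryDecoder(payload, null))
}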

Confluent Maven repository not working?

纵饮孤独 submitted on 2019-11-29 11:20:17
I need to use the Confluent kafka-avro-serializer Maven artifact. According to the official guide I should add this repository to my Maven pom:

<repository>
  <id>confluent</id>
  <url>http://packages.confluent.io/maven/</url>
</repository>

The problem is that the URL http://packages.confluent.io/maven/ does not seem to work at the moment, as I get the response below:

<Error>
  <Code>NoSuchKey</Code>
  <Message>The specified key does not exist.</Message>
  <Key>maven/</Key>
  <RequestId>15E287D11E5D4DFA</RequestId>

KafkaAvroDeserializer does not return SpecificRecord but returns GenericRecord

时光毁灭记忆、已成空白 submitted on 2019-11-29 09:15:22
My KafkaProducer is able to use KafkaAvroSerializer to serialize objects to my topic. However, KafkaConsumer.poll() returns a deserialized GenericRecord instead of my serialized class.

MyKafkaProducer:

KafkaProducer<CharSequence, MyBean> producer;
try (InputStream props = Resources.getResource("producer.props").openStream()) {
    Properties properties = new Properties();
    properties.load(props);
    properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers
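What is usually missing on the consumer side is the setting that tells KafkaAvroDeserializer to materialize the generated SpecificRecord class instead of a GenericRecord. A minimal sketch (broker and schema-registry addresses are placeholders; MyBean is the generated class from the question):

import java.util.Properties
import io.confluent.kafka.serializers.{KafkaAvroDeserializer, KafkaAvroDeserializerConfig}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "mybean-consumer")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[KafkaAvroDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[KafkaAvroDeserializer].getName)
props.put("schema.registry.url", "http://localhost:8081")
// Without this flag the deserializer falls back to GenericRecord.
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true")

val consumer = new KafkaConsumer[CharSequence, MyBean](props)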