avro

Spark Dataframe write to kafka topic in avro format?

橙三吉。 submitted on 2019-11-30 10:47:35
I have a DataFrame in Spark, eventDF, that looks like this:

Sno|UserID|TypeExp
1|JAS123|MOVIE
2|ASP123|GAMES
3|JAS123|CLOTHING
4|DPS123|MOVIE
5|DPS123|CLOTHING
6|ASP123|MEDICAL
7|JAS123|OTH
8|POQ133|MEDICAL
.......
10000|DPS123|OTH

I need to write it to a Kafka topic in Avro format. Currently I am able to write to Kafka as JSON using the following code:

val kafkaUserDF: DataFrame = eventDF.select(to_json(struct(eventDF.columns.map(column):_*)).alias("value"))
kafkaUserDF.selectExpr("CAST(value AS STRING)").write.format("kafka")
  .option("kafka.bootstrap.servers", "Host:port")
  .option("topic", "eventdf")
  .save(
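One way to produce Avro instead of JSON, assuming Spark 2.4 or newer with the external spark-avro package on the classpath (it provides a to_avro column function analogous to to_json), is sketched below; the column handling and Kafka options mirror the JSON version above.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.avro.to_avro          // in Spark 3.x: org.apache.spark.sql.avro.functions.to_avro
import org.apache.spark.sql.functions.{col, struct}

// Serialize the whole row into a single Avro-encoded binary "value" column.
val kafkaAvroDF: DataFrame = eventDF.select(to_avro(struct(eventDF.columns.map(col): _*)).alias("value"))

kafkaAvroDF.write.format("kafka")
  .option("kafka.bootstrap.servers", "Host:port")
  .option("topic", "eventdf")
  .save()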

How to extract schema for avro file in python

落爺英雄遲暮 submitted on 2019-11-30 08:53:06
I am trying to use the Python Avro library ( https://pypi.python.org/pypi/avro ) to read an Avro file generated by Java. Since the schema is already embedded in the Avro file, why do I need to specify a schema file? Is there a way to extract it automatically? I found another package called fastavro ( https://pypi.python.org/pypi/fastavro ) that can extract the Avro schema. Is having to manually specify the schema file in the Python avro package by design? Thank you very much. I use Python 3.4 and Avro package 1.7.7. For the schema file use:

reader = avro.datafile.DataFileReader(open('file_name.avro',"rb"),avro.io
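For comparison, since the file in question was written by Java, the embedded writer schema can also be pulled out on the JVM side without any external .avsc file; a minimal sketch using the Avro Java API (here from Scala), with the file name taken from the question:

import java.io.File
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}

// The container file carries its writer schema in the header, so the reader needs no schema file.
val reader = new DataFileReader[GenericRecord](new File("file_name.avro"), new GenericDatumReader[GenericRecord]())
val embeddedSchema = reader.getSchema
println(embeddedSchema.toString(true))   // pretty-printed JSON schema
reader.close()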

Avro field default values

旧巷老猫 submitted on 2019-11-30 04:45:21
I am running into some issues setting up default values for Avro fields. I have a simple schema as given below:

data.avsc:

{
  "namespace": "test",
  "type": "record",
  "name": "Data",
  "fields": [
    { "name": "id",    "type": [ "long",   "null" ] },
    { "name": "value", "type": [ "string", "null" ] },
    { "name": "raw",   "type": [ "bytes",  "null" ] }
  ]
}

I am using the avro-maven-plugin v1.7.6 to generate the Java model. When I create an instance of the model using Data data = Data.newBuilder().build(); it fails with an exception:

org.apache.avro.AvroRuntimeException: org.apache.avro.AvroRuntimeException: Field
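This kind of failure is usually caused by the fields declaring no "default" while the unions list "null" second: an Avro default value must match the first branch of the union, so the builder has nothing to fall back to. A minimal sketch of the usual fix, shown with the generic API rather than the generated Data class and with the unions reordered plus explicit null defaults (an assumption about the intended schema):

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecordBuilder

// Reordered unions ("null" first) with explicit null defaults.
val fixedSchema = new Schema.Parser().parse(
  """{"namespace":"test","type":"record","name":"Data","fields":[
    |  {"name":"id",    "type":["null","long"],   "default": null},
    |  {"name":"value", "type":["null","string"], "default": null},
    |  {"name":"raw",   "type":["null","bytes"],  "default": null}
    |]}""".stripMargin)

// Builds successfully; every field falls back to its null default.
val data = new GenericRecordBuilder(fixedSchema).build()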

Getting Started with Avro

为君一笑 submitted on 2019-11-30 02:00:39
I want to get started with using Avro with MapReduce. Can someone suggest a good tutorial or example to get started with? I couldn't find much through an internet search.

I recently did a project that was heavily based on Avro data, and not having used this data format before, I had to start from scratch. You are right in that it is rather hard to get much help from online sources when getting started with Avro. The material that I would recommend to you is: by far, the most helpful source that I found was the Avro section (p103-p116) in Tom White's Hadoop: The Definitive Guide book, as well as

How to encode/decode Kafka messages using Avro binary encoder?

好久不见. submitted on 2019-11-29 20:27:26
I'm trying to use Avro for messages being read from/written to Kafka. Does anyone have an example of using the Avro binary encoder to encode/decode data that will be put on a message queue? I need the Avro part more than the Kafka part. Or perhaps I should look at a different solution? Basically, I'm trying to find a more space-efficient alternative to JSON. Avro was mentioned since it can be more compact than JSON.

ramu: This is a basic example. I have not tried it with multiple partitions/topics.

//Sample producer code
import org.apache.avro.Schema;
import org.apache.avro
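Since the excerpt cuts off, here is a minimal sketch of just the binary encode/decode round trip, written in Scala against the same Avro classes and using a made-up single-field schema; the resulting byte array is what would be sent to and read from the Kafka topic:

import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Msg","fields":[{"name":"body","type":"string"}]}""")

// Encode a record to compact Avro binary (no field names on the wire, unlike JSON).
val record = new GenericData.Record(schema)
record.put("body", "hello")
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
encoder.flush()
val bytes = out.toByteArray

// Decode the bytes back into a GenericRecord using the same schema.
val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
val decoded = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
println(decoded.get("body"))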

How to convert RDD[GenericRecord] to dataframe in scala?

回眸只為那壹抹淺笑 submitted on 2019-11-29 12:50:33
I get tweets from a Kafka topic with Avro (serializer and deserializer). Then I create a Spark consumer which extracts the tweets into a DStream of RDD[GenericRecord]. Now I want to convert each RDD to a DataFrame to analyse these tweets via SQL. Any solution to convert RDD[GenericRecord] to a DataFrame, please?

I spent some time trying to make this work (especially how to deserialize the data properly, but it looks like you already cover this) ...

UPDATED

//Define function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType : SchemaConverters.SchemaType): Row = { val
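A minimal sketch of the overall approach, assuming spark-avro's SchemaConverters is available (it lives in com.databricks.spark.avro in older releases and org.apache.spark.sql.avro in newer ones), that avroSchema, genericRecordRDD (one RDD[GenericRecord] batch, e.g. inside foreachRDD) and spark (the SparkSession) already exist, and that the records hold only flat, simple fields; nested types would need a fuller conversion like the genericRecordToRow function above:

import com.databricks.spark.avro.SchemaConverters
import org.apache.avro.util.Utf8
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Derive the Spark schema from the Avro schema once.
val structType = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]

// Map each GenericRecord to a Row field by field, converting Avro Utf8 strings to java.lang.String.
val rowRDD = genericRecordRDD.map { rec =>
  Row.fromSeq(structType.fieldNames.toSeq.map { name =>
    rec.get(name) match {
      case s: Utf8 => s.toString
      case other   => other
    }
  })
}

val df = spark.createDataFrame(rowRDD, structType)
df.createOrReplaceTempView("tweets")   // then analyse with spark.sql(...)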

Handling schema changes in running Spark Streaming application

不想你离开。 submitted on 2019-11-29 12:10:27
I am looking to build a Spark Streaming application using the DataFrames API on Spark 1.6. Before I get too far down the rabbit hole, I was hoping someone could help me understand how the DataFrames API deals with data having a different schema. The idea is that messages will flow into Kafka with an Avro schema. We should be able to evolve the schema in backwards-compatible ways without having to restart the streaming application (the application logic will still work). It appears trivial to deserialize new versions of messages using a schema registry and the schema id embedded in the message using
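The decode step that makes such evolution work can be kept independent of Spark: Avro resolves a payload written with an older writer schema against the application's newer, backwards-compatible reader schema. A minimal sketch (in practice the writer schema would be fetched from the schema registry via the id embedded in the message, which is not shown here):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Resolve a record written with writerSchema into the shape of readerSchema.
def decode(payload: Array[Byte], writerSchema: Schema, readerSchema: Schema): GenericRecord = {
  val datumReader = new GenericDatumReader[GenericRecord](writerSchema, readerSchema)
  datumReader.read(null, DecoderFactory.get().binaryDecoder(payload, null))
}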

Confluent Maven repository not working?

纵饮孤独 submitted on 2019-11-29 11:20:17
I need to use the Confluent kafka-avro-serializer Maven artifact. According to the official guide I should add this repository to my Maven pom:

<repository>
  <id>confluent</id>
  <url>http://packages.confluent.io/maven/</url>
</repository>

The problem is that the URL http://packages.confluent.io/maven/ does not seem to work at the moment, as I get the response below:

<Error>
  <Code>NoSuchKey</Code>
  <Message>The specified key does not exist.</Message>
  <Key>maven/</Key>
  <RequestId>15E287D11E5D4DFA</RequestId>

KafkaAvroDeserializer does not return SpecificRecord but returns GenericRecord

时光毁灭记忆、已成空白 submitted on 2019-11-29 09:15:22
My KafkaProducer is able to use KafkaAvroSerializer to serialize objects to my topic. However, KafkaConsumer.poll() returns a deserialized GenericRecord instead of my serialized class.

MyKafkaProducer:

KafkaProducer<CharSequence, MyBean> producer;
try (InputStream props = Resources.getResource("producer.props").openStream()) {
    Properties properties = new Properties();
    properties.load(props);
    properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers
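What is usually missing on the consumer side is the setting that tells KafkaAvroDeserializer to materialize the generated SpecificRecord class instead of a GenericRecord. A minimal sketch (broker and schema-registry addresses are placeholders; MyBean is the generated class from the question):

import java.util.Properties
import io.confluent.kafka.serializers.{KafkaAvroDeserializer, KafkaAvroDeserializerConfig}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}

val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ConsumerConfig.GROUP_ID_CONFIG, "mybean-consumer")
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[KafkaAvroDeserializer].getName)
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[KafkaAvroDeserializer].getName)
props.put("schema.registry.url", "http://localhost:8081")
// Without this flag the deserializer falls back to GenericRecord.
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true")

val consumer = new KafkaConsumer[CharSequence, MyBean](props)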