avro

How to decode Kafka messages using Avro and Flink

限于喜欢 submitted on 2019-12-06 05:18:29
I am trying to read Avro data from a Kafka topic using Flink 1.0.3. All I know is that this particular Kafka topic carries Avro-encoded messages, and I have the Avro schema file. My Flink code:

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "dojo3xxxxx:9092,dojoxxxxx:9092,dojoxxxxx:9092");
    properties.setProperty("zookeeper.connect", "dojo3xxxxx:2181,dojoxxxxx:2181,dojoxxxxx:2181");
    properties.setProperty(
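A sketch of one possible approach (not from the original post): implement Flink's DeserializationSchema with Avro's Generic API and hand it to the Kafka consumer. This assumes the messages are plain Avro binary (not the Confluent wire format) and that the schema file is reachable as "user.avsc" — both assumptions, adjust to your setup. In Flink 1.0.x the interface lives in org.apache.flink.streaming.util.serialization; newer versions moved it to org.apache.flink.api.common.serialization.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.util.serialization.DeserializationSchema;

public class AvroDeserializationSchema implements DeserializationSchema<GenericRecord> {

    // Schema and DatumReader are not Serializable, so create them lazily on each task
    private transient Schema schema;
    private transient DatumReader<GenericRecord> reader;

    private DatumReader<GenericRecord> reader() throws IOException {
        if (reader == null) {
            schema = new Schema.Parser().parse(new File("user.avsc")); // hypothetical path
            reader = new GenericDatumReader<>(schema);
        }
        return reader;
    }

    @Override
    public GenericRecord deserialize(byte[] message) throws IOException {
        // Decode one raw Avro-encoded Kafka message into a GenericRecord
        return reader().read(null, DecoderFactory.get().binaryDecoder(message, null));
    }

    @Override
    public boolean isEndOfStream(GenericRecord nextElement) {
        return false;
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return TypeExtractor.getForClass(GenericRecord.class);
    }
}
```

This schema would then be passed to the consumer, e.g. new FlinkKafkaConsumer08<>("topic", new AvroDeserializationSchema(), properties), and wired into env.addSource(...).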

Is there a way to programmatically convert JSON to AVRO Schema?

老子叫甜甜 submitted on 2019-12-06 04:01:17
Question: I need to create an Avro file, but for that I need two things: 1) JSON data and 2) an Avro schema. Of those two, I have the JSON:

{"web-app": {
  "servlet": [ {
    "servlet-name": "cofaxCDS",
    "servlet-class": "org.cofax.cds.CDSServlet",
    "init-param": {
      "configGlossary:installationAt": "Philadelphia, PA",
      "configGlossary:adminEmail": "ksm@pobox.com",
      "configGlossary:poweredBy": "Cofax",
      "configGlossary:poweredByIcon": "/images/cofax.gif",
      "configGlossary:staticPath": "/content/static",
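Core Avro has no built-in inference of a schema from arbitrary JSON data (some third-party tools offer that), but a schema can be assembled programmatically with Avro's SchemaBuilder. Below is a minimal sketch covering only a slice of the "web-app" document above; note that Avro field names may contain only letters, digits and underscores, so keys such as "servlet-name" and "configGlossary:adminEmail" have to be renamed or mapped — the names used here are assumptions.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class WebAppSchema {
    public static void main(String[] args) {
        // Nested record for a few of the init-param entries
        Schema initParam = SchemaBuilder.record("InitParam").fields()
                .requiredString("configGlossary_installationAt")
                .requiredString("configGlossary_adminEmail")
                .requiredString("configGlossary_poweredBy")
                .endRecord();

        // One servlet entry
        Schema servlet = SchemaBuilder.record("Servlet").fields()
                .requiredString("servlet_name")
                .requiredString("servlet_class")
                .name("init_param").type(initParam).noDefault()
                .endRecord();

        // Top-level record: an array of servlets
        Schema webApp = SchemaBuilder.record("WebApp").fields()
                .name("servlet").type().array().items(servlet).noDefault()
                .endRecord();

        System.out.println(webApp.toString(true)); // pretty-printed .avsc text
    }
}
```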

Why use Avro with Kafka - How to handle POJOs

半城伤御伤魂 submitted on 2019-12-06 02:46:00
I have a Spring application that is my Kafka producer, and I was wondering why Avro is the best way to go. I read about it and all it has to offer, but why can't I just serialize the POJO that I created myself with Jackson, for example, and send it to Kafka? I ask because generating POJOs from Avro is not so straightforward; on top of that, it requires the Maven plugin and an .avsc file. So, for example, I have a POJO on my Kafka producer, created by hand, called User:

public class User {
    private long userId;
    private String name;
    public String getName() { return name; }
    public void setName
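One point worth illustrating: the Maven plugin and the .avsc file are only needed for Avro's code-generation (SpecificRecord) route. Avro can also serialize an existing hand-written POJO through its reflect API. A minimal sketch, assuming the User class from the question; the resulting byte[] could be sent to Kafka with the stock ByteArraySerializer. This says nothing about Schema Registry integration, which is a separate reason people pair Avro with Kafka.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

public class ReflectAvroExample {

    public static byte[] toAvroBytes(User user) throws IOException {
        // Derive the Avro schema from the POJO's fields at runtime (no .avsc needed)
        Schema schema = ReflectData.get().getSchema(User.class);
        ReflectDatumWriter<User> writer = new ReflectDatumWriter<>(schema);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);   // compact Avro binary, no field names on the wire
        encoder.flush();
        return out.toByteArray();
    }
}
```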

Populating nested records in Avro using a GenericRecord

强颜欢笑 submitted on 2019-12-05 22:41:24
Question: Suppose I've got the following schema:

{
  "name" : "Profile",
  "type" : "record",
  "fields" : [
    { "name" : "firstName", "type" : "string" },
    { "name" : "address", "type" : {
        "type" : "record",
        "name" : "AddressUSRecord",
        "fields" : [
          { "name" : "address1", "type" : "string" },
          { "name" : "address2", "type" : "string" },
          { "name" : "city", "type" : "string" },
          { "name" : "state", "type" : "string" },
          { "name" : "zip", "type" : "int" },
          { "name" : "zip4", "type" : "int" }
        ]
      }
    }
  ]
}

I'm using
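A minimal sketch of one way to populate the nested AddressUSRecord with the Generic API, assuming the schema above has been parsed into a Schema object: build a GenericRecord for the inner schema first, then set it as the value of the outer "address" field. The sample values are made up.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class NestedRecordExample {

    public static GenericRecord buildProfile(Schema profileSchema) {
        // Pull the nested AddressUSRecord schema out of the "address" field
        Schema addressSchema = profileSchema.getField("address").schema();

        GenericRecord address = new GenericData.Record(addressSchema);
        address.put("address1", "1 Main St");
        address.put("address2", "");
        address.put("city", "Philadelphia");
        address.put("state", "PA");
        address.put("zip", 19103);
        address.put("zip4", 0);

        GenericRecord profile = new GenericData.Record(profileSchema);
        profile.put("firstName", "Jane");
        profile.put("address", address);   // the nested record goes in as a whole
        return profile;
    }
}
```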

Spark - write Avro file

戏子无情 submitted on 2019-12-05 13:48:18
What are the common practices for writing Avro files with Spark (using the Scala API) in a flow like this:

- parse some log files from HDFS
- for each log file, apply some business logic and generate an Avro file (or maybe merge multiple files)
- write the Avro files to HDFS

I tried to use spark-avro, but it doesn't help much.

val someLogs = sc.textFile(inputPath)
val rowRDD = someLogs.map { line => createRow(...) }
val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)

This fails with the error: org.apache.spark.sql.AnalysisException: Reference
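For reference, a minimal Java sketch of the same flow using the databricks spark-avro package through the generic DataFrameWriter.format(...) call, which is what the Scala write.avro(...) implicit expands to. The paths, the schema and the createRow(...) parsing logic are placeholders from the question, and this does not address the (truncated) AnalysisException itself.

```java
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.StructType;

public class WriteAvroJob {

    public static void run(SparkContext sc, StructType schema,
                           String inputPath, String outputPath) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);

        // Parse each log line into a Row matching the supplied schema
        JavaRDD<Row> rowRDD = jsc.textFile(inputPath)
                .map(line -> createRow(line));

        SQLContext sqlContext = new SQLContext(sc);
        DataFrame dataFrame = sqlContext.createDataFrame(rowRDD, schema);

        // Equivalent of dataFrame.write.avro(outputPath) without the Scala implicit
        dataFrame.write().format("com.databricks.spark.avro").save(outputPath);
    }

    private static Row createRow(String line) {
        // Placeholder: business logic from the question goes here
        throw new UnsupportedOperationException("parse logic omitted");
    }
}
```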

Reading Event Hub Archive File in C#

纵饮孤独 submitted on 2019-12-05 13:36:02
Is there any sample code in C# for reading the Azure Event Hub Archive files (Avro format)? I am trying to use the Microsoft.Hadoop.Avro library. I dumped the schema out using the Java Avro tool, which produces this:

{
  "type" : "record",
  "name" : "EventData",
  "namespace" : "Microsoft.ServiceBus.Messaging",
  "fields" : [
    { "name" : "SequenceNumber", "type" : "long" },
    { "name" : "Offset", "type" : "string" },
    { "name" : "EnqueuedTimeUtc", "type" : "string" },
    { "name" : "SystemProperties", "type" : { "type" : "map", "values" : [ "long", "double", "string", "bytes" ] } },
    { "name" : "
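The question asks for C#, but as a cross-check the same archive files can be opened with Avro's Java container-file reader, because every Avro file embeds the schema it was written with. A minimal sketch that prints the embedded schema and each record; treating "Body" as a bytes field holding the event payload is an assumption based on the usual EventData capture schema (that part of the schema is truncated above).

```java
import java.io.File;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadCaptureFile {

    public static void main(String[] args) throws IOException {
        File avroFile = new File(args[0]); // path to one archived .avro file
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            // The writer schema travels inside the container file
            System.out.println("Embedded schema: " + reader.getSchema());

            for (GenericRecord record : reader) {
                ByteBuffer body = (ByteBuffer) record.get("Body"); // assumed bytes payload
                System.out.println(record.get("SequenceNumber") + " -> "
                        + StandardCharsets.UTF_8.decode(body));
            }
        }
    }
}
```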

How to populate the cache in CachedSchemaRegistryClient without making a call to register a new schema?

最后都变了- submitted on 2019-12-05 12:56:34
We have a Spark Streaming application that integrates with Kafka, and I'm trying to optimize it because it makes excessive calls to the Schema Registry to download schemas. The Avro schema for our data rarely changes, yet currently our application calls the Schema Registry for every incoming record, which is way too much. I ran into CachedSchemaRegistryClient from Confluent, and it looked promising, though after looking into its implementation I'm not sure how to use its built-in cache to reduce the REST calls to the Schema Registry. The above link will bring you to the source code; here I'm pasting the
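A minimal sketch of the usual pattern, assuming the Confluent serializers are on the classpath: create one CachedSchemaRegistryClient per JVM (or per executor) and pass it to KafkaAvroDeserializer, so only the first record carrying a given schema id triggers a REST call and later lookups are served from the client's in-memory cache. The URL and the cache capacity (100) are placeholders.

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;

import org.apache.avro.generic.GenericRecord;

public class SchemaRegistryCacheExample {

    // One client (and therefore one schema cache) per JVM, not per record
    private static final SchemaRegistryClient CLIENT =
            new CachedSchemaRegistryClient("http://schema-registry:8081", 100);

    private static final KafkaAvroDeserializer DESERIALIZER =
            new KafkaAvroDeserializer(CLIENT);

    public static GenericRecord decode(String topic, byte[] payload) {
        // First call for a given schema id hits the registry; later calls use the cache
        return (GenericRecord) DESERIALIZER.deserialize(topic, payload);
    }
}
```

In a Spark Streaming job the static fields above would live on each executor, so the registry is contacted at most once per schema id per executor rather than once per record.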

Reading/writing with Avro schemas AND Parquet format in SparkSQL

落爺英雄遲暮 submitted on 2019-12-05 11:08:26
I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads. My understanding is that this is possible outside of Spark (or manually within Spark) using, e.g., AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader) and which integrate well with SparkSQL (I will be writing and reading Datasets). I can't for the life of me figure out how to do this, and I'm wondering whether it is possible at all. The only
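For comparison, here is a minimal sketch of the "outside Spark" route mentioned above: AvroParquetWriter from parquet-avro, where the Avro schema drives the Parquet schema. Whether the same can be done through SparkSQL's DataFrameWriter is exactly what the question is asking, so this only illustrates the manual path.

```java
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroToParquet {

    public static void write(Schema avroSchema, List<GenericRecord> records,
                             String outputPath) throws IOException {
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path(outputPath))
                .withSchema(avroSchema)        // the Avro schema defines the Parquet schema
                .build()) {
            for (GenericRecord record : records) {
                writer.write(record);
            }
        }
    }
}
```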

Trouble with Avro serialization of JSON documents missing fields

馋奶兔 submitted on 2019-12-05 09:33:47
I'm trying to use Apache Avro to enforce a schema on data exported from Elasticsearch into a lot of Avro documents in HDFS (to be queried with Drill). I'm having some trouble with Avro defaults. Given this schema:

{
  "namespace" : "avrotest",
  "type" : "record",
  "name" : "people",
  "fields" : [
    { "name" : "firstname", "type" : "string" },
    { "name" : "age", "type" : "int", "default" : -1 }
  ]
}

I'd expect that a JSON document such as {"firstname" : "Jane"} would be serialized using the default value of -1 for the age field.

default: A default value for this field, used when reading instances that lack
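A minimal sketch of how Avro defaults are actually applied: at read time, through schema resolution, when the reader's schema contains a field the writer's schema lacks, not when the JSON input merely omits the field. Encoding {"firstname" : "Jane"} with a writer schema that has no "age" field and decoding it with the full schema yields age = -1.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroDefaultsExample {

    public static void main(String[] args) throws IOException {
        Schema writerSchema = new Schema.Parser().parse(
                "{\"namespace\":\"avrotest\",\"type\":\"record\",\"name\":\"people\","
              + "\"fields\":[{\"name\":\"firstname\",\"type\":\"string\"}]}");
        Schema readerSchema = new Schema.Parser().parse(
                "{\"namespace\":\"avrotest\",\"type\":\"record\",\"name\":\"people\","
              + "\"fields\":[{\"name\":\"firstname\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

        // Serialize a record that only has firstname, using the writer schema
        GenericRecord jane = new GenericData.Record(writerSchema);
        jane.put("firstname", "Jane");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(jane, encoder);
        encoder.flush();

        // Deserialize with the reader schema: the missing "age" gets its default
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
                .read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
        System.out.println(decoded); // {"firstname": "Jane", "age": -1}
    }
}
```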

Where is an Avro schema stored when I create a Hive table with the 'STORED AS AVRO' clause?

南笙酒味 submitted on 2019-12-04 22:51:09
Question: There are at least two different ways of creating a Hive table backed by Avro data:

1) Creating a table based on an Avro schema (in this example stored in HDFS):

CREATE TABLE users_from_avro_schema
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/root/avro/schema/user.avsc')
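Whatever the table-level source of truth turns out to be (avro.schema.url, avro.schema.literal, or the columns Hive records in the metastore for plain STORED AS AVRO), each Avro data file under the table's location also embeds the schema it was written with. A minimal sketch for inspecting that embedded copy; the file path is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintEmbeddedSchema {

    public static void main(String[] args) throws IOException {
        // Hypothetical path to one data file of the table
        Path dataFile = new Path("hdfs:///user/hive/warehouse/users_from_avro_schema/000000_0");
        FileSystem fs = dataFile.getFileSystem(new Configuration());

        try (InputStream in = fs.open(dataFile);
             DataFileStream<GenericRecord> stream =
                     new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
            // Print the writer schema embedded in the Avro container file
            System.out.println(stream.getSchema().toString(true));
        }
    }
}
```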