avro

Create a PostgreSQL table from an Avro schema in NiFi

眉间皱痕 submitted on 2019-12-01 07:28:15
Question: Using InferAvroSchema I got an Avro schema for my file. I want to create a table in PostgreSQL using this Avro schema. Which processor do I have to use? My flow is: GetFile -> InferAvroSchema -> (create a table from this schema) -> PutDatabaseRecord. The Avro schema: { "type" : "record", "name" : "warranty", "doc" : "Schema generated by Kite", "fields" : [ { "name" : "id", "type" : "long", "doc" : "Type inferred from '1'" }, { "name" : "train_id", "type" : "long", "doc" : "Type inferred from
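If no processor in your NiFi version creates tables from a record schema, one common workaround is to generate the CREATE TABLE DDL yourself (for example in an ExecuteScript processor, or offline) and hand it to PutSQL before PutDatabaseRecord runs. A minimal Java sketch, assuming a flat record of primitive fields; the class name and type mapping are illustrative, not part of any NiFi API:

```java
import org.apache.avro.Schema;

public class AvroToPostgresDdl {

    // Rough mapping of primitive Avro types to PostgreSQL column types.
    // Unions such as ["long","null"] are not handled in this sketch.
    static String sqlType(Schema.Type t) {
        switch (t) {
            case LONG:    return "BIGINT";
            case INT:     return "INTEGER";
            case FLOAT:   return "REAL";
            case DOUBLE:  return "DOUBLE PRECISION";
            case BOOLEAN: return "BOOLEAN";
            case BYTES:   return "BYTEA";
            default:      return "TEXT";
        }
    }

    // Build a CREATE TABLE statement from a flat Avro record schema.
    public static String createTableDdl(Schema record) {
        StringBuilder sb = new StringBuilder("CREATE TABLE IF NOT EXISTS ")
                .append(record.getName()).append(" (");
        for (int i = 0; i < record.getFields().size(); i++) {
            Schema.Field f = record.getFields().get(i);
            if (i > 0) sb.append(", ");
            sb.append(f.name()).append(' ').append(sqlType(f.schema().getType()));
        }
        return sb.append(");").toString();
    }
}
```

Feeding the warranty schema above through createTableDdl would yield something like CREATE TABLE IF NOT EXISTS warranty (id BIGINT, train_id BIGINT, ...); which PutSQL can execute against PostgreSQL.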

Hive create table with inputs from nested sub-directories

本秂侑毒 submitted on 2019-12-01 03:16:37
I have data in Avro format in HDFS in file paths like /data/logs/[foldername]/[filename].avro . I want to create a Hive table over all these log files, i.e. all files matching /data/logs/*/* . (They're all based on the same Avro schema.) I'm running the query below with the flag mapred.input.dir.recursive=true : CREATE EXTERNAL TABLE default.testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs://...
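For reference, a hedged sketch of the usual combination: point LOCATION at the parent directory and enable both recursion properties in the session before querying (the avro.schema.url path is a hypothetical placeholder):

```sql
-- Enable recursive traversal of subdirectories (per session, or in hive-site.xml)
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

CREATE EXTERNAL TABLE default.testtable
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/logs'  -- parent directory; the .avro files sit one level down
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/log.avsc');  -- hypothetical path
```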

Spark: Writing to Avro file

倖福魔咒の submitted on 2019-12-01 02:25:11
In Spark, I have an RDD loaded from an Avro file. I now want to do some transformations on that RDD and save it back as an Avro file: val job = new Job(new Configuration()) AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema)) rdd.map(elem => (new SparkAvroKey(doTransformation(elem._1)), elem._2)) .saveAsNewAPIHadoopFile(outputPath, classOf[AvroKey[GenericRecord]], classOf[org.apache.hadoop.io.NullWritable], classOf[AvroKeyOutputFormat[GenericRecord]], job.getConfiguration) When running this, Spark complains that Schema$recordSchema is not serializable. If I uncomment the .map call (and
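Avro Schema objects are not Serializable, so any closure that captures one fails at task serialization. A common pattern is to ship the schema as its JSON string and re-parse it on the executors. A hedged Scala sketch reusing the question's names; doTransformation taking the schema as a second parameter is a hypothetical signature, the point being that nothing in the closure holds a Schema created on the driver:

```scala
import org.apache.avro.Schema

// The Schema itself is not Serializable, but its JSON string is.
val schemaJson = getOutputSchema(inputSchema).toString

val transformed = rdd.mapPartitions { iter =>
  // Re-parse once per partition on the executor side.
  val schema = new Schema.Parser().parse(schemaJson)
  iter.map { case (k, v) =>
    (new SparkAvroKey(doTransformation(k, schema)), v)
  }
}
```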

Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

被刻印的时光 ゝ submitted on 2019-11-30 23:46:40
There is a tiny problem when I try Cloudera 5.4.2. Based on the article Apache Flume - Fetching Twitter Data (http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm), I try to fetch tweets using Flume and Twitter streaming for data analysis. Everything goes well: create a Twitter app, create a directory on HDFS, configure Flume, then start fetching data and create a schema on top of the tweets. Then here is the problem: Twitter streaming converts tweets to Avro format and sends Avro events to the downstream HDFS sinks, and when the Hive table backed by Avro loads the data, I get the error message saying
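This error typically surfaces when the AvroSerDe reads HDFS files that are not valid Avro containers, e.g. because the TwitterSource variant in use emitted raw JSON or the sink wrapped events in a SequenceFile (both possibilities to check against your agent configuration, not certainties). A quick hedged diagnostic in Java: copy a suspect file out of HDFS and open it with DataFileReader, which rejects non-Avro input immediately:

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroFileCheck {
    public static void main(String[] args) throws Exception {
        // args[0]: a file copied out of HDFS, e.g. via `hdfs dfs -get`
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File(args[0]), new GenericDatumReader<>())) {
            System.out.println("Writer schema: " + reader.getSchema());
            long records = 0;
            while (reader.hasNext()) { reader.next(); records++; }
            System.out.println("Valid Avro container with " + records + " records");
        } // a JSON or corrupt file throws before printing anything
    }
}
```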

Compatibility of Avro dates and times with BigQuery?

 ̄綄美尐妖づ submitted on 2019-11-30 17:53:28
Question: BigQuery generally does a good job of loading Avro data, but bq load has a lot of trouble with timestamps and other date/time fields that use the Avro logicalType attribute. My data with the Avro type timestamp-millis is mangled when BigQuery TIMESTAMP interprets the values as microsecond timestamps (off by a factor of 1000). A timestamp-micros integer that can load into TIMESTAMP becomes INVALID in a BigQuery DATETIME. I can't find an explanation of what would be valid at https://cloud.google.com/bigquery
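One thing worth checking: bq load historically ignored logicalType annotations unless told otherwise, and the --use_avro_logical_types flag turns their interpretation on. A hedged example; the dataset, table, and bucket names are placeholders:

```sh
# Annotate the Avro field with a logical type, e.g.
#   { "name": "created_at", "type": { "type": "long", "logicalType": "timestamp-micros" } }
# then ask bq to honor the annotation:
bq load --source_format=AVRO --use_avro_logical_types \
    mydataset.mytable gs://my-bucket/data.avro
```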

How to read and write Map<String, Object> from/to a Parquet file in Java or Scala?

▼魔方 西西 submitted on 2019-11-30 17:37:13
Looking for a concise example of how to read and write a Map<String, Object> from/to a Parquet file in Java or Scala. Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. looking for the equivalent using Parquet): public static Map<String, Object> read(InputStream inputStream) throws IOException { ObjectMapper objectMapper = new ObjectMapper(); return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() { }); } public static void write(OutputStream outputStream, Map<String, Object> map) throws IOException { ObjectMapper
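Parquet, unlike Jackson, needs a schema before writing, so a fully dynamic Map<String, Object> has to be pinned down first. A hedged sketch using parquet-avro: flatten the map into key/value records with stringified values (an illustrative simplification that discards the original value types):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class MapParquet {
    // One row per map entry; both columns are strings in this sketch.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Entry\",\"fields\":["
        + "{\"name\":\"key\",\"type\":\"string\"},"
        + "{\"name\":\"value\",\"type\":\"string\"}]}");

    public static void write(String file, Map<String, Object> map) throws Exception {
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path(file))
                                  .withSchema(SCHEMA).build()) {
            for (Map.Entry<String, Object> e : map.entrySet()) {
                GenericRecord r = new GenericData.Record(SCHEMA);
                r.put("key", e.getKey());
                r.put("value", String.valueOf(e.getValue())); // type info is lost here
                writer.write(r);
            }
        }
    }

    public static Map<String, Object> read(String file) throws Exception {
        Map<String, Object> out = new HashMap<>();
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(file)).build()) {
            GenericRecord r;
            while ((r = reader.read()) != null) {
                out.put(r.get("key").toString(), r.get("value").toString());
            }
        }
        return out;
    }
}
```

To preserve heterogeneous value types you would instead serialize each value (e.g. to JSON) or define a proper record schema for the map's known shape.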

How to define a LogicalType in Avro (Java)

落爺英雄遲暮 submitted on 2019-11-30 14:11:31
Question: I need to be able to mark some fields in the Avro schema so that they will be encrypted at serialization time. A logicalType allows marking such fields, and together with a custom conversion it should let them be encrypted transparently by Avro. I had trouble finding documentation on how to define and use a new logicalType in Avro (avro_1.8.2#Logical+Types), so I decided to share what I found here in the answer, to ease the life of anyone else getting into it and to get some feedback
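For orientation, a hedged sketch of the Avro 1.8 registration pattern; the type name "encrypted" and the bytes-only restriction are illustrative choices, not anything the library mandates:

```java
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

// A custom logical type marking a field as holding encrypted content.
public class EncryptedLogicalType extends LogicalType {
    public static final String NAME = "encrypted";

    public EncryptedLogicalType() {
        super(NAME);
    }

    @Override
    public void validate(Schema schema) {
        super.validate(schema);
        // Illustrative restriction: only allow this type on bytes fields.
        if (schema.getType() != Schema.Type.BYTES) {
            throw new IllegalArgumentException(
                "Logical type '" + NAME + "' must back a bytes type");
        }
    }
}
```

Registering it once at startup, e.g. LogicalTypes.register(EncryptedLogicalType.NAME, schema -> new EncryptedLogicalType());, lets schema parsing resolve fields annotated with "logicalType": "encrypted"; a matching Conversion subclass would then perform the actual encryption and decryption.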

Avro field default values

偶尔善良 submitted on 2019-11-30 12:28:10
Question: I am running into some issues setting up default values for Avro fields. I have a simple schema, given below as data.avsc: { "namespace":"test", "type":"record", "name":"Data", "fields":[ { "name": "id", "type": [ "long", "null" ] }, { "name": "value", "type": [ "string", "null" ] }, { "name": "raw", "type": [ "bytes", "null" ] } ] } I am using the avro-maven-plugin v1.7.6 to generate the Java model. When I create an instance of the model using Data data = Data.newBuilder().build();, it
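For reference, in Avro the default value of a union field must correspond to the first branch of the union, so a field that should default to null needs "null" listed first together with an explicit default; with that in place, newBuilder().build() can succeed without the field being set:

```json
{ "name": "id", "type": ["null", "long"], "default": null }
```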

Confluent Maven repository not working?

僤鯓⒐⒋嵵緔 submitted on 2019-11-30 11:13:54
I need to use the Confluent kafka-avro-serializer Maven artifact. According to the official guide, I should add this repository to my Maven pom: <repository> <id>confluent</id> <url>http://packages.confluent.io/maven/</url> </repository> The problem is that the URL http://packages.confluent.io/maven/ does not seem to work at the moment, as I get the response below: <Error> <Code>NoSuchKey</Code> <Message>The specified key does not exist.</Message> <Key>maven/</Key> <RequestId>15E287D11E5D4DFA</RequestId> <HostId> QVr9lCF0y3SrQoa1Z0jDWtmxD3eJz1gAEdivauojVJ+Bexb2gB6JsMpnXc+JjF95i082hgSLJSM= </HostId> </Error>
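Worth noting: that NoSuchKey response is the S3 backend answering a browser GET on the bare maven/ prefix, which does not by itself mean the repository is broken, since Maven requests full artifact paths underneath it. A hedged pom sketch, preferring HTTPS; the version number is a placeholder to match your Kafka/Confluent release:

```xml
<repositories>
  <repository>
    <id>confluent</id>
    <url>https://packages.confluent.io/maven/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-avro-serializer</artifactId>
    <version>5.3.0</version> <!-- placeholder version -->
  </dependency>
</dependencies>
```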