avro

Create a PostgreSQL table from an Avro schema in NiFi

眉间皱痕 submitted on 2019-12-01 07:28:15
Question: Using InferAvroSchema I got an Avro schema for my file. I want to create a table in PostgreSQL using this Avro schema. Which processor do I have to use? My flow is: GetFile -> InferAvroSchema -> (create a table from this schema) -> PutDatabaseRecord. The Avro schema: { "type" : "record", "name" : "warranty", "doc" : "Schema generated by Kite", "fields" : [ { "name" : "id", "type" : "long", "doc" : "Type inferred from '1'" }, { "name" : "train_id", "type" : "long", "doc" : "Type inferred from
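If no processor in your NiFi version creates tables from a record schema, one common workaround is to generate the CREATE TABLE DDL yourself (for example in an ExecuteScript processor, or offline) and hand it to PutSQL before PutDatabaseRecord runs. A minimal Java sketch, assuming a flat record of primitive fields; the class name and type mapping are illustrative, not part of any NiFi API:

```java
import org.apache.avro.Schema;

public class AvroToPostgresDdl {

    // Rough mapping of primitive Avro types to PostgreSQL column types.
    // Unions such as ["long","null"] are not handled in this sketch.
    static String sqlType(Schema.Type t) {
        switch (t) {
            case LONG:    return "BIGINT";
            case INT:     return "INTEGER";
            case FLOAT:   return "REAL";
            case DOUBLE:  return "DOUBLE PRECISION";
            case BOOLEAN: return "BOOLEAN";
            case BYTES:   return "BYTEA";
            default:      return "TEXT";
        }
    }

    // Build a CREATE TABLE statement from a flat Avro record schema.
    public static String createTableDdl(Schema record) {
        StringBuilder sb = new StringBuilder("CREATE TABLE IF NOT EXISTS ")
                .append(record.getName()).append(" (");
        for (int i = 0; i < record.getFields().size(); i++) {
            Schema.Field f = record.getFields().get(i);
            if (i > 0) sb.append(", ");
            sb.append(f.name()).append(' ').append(sqlType(f.schema().getType()));
        }
        return sb.append(");").toString();
    }
}
```

Feeding the warranty schema above through createTableDdl would yield something like CREATE TABLE IF NOT EXISTS warranty (id BIGINT, train_id BIGINT, ...); which PutSQL can execute against PostgreSQL.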

Hive create table with inputs from nested sub-directories

本秂侑毒 submitted on 2019-12-01 03:16:37
I have data in Avro format in HDFS in file paths like /data/logs/[foldername]/[filename].avro . I want to create a Hive table over all these log files, i.e. all files matching /data/logs/*/* . (They're all based on the same Avro schema.) I'm running the query below with the flag mapred.input.dir.recursive=true : CREATE EXTERNAL TABLE default.testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 'hdfs://...
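For reference, a hedged sketch of the usual combination: point LOCATION at the parent directory and enable both recursion properties in the session before querying (the avro.schema.url path is a hypothetical placeholder):

```sql
-- Enable recursive traversal of subdirectories (per session, or in hive-site.xml)
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

CREATE EXTERNAL TABLE default.testtable
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/logs'  -- parent directory; the .avro files sit one level down
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/log.avsc');  -- hypothetical path
```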

Spark: Writing to Avro file

倖福魔咒の submitted on 2019-12-01 02:25:11
In Spark, I have an RDD loaded from an Avro file. I now want to do some transformations on that RDD and save it back as an Avro file: val job = new Job(new Configuration()) AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema)) rdd.map(elem => (new SparkAvroKey(doTransformation(elem._1)), elem._2)) .saveAsNewAPIHadoopFile(outputPath, classOf[AvroKey[GenericRecord]], classOf[org.apache.hadoop.io.NullWritable], classOf[AvroKeyOutputFormat[GenericRecord]], job.getConfiguration) When running this, Spark complains that Schema$recordSchema is not serializable. If I uncomment the .map call (and
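Avro Schema objects are not Serializable, so any closure that captures one fails at task serialization. A common pattern is to ship the schema as its JSON string and re-parse it on the executors. A hedged Scala sketch reusing the question's names; doTransformation taking the schema as a second parameter is a hypothetical signature, the point being that nothing in the closure holds a Schema created on the driver:

```scala
import org.apache.avro.Schema

// The Schema itself is not Serializable, but its JSON string is.
val schemaJson = getOutputSchema(inputSchema).toString

val transformed = rdd.mapPartitions { iter =>
  // Re-parse once per partition on the executor side.
  val schema = new Schema.Parser().parse(schemaJson)
  iter.map { case (k, v) =>
    (new SparkAvroKey(doTransformation(k, schema)), v)
  }
}
```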

Cloudera 5.4.2: Avro block size is invalid or too large when using Flume and Twitter streaming

被刻印的时光 ゝ submitted on 2019-11-30 23:46:40
There is a tiny problem when I try Cloudera 5.4.2. Based on the article Apache Flume - Fetching Twitter Data (http://www.tutorialspoint.com/apache_flume/fetching_twitter_data.htm), I try to fetch tweets using Flume and Twitter streaming for data analysis. Everything goes well: create a Twitter app, create a directory on HDFS, configure Flume, then start fetching data and create a schema on top of the tweets. Then here is the problem: Twitter streaming converts tweets to Avro format and sends Avro events to the downstream HDFS sinks, and when the Hive table backed by Avro loads the data, I get the error message saying
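This error typically surfaces when the AvroSerDe reads HDFS files that are not valid Avro containers, e.g. because the TwitterSource variant in use emitted raw JSON or the sink wrapped events in a SequenceFile (both possibilities to check against your agent configuration, not certainties). A quick hedged diagnostic in Java: copy a suspect file out of HDFS and open it with DataFileReader, which rejects non-Avro input immediately:

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroFileCheck {
    public static void main(String[] args) throws Exception {
        // args[0]: a file copied out of HDFS, e.g. via `hdfs dfs -get`
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File(args[0]), new GenericDatumReader<>())) {
            System.out.println("Writer schema: " + reader.getSchema());
            long records = 0;
            while (reader.hasNext()) { reader.next(); records++; }
            System.out.println("Valid Avro container with " + records + " records");
        } // a JSON or corrupt file throws before printing anything
    }
}
```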

Compatibility of Avro dates and times with BigQuery?

 ̄綄美尐妖づ submitted on 2019-11-30 17:53:28
Question: BigQuery generally does a good job of loading Avro data, but bq load has a lot of trouble with timestamps and other date/time fields that use the Avro logicalType attribute. My data with the Avro type timestamp-millis is mangled when BigQuery TIMESTAMP interprets the values as microsecond timestamps (off by a factor of 1000). A timestamp-micros integer that can load into TIMESTAMP becomes INVALID in a BigQuery DATETIME. I can't find an explanation of what would be valid at https://cloud.google.com/bigquery
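One thing worth checking: bq load historically ignored logicalType annotations unless told otherwise, and the --use_avro_logical_types flag turns their interpretation on. A hedged example; the dataset, table, and bucket names are placeholders:

```sh
# Annotate the Avro field with a logical type, e.g.
#   { "name": "created_at", "type": { "type": "long", "logicalType": "timestamp-micros" } }
# then ask bq to honor the annotation:
bq load --source_format=AVRO --use_avro_logical_types \
    mydataset.mytable gs://my-bucket/data.avro
```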

How to read and write Map<String, Object> from/to a Parquet file in Java or Scala?

▼魔方 西西 submitted on 2019-11-30 17:37:13
Looking for a concise example of how to read and write a Map<String, Object> from/to a Parquet file in Java or Scala. Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. looking for the equivalent using Parquet): public static Map<String, Object> read(InputStream inputStream) throws IOException { ObjectMapper objectMapper = new ObjectMapper(); return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() { }); } public static void write(OutputStream outputStream, Map<String, Object> map) throws IOException { ObjectMapper
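Parquet, unlike Jackson, needs a schema before writing, so a fully dynamic Map<String, Object> has to be pinned down first. A hedged sketch using parquet-avro: flatten the map into key/value records with stringified values (an illustrative simplification that discards the original value types):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class MapParquet {
    // One row per map entry; both columns are strings in this sketch.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Entry\",\"fields\":["
        + "{\"name\":\"key\",\"type\":\"string\"},"
        + "{\"name\":\"value\",\"type\":\"string\"}]}");

    public static void write(String file, Map<String, Object> map) throws Exception {
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path(file))
                                  .withSchema(SCHEMA).build()) {
            for (Map.Entry<String, Object> e : map.entrySet()) {
                GenericRecord r = new GenericData.Record(SCHEMA);
                r.put("key", e.getKey());
                r.put("value", String.valueOf(e.getValue())); // type info is lost here
                writer.write(r);
            }
        }
    }

    public static Map<String, Object> read(String file) throws Exception {
        Map<String, Object> out = new HashMap<>();
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path(file)).build()) {
            GenericRecord r;
            while ((r = reader.read()) != null) {
                out.put(r.get("key").toString(), r.get("value").toString());
            }
        }
        return out;
    }
}
```

To preserve heterogeneous value types you would instead serialize each value (e.g. to JSON) or define a proper record schema for the map's known shape.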

How to define a LogicalType in Avro (Java)

落爺英雄遲暮 submitted on 2019-11-30 14:11:31
Question: I need to be able to mark some fields in the Avro schema so that they will be encrypted at serialization time. A logicalType allows marking such fields, and together with a custom conversion it should let them be encrypted transparently by Avro. I had trouble finding documentation on how to define and use a new logicalType in Avro (avro_1.8.2#Logical+Types), so I decided to share what I found here in the answer, to ease the life of anyone else getting into it and to get some feedback
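For orientation, a hedged sketch of the Avro 1.8 registration pattern; the type name "encrypted" and the bytes-only restriction are illustrative choices, not anything the library mandates:

```java
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

// A custom logical type marking a field as holding encrypted content.
public class EncryptedLogicalType extends LogicalType {
    public static final String NAME = "encrypted";

    public EncryptedLogicalType() {
        super(NAME);
    }

    @Override
    public void validate(Schema schema) {
        super.validate(schema);
        // Illustrative restriction: only allow this type on bytes fields.
        if (schema.getType() != Schema.Type.BYTES) {
            throw new IllegalArgumentException(
                "Logical type '" + NAME + "' must back a bytes type");
        }
    }
}
```

Registering it once at startup, e.g. LogicalTypes.register(EncryptedLogicalType.NAME, schema -> new EncryptedLogicalType());, lets schema parsing resolve fields annotated with "logicalType": "encrypted"; a matching Conversion subclass would then perform the actual encryption and decryption.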

Avro field default values

偶尔善良 submitted on 2019-11-30 12:28:10
Question: I am running into some issues setting up default values for Avro fields. I have a simple schema, given below as data.avsc: { "namespace":"test", "type":"record", "name":"Data", "fields":[ { "name": "id", "type": [ "long", "null" ] }, { "name": "value", "type": [ "string", "null" ] }, { "name": "raw", "type": [ "bytes", "null" ] } ] } I am using the avro-maven-plugin v1.7.6 to generate the Java model. When I create an instance of the model using Data data = Data.newBuilder().build();, it
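For reference, in Avro the default value of a union field must correspond to the first branch of the union, so a field that should default to null needs "null" listed first together with an explicit default; with that in place, newBuilder().build() can succeed without the field being set:

```json
{ "name": "id", "type": ["null", "long"], "default": null }
```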

Confluent Maven repository not working?

僤鯓⒐⒋嵵緔 submitted on 2019-11-30 11:13:54
I need to use the Confluent kafka-avro-serializer Maven artifact. According to the official guide, I should add this repository to my Maven pom: <repository> <id>confluent</id> <url>http://packages.confluent.io/maven/</url> </repository> The problem is that the URL http://packages.confluent.io/maven/ does not seem to work at the moment, as I get the response below: <Error> <Code>NoSuchKey</Code> <Message>The specified key does not exist.</Message> <Key>maven/</Key> <RequestId>15E287D11E5D4DFA</RequestId> <HostId> QVr9lCF0y3SrQoa1Z0jDWtmxD3eJz1gAEdivauojVJ+Bexb2gB6JsMpnXc+JjF95i082hgSLJSM= </HostId> </Error>
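Worth noting: that NoSuchKey response is the S3 backend answering a browser GET on the bare maven/ prefix, which does not by itself mean the repository is broken, since Maven requests full artifact paths underneath it. A hedged pom sketch, preferring HTTPS; the version number is a placeholder to match your Kafka/Confluent release:

```xml
<repositories>
  <repository>
    <id>confluent</id>
    <url>https://packages.confluent.io/maven/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>io.confluent</groupId>
    <artifactId>kafka-avro-serializer</artifactId>
    <version>5.3.0</version> <!-- placeholder version -->
  </dependency>
</dependencies>
```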