parquet

Parquet Data timestamp columns INT96 not yet implemented in Druid Overlord Hadoop task

Submitted by 半城伤御伤魂 on 2019-12-20 03:43:25
Question: Context: I am able to submit a MapReduce job from the Druid Overlord to an EMR cluster. My data source is in S3 in Parquet format. I have a timestamp column (INT96) in the Parquet data which is not supported by the Avro schema converter. The error occurs while parsing the timestamp. Stack trace:

    Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96
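
One way around this, assuming the source data can be re-materialized, is to rewrite it with Spark so that timestamps are stored as INT64 instead of INT96, which the parquet-avro AvroSchemaConverter can handle. This is a minimal sketch, not the post's solution; the bucket paths are placeholders and the spark.sql.parquet.outputTimestampType setting requires Spark 2.3 or later.

```scala
import org.apache.spark.sql.SparkSession

object RewriteInt96Timestamps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rewrite-int96-timestamps")
      // Spark 2.3+: store timestamp columns as INT64 micros instead of legacy INT96.
      .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
      .getOrCreate()

    // Spark can still read the INT96 timestamps; the copy it writes uses INT64,
    // which Druid's Avro-based Parquet parsing can convert.
    spark.read.parquet("s3://my-bucket/source-int96/")
      .write.mode("overwrite")
      .parquet("s3://my-bucket/source-int64/")

    spark.stop()
  }
}
```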

How to avoid reading old files from S3 when appending new data?

Submitted by 允我心安 on 2019-12-19 12:06:15
Question: Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

    df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")

In the spark-submit output I can see that significant time is being spent reading old Parquet files, for example:

    16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet'
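
One common workaround, sketched below as an assumption rather than the accepted answer, is to write each batch directly into its target partition directory. Spark then only lists and writes that directory instead of enumerating the whole existing dataset under s3://myBucket/foo.parquet. The partition values and helper name are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object AppendToSinglePartition {
  // Write one batch straight into its partition path; the partition columns are
  // dropped from the data because they are already encoded in the directory name.
  def writeBatch(df: DataFrame, id: String, day: String): Unit = {
    df.drop("id", "day")
      .write
      .mode(SaveMode.Append)
      .parquet(s"s3://myBucket/foo.parquet/id=$id/day=$day")
  }
}

// Readers can still treat the directory tree as one partitioned table, e.g.:
//   spark.read.option("basePath", "s3://myBucket/foo.parquet")
//        .parquet("s3://myBucket/foo.parquet/id=123/day=2016-11-26")
```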

How to append data to an existing parquet file

Submitted by 旧街凉风 on 2019-12-19 07:16:10
Question: I'm using the following code to create a ParquetWriter and to write records to it:

    ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);
    final GenericRecord record = new GenericData.Record(avroSchema);
    parquetWriter.write(record);

But it only allows creating new files (at the specified path). Is there a way to append data to an existing Parquet file (at path)? Caching the parquetWriter is not feasible in my case. Answer 1:
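
Parquet writes its file metadata in a footer when the writer is closed, so appending to an already-closed file is not something the format is designed for. A common pattern, sketched below as an assumption rather than the post's accepted answer, is to write each batch to a new, uniquely named file in the same directory and treat the directory as the dataset. The helper name and file-naming scheme are hypothetical.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object PerBatchParquetFiles {
  // Each call produces one new Parquet file; readers simply scan the directory.
  def writeBatch(dir: String, avroSchema: Schema, records: Seq[GenericRecord]): Unit = {
    val file = new Path(s"$dir/part-${System.currentTimeMillis()}-${java.util.UUID.randomUUID()}.parquet")
    val writer = AvroParquetWriter.builder[GenericRecord](file)
      .withSchema(avroSchema)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()
    try records.foreach(r => writer.write(r))
    finally writer.close()
  }
}
```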

Py4JJavaError: An error occurred while calling o26.parquet. (Reading Parquet file)

Submitted by 点点圈 on 2019-12-19 03:43:29
Question: I am trying to read a Parquet file in PySpark but getting a Py4JJavaError. I even tried reading it from the spark-shell and was able to do so. I cannot understand what I am doing wrong with the Python APIs, such that it works in Scala but not in PySpark:

    spark = SparkSession.builder.master("local").appName("test-read").getOrCreate()
    sdf = spark.read.parquet("game_logs.parquet")

Stack trace:

    Py4JJavaError Traceback (most recent call last)
    <timed exec> in <module>()
    ~/pyenv/pyenv/lib

Saving Spark DataFrames as parquet files - no errors, but data is not being saved

Submitted by 只愿长相守 on 2019-12-19 03:10:37
Question: I want to save a DataFrame as a Parquet file in Python, but I am only able to save the schema, not the data itself. I have reduced my problem down to a very simple Python test case, copied below from an IPython notebook. Any advice on what might be going on?

    In [2]: import math
            import string
            import datetime
            import numpy as np
            import matplotlib.pyplot
            from pyspark.sql import *
            import pylab
            import random
            import time

    In [3]: sqlContext = SQLContext(sc)
            # create a simple 1-column dataframe with a single row

How to read and write Map<String, Object> from/to parquet file in Java or Scala?

Submitted by 你。 on 2019-12-18 19:06:11
Question: I am looking for a concise example of how to read and write a Map<String, Object> from/to a Parquet file in Java or Scala. Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. I am looking for the equivalent using Parquet):

    public static Map<String, Object> read(InputStream inputStream) throws IOException {
        ObjectMapper objectMapper = new ObjectMapper();
        return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() { });
    }

    public static
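
Parquet is schema-based, so a fully dynamic Map<String, Object> has no direct equivalent; the usual compromises are to fix the value type or to serialize each map to a JSON string column. The sketch below shows the fixed-value-type case in Scala with Spark, round-tripping Map[String, String] through Parquet's native MAP type. It is an illustration under those assumptions, not the post's answer, and the path and class names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Wrapper row type: one map per record, stored as a Parquet MAP column.
case class Record(data: Map[String, String])

object MapRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-roundtrip").getOrCreate()
    import spark.implicits._

    val rows = Seq(
      Record(Map("name" -> "alice", "city" -> "berlin")),
      Record(Map("name" -> "bob"))
    )

    // Write the maps to Parquet...
    rows.toDS().write.mode("overwrite").parquet("/tmp/maps.parquet")

    // ...and read them back as Map[String, String] values.
    val restored = spark.read.parquet("/tmp/maps.parquet").as[Record].collect()
    restored.foreach(r => println(r.data))

    spark.stop()
  }
}
```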

Write POJOs to a parquet file using reflection

Submitted by 假如想象 on 2019-12-18 17:14:56
Question: Hi, I am looking for APIs to write Parquet with the POJOs that I have. I was able to generate an Avro schema using reflection and then create a Parquet schema using AvroSchemaConverter. However, I am not able to find a way to convert the POJOs to GenericRecords (Avro); otherwise I could have used AvroParquetWriter to write out the POJOs into Parquet files. Any suggestions? Answer 1: If you want to go through Avro you have two options: 1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs
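
A third route, sketched here as an assumption built on parquet-avro's reflect support rather than on the truncated answer above, is to skip GenericRecord entirely: Avro's ReflectData can both derive the schema from a plain class and act as the writer's data model, so existing POJO instances are written directly. The Person class and output path are placeholders.

```scala
import org.apache.avro.reflect.ReflectData
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Stand-in for an existing POJO: mutable fields plus a no-arg constructor,
// which is what Avro reflection expects (the no-arg constructor matters for reads).
class Person(var name: String, var age: Int) {
  def this() = this(null, 0)
}

object ReflectParquetWriterExample {
  def main(args: Array[String]): Unit = {
    val schema = ReflectData.get().getSchema(classOf[Person])
    val writer = AvroParquetWriter.builder[Person](new Path("/tmp/people.parquet"))
      .withSchema(schema)
      .withDataModel(ReflectData.get()) // use reflection instead of GenericRecord
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()

    try {
      writer.write(new Person("alice", 30))
      writer.write(new Person("bob", 25))
    } finally writer.close()
  }
}
```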