parquet

Parquet Data timestamp columns INT96 not yet implemented in Druid Overlord Hadoop task

Submitted by 半城伤御伤魂 on 2019-12-20 03:43:25
Question: Context: I am able to submit a MapReduce job from the Druid Overlord to an EMR cluster. My data source is in S3 in Parquet format. I have a timestamp column (INT96) in the Parquet data which is not supported by the Avro schema converter. The error occurs while parsing the timestamp. Stack trace:

    Error: java.lang.IllegalArgumentException: INT96 not yet implemented.
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96(AvroSchemaConverter.java:279)
        at org.apache.parquet.avro.AvroSchemaConverter$1.convertINT96
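
One way around this, assuming the source data can be re-materialized, is to rewrite it with Spark so that timestamps are stored as INT64 instead of INT96, which the parquet-avro AvroSchemaConverter can handle. This is a minimal sketch, not the post's solution; the bucket paths are placeholders and the spark.sql.parquet.outputTimestampType setting requires Spark 2.3 or later.

```scala
import org.apache.spark.sql.SparkSession

object RewriteInt96Timestamps {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rewrite-int96-timestamps")
      // Spark 2.3+: store timestamp columns as INT64 micros instead of legacy INT96.
      .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
      .getOrCreate()

    // Spark can still read the INT96 timestamps; the copy it writes uses INT64,
    // which Druid's Avro-based Parquet parsing can convert.
    spark.read.parquet("s3://my-bucket/source-int96/")
      .write.mode("overwrite")
      .parquet("s3://my-bucket/source-int64/")

    spark.stop()
  }
}
```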

How to avoid reading old files from S3 when appending new data?

Submitted by 允我心安 on 2019-12-19 12:06:15
Question: Once every 2 hours, a Spark job runs to convert some tgz files to Parquet. The job appends the new data to an existing Parquet dataset in S3:

    df.write.mode("append").partitionBy("id","day").parquet("s3://myBucket/foo.parquet")

In the spark-submit output I can see that significant time is being spent reading old Parquet files, for example:

    16/11/27 14:06:15 INFO S3NativeFileSystem: Opening 's3://myBucket/foo.parquet/id=123/day=2016-11-26/part-r-00003-b20752e9-5d70-43f5-b8b4-50b5b4d0c7da.snappy.parquet'
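
One common workaround, sketched below as an assumption rather than the accepted answer, is to write each batch directly into its target partition directory. Spark then only lists and writes that directory instead of enumerating the whole existing dataset under s3://myBucket/foo.parquet. The partition values and helper name are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object AppendToSinglePartition {
  // Write one batch straight into its partition path; the partition columns are
  // dropped from the data because they are already encoded in the directory name.
  def writeBatch(df: DataFrame, id: String, day: String): Unit = {
    df.drop("id", "day")
      .write
      .mode(SaveMode.Append)
      .parquet(s"s3://myBucket/foo.parquet/id=$id/day=$day")
  }
}

// Readers can still treat the directory tree as one partitioned table, e.g.:
//   spark.read.option("basePath", "s3://myBucket/foo.parquet")
//        .parquet("s3://myBucket/foo.parquet/id=123/day=2016-11-26")
```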

How to append data to an existing parquet file

Submitted by 旧街凉风 on 2019-12-19 07:16:10
Question: I'm using the following code to create a ParquetWriter and to write records to it:

    ParquetWriter<GenericRecord> parquetWriter = new ParquetWriter(path, writeSupport, CompressionCodecName.SNAPPY, BLOCK_SIZE, PAGE_SIZE);
    final GenericRecord record = new GenericData.Record(avroSchema);
    parquetWriter.write(record);

But it only allows creating new files (at the specified path). Is there a way to append data to an existing Parquet file (at path)? Caching the parquetWriter is not feasible in my case. Answer 1:
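
Parquet writes its file metadata in a footer when the writer is closed, so appending to an already-closed file is not something the format is designed for. A common pattern, sketched below as an assumption rather than the post's accepted answer, is to write each batch to a new, uniquely named file in the same directory and treat the directory as the dataset. The helper name and file-naming scheme are hypothetical.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object PerBatchParquetFiles {
  // Each call produces one new Parquet file; readers simply scan the directory.
  def writeBatch(dir: String, avroSchema: Schema, records: Seq[GenericRecord]): Unit = {
    val file = new Path(s"$dir/part-${System.currentTimeMillis()}-${java.util.UUID.randomUUID()}.parquet")
    val writer = AvroParquetWriter.builder[GenericRecord](file)
      .withSchema(avroSchema)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()
    try records.foreach(r => writer.write(r))
    finally writer.close()
  }
}
```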

Py4JJavaError: An error occurred while calling o26.parquet. (Reading Parquet file)

Submitted by 点点圈 on 2019-12-19 03:43:29
Question: I am trying to read a Parquet file in PySpark but getting a Py4JJavaError. I even tried reading it from the spark-shell and was able to do so. I cannot understand what I am doing wrong with the Python APIs, such that it works in Scala but not in PySpark:

    spark = SparkSession.builder.master("local").appName("test-read").getOrCreate()
    sdf = spark.read.parquet("game_logs.parquet")

Stack trace:

    Py4JJavaError Traceback (most recent call last)
    <timed exec> in <module>()
    ~/pyenv/pyenv/lib

Saving Spark DataFrames as parquet files - no errors, but data is not being saved

Submitted by 只愿长相守 on 2019-12-19 03:10:37
Question: I want to save a DataFrame as a Parquet file in Python, but I am only able to save the schema, not the data itself. I have reduced my problem down to a very simple Python test case, copied below from an IPython notebook. Any advice on what might be going on?

    In [2]: import math
            import string
            import datetime
            import numpy as np
            import matplotlib.pyplot
            from pyspark.sql import *
            import pylab
            import random
            import time

    In [3]: sqlContext = SQLContext(sc)
            # create a simple 1-column dataframe with a single row

How to read and write Map<String, Object> from/to parquet file in Java or Scala?

Submitted by 你。 on 2019-12-18 19:06:11
Question: I am looking for a concise example of how to read and write a Map<String, Object> from/to a Parquet file in Java or Scala. Here is the expected structure, using com.fasterxml.jackson.databind.ObjectMapper as the serializer in Java (i.e. I am looking for the equivalent using Parquet):

    public static Map<String, Object> read(InputStream inputStream) throws IOException {
        ObjectMapper objectMapper = new ObjectMapper();
        return objectMapper.readValue(inputStream, new TypeReference<Map<String, Object>>() { });
    }

    public static
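
Parquet is schema-based, so a fully dynamic Map<String, Object> has no direct equivalent; the usual compromises are to fix the value type or to serialize each map to a JSON string column. The sketch below shows the fixed-value-type case in Scala with Spark, round-tripping Map[String, String] through Parquet's native MAP type. It is an illustration under those assumptions, not the post's answer, and the path and class names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Wrapper row type: one map per record, stored as a Parquet MAP column.
case class Record(data: Map[String, String])

object MapRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-roundtrip").getOrCreate()
    import spark.implicits._

    val rows = Seq(
      Record(Map("name" -> "alice", "city" -> "berlin")),
      Record(Map("name" -> "bob"))
    )

    // Write the maps to Parquet...
    rows.toDS().write.mode("overwrite").parquet("/tmp/maps.parquet")

    // ...and read them back as Map[String, String] values.
    val restored = spark.read.parquet("/tmp/maps.parquet").as[Record].collect()
    restored.foreach(r => println(r.data))

    spark.stop()
  }
}
```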

Write POJOs to a parquet file using reflection

Submitted by 假如想象 on 2019-12-18 17:14:56
Question: Hi, I am looking for APIs to write Parquet with the POJOs that I have. I was able to generate an Avro schema using reflection and then create a Parquet schema using AvroSchemaConverter. However, I am not able to find a way to convert the POJOs to GenericRecords (Avro); otherwise I could have used AvroParquetWriter to write out the POJOs into Parquet files. Any suggestions? Answer 1: If you want to go through Avro you have two options: 1) Let Avro generate your POJOs (see the tutorial here). The generated POJOs
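
A third route, sketched here as an assumption built on parquet-avro's reflect support rather than on the truncated answer above, is to skip GenericRecord entirely: Avro's ReflectData can both derive the schema from a plain class and act as the writer's data model, so existing POJO instances are written directly. The Person class and output path are placeholders.

```scala
import org.apache.avro.reflect.ReflectData
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Stand-in for an existing POJO: mutable fields plus a no-arg constructor,
// which is what Avro reflection expects (the no-arg constructor matters for reads).
class Person(var name: String, var age: Int) {
  def this() = this(null, 0)
}

object ReflectParquetWriterExample {
  def main(args: Array[String]): Unit = {
    val schema = ReflectData.get().getSchema(classOf[Person])
    val writer = AvroParquetWriter.builder[Person](new Path("/tmp/people.parquet"))
      .withSchema(schema)
      .withDataModel(ReflectData.get()) // use reflection instead of GenericRecord
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()

    try {
      writer.write(new Person("alice", 30))
      writer.write(new Person("bob", 25))
    } finally writer.close()
  }
}
```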