parquet

How to write TIMESTAMP logical type (INT96) to parquet, using ParquetWriter?

Submitted by ≡放荡痞女 on 2019-12-06 04:08:47
Question: I have a tool that uses an org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to parquet data files. Currently it only handles int32, double, and string. I need to support the parquet TIMESTAMP logical type (annotated as int96), and I am lost on how to do that because I can't find a precise specification online. It appears this timestamp encoding (int96) is rare and not well supported; I've found very few specification details online. This GitHub README states that: …
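The question targets the Java ParquetWriter, where the 12-byte INT96 value has to be assembled by hand. Purely as a point of reference, here is a minimal Python/pyarrow sketch (file and column names are illustrative) that produces the same INT96 encoding through a writer flag:

    import datetime

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a one-row table with a timestamp column.
    table = pa.table({
        "event_time": pa.array(
            [datetime.datetime(2019, 12, 6, 4, 8, 47)],
            type=pa.timestamp("ns"),
        ),
    })

    # use_deprecated_int96_timestamps=True writes the column as INT96
    # (the legacy encoding Hive/Impala expect) instead of INT64.
    pq.write_table(table, "timestamps_int96.parquet",
                   use_deprecated_int96_timestamps=True)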

What controls the number of partitions when reading Parquet files?

Submitted by 人走茶凉 on 2019-12-06 03:56:08
My setup: two Spark clusters, one on EC2 and one on Amazon EMR, both with Spark 1.3.1. The EMR cluster was installed with emr-bootstrap-actions; the EC2 cluster was installed with Spark's default EC2 scripts. The code: read a folder containing 12 Parquet files and count the number of partitions: val logs = sqlContext.parquetFile("s3n://mylogs/") logs.rdd.partitions.length Observations: on EC2 this code gives me 12 partitions (one per file, which makes sense); on EMR this code gives me 138 (!) partitions. Question: What controls the number of partitions when reading Parquet files? I read the exact …
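Independent of why the two clusters split the input differently, the partition count can be inspected and then overridden explicitly. A minimal PySpark sketch (modern SparkSession API; the original used Spark 1.3's sqlContext.parquetFile):

    # Inspect the partition count the reader chose, then override it explicitly.
    logs = spark.read.parquet("s3n://mylogs/")
    print(logs.rdd.getNumPartitions())   # 12 on the EC2 cluster, 138 on EMR

    # Coalesce to a fixed number of partitions, independent of the block-size /
    # input-split settings that differ between the two Hadoop builds.
    logs_12 = logs.coalesce(12)
    print(logs_12.rdd.getNumPartitions())  # 12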

Impala: How to query against multiple parquet files with different schemata

Submitted by 纵饮孤独 on 2019-12-06 01:41:07
In Spark 2.1 I often use something like df = spark.read.parquet("/path/to/my/files/*.parquet") to load a folder of parquet files, even with different schemata. Then I perform some SQL queries against the dataframe using SparkSQL. Now I want to try Impala because I read the wiki article, which contains sentences like: "Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop [...]. Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet." So it sounds like it could also fit to …
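For context on the Spark side the question starts from, a short sketch of how Spark reconciles the differing schemata when reading such a folder (the mergeSchema option does the work; the path is reused from the question):

    # Read a folder of parquet files whose schemata differ and union the columns.
    df = (spark.read
          .option("mergeSchema", "true")   # reconcile differing schemata
          .parquet("/path/to/my/files/*.parquet"))
    df.printSchema()   # union of all columns; values missing in a file come back as null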

How to handle null values when writing to parquet from Spark

Submitted by 眉间皱痕 on 2019-12-06 01:29:58
Question: Until recently parquet did not support null values - a questionable premise. In fact a recent version did finally add that support: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md However, it will be a long time before Spark supports that new parquet feature - if ever. Here is the associated (closed - will not fix) JIRA: https://issues.apache.org/jira/browse/SPARK-10943 So what are folks doing today with regard to null column values when writing out dataframes to …
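One workaround that comes up for this case (a hedged sketch; the column name is hypothetical): give an entirely-null column an explicit type before writing, since a column Spark infers as NullType is what the parquet writer rejects.

    from pyspark.sql import functions as F

    # A column that is entirely null has no concrete type; cast it explicitly
    # so the parquet writer can encode it.
    fixed = df.withColumn("optional_note", F.lit(None).cast("string"))
    fixed.write.mode("overwrite").parquet("/tmp/with_nulls")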

Streaming parquet file python and only downsampling

Submitted by ↘锁芯ラ on 2019-12-05 21:53:00
I have data in parquet format which is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample, and save to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt to do this without using a Spark framework? I have tried using pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated! Spark is certainly a viable choice for this task. We're planning to add streaming …
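One way to stay within memory without Spark (a sketch assuming pyarrow and an illustrative local file name): read the file one row group at a time, down-sample each chunk, and keep only the samples.

    import pandas as pd
    import pyarrow.parquet as pq

    pf = pq.ParquetFile("big_data.parquet")

    samples = []
    for i in range(pf.num_row_groups):
        # Only one row group is materialised in memory at a time.
        chunk = pf.read_row_group(i).to_pandas()
        samples.append(chunk.sample(frac=0.01, random_state=42))  # keep ~1% of each chunk

    df = pd.concat(samples, ignore_index=True)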

Overwrite parquet files from dynamic frame in AWS Glue

Submitted by 旧城冷巷雨未停 on 2019-12-05 18:54:27
Question: I use dynamic frames to write a parquet file in S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this: glueContext.write_dynamic_frame.from_options(frame = table, connection_type = "s3", connection_options = {"path": output_dir, "partitionKeys": ["var1","var2"]}, format = "parquet") Is there anything like "mode":"overwrite" that replaces my parquet files? Answer 1: Currently AWS Glue doesn't support 'overwrite' mode, but they are working …
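A sketch of one common workaround (assuming the table DynamicFrame and output_dir from the question): drop down from the DynamicFrame to a Spark DataFrame, whose writer does have an overwrite mode.

    # Convert the DynamicFrame to a Spark DataFrame and overwrite the target path.
    (table.toDF()
          .write
          .mode("overwrite")
          .partitionBy("var1", "var2")
          .parquet(output_dir))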

Design of Spark + Parquet “database”

Submitted by 大城市里の小女人 on 2019-12-05 18:17:07
I've got 100 GB of text files coming in daily, and I wish to create an efficient "database" accessible from Spark. By "database" I mean the ability to execute fast queries on the data (going back about a year) and to incrementally add data each day, preferably without read locks. Assuming I want to use Spark SQL and parquet, what's the best way to achieve this? Options: give up on concurrent reads/writes and append new data to the existing parquet file; or create a new parquet file for each day of data, and use the fact that Spark can load multiple parquet files to allow me to load e.g. an entire year. This …
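A minimal sketch of the second option described above (one partitioned parquet layout, appended to daily; the date column and paths are illustrative):

    # Append each day's data as a new partition under one parquet root.
    (daily_df.write
             .mode("append")
             .partitionBy("event_date")
             .parquet("s3n://warehouse/events"))

    # Readers load the root and rely on partition pruning for ranged queries.
    last_year = (spark.read.parquet("s3n://warehouse/events")
                 .where("event_date >= '2019-01-01'"))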

How to Query parquet data from Amazon Athena?

Submitted by 你说的曾经没有我的故事 on 2019-12-05 14:54:32
Athena creates a temporary table using fields of a table stored in S3. I have done this using JSON data. Could you help me with how to create a table using parquet data? I have tried the following: converted sample JSON data to parquet data, uploaded the parquet data to S3, and created a temporary table using the columns of the JSON data. By doing this I am able to execute a query, but the result is empty. Is this approach right, or is there another approach to follow for parquet data? Sample JSON data: {"_id":"0899f824e118d390f57bc2f279bd38fe","_rev":"1-81cc25723e02f50cb6fef7ce0b0f4f38","deviceId":"BELT001","timestamp":"2016 …
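A hedged sketch of the likely fix (bucket names, database, and column definitions are illustrative): the table has to be declared as parquet; reusing the DDL written for the JSON data leaves Athena reading the parquet files with the wrong SerDe, which would explain the empty results.

    import boto3

    ddl = """
    CREATE EXTERNAL TABLE IF NOT EXISTS device_readings (
      `_id`       string,
      `_rev`      string,
      `deviceId`  string,
      `timestamp` string
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/parquet-data/'
    """

    athena = boto3.client("athena", region_name="us-east-1")
    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "default"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )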

Why is parquet slower for me against text file format in hive?

Submitted by 99封情书 on 2019-12-05 11:15:47
OK! So I decided to use Parquet as the storage format for Hive tables, and before I actually implement it in my cluster, I decided to run some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files. Please note that I am using Hive 0.13 on MapR. The flow of my operations follows: Table A - format: text, size: 2.5 GB. Table B - format: Parquet, size: 1.9 GB [create table B stored as parquet as select * from A]. Table C - format: Parquet with snappy compression, size: 1.9 GB [create table C stored as parquet …
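For reference, a sketch of the table-creation flow described above, issued from Python through pyhive (assumes a reachable HiveServer2; host and port are placeholders - the original ran the statements in the Hive CLI on Hive 0.13):

    from pyhive import hive

    cursor = hive.connect(host="hive-server", port=10000).cursor()

    # Table B: plain parquet copy of the text-format table A.
    cursor.execute("CREATE TABLE b STORED AS PARQUET AS SELECT * FROM a")

    # Table C: parquet with snappy compression; the codec is set before the CTAS.
    cursor.execute("SET parquet.compression=SNAPPY")
    cursor.execute("CREATE TABLE c STORED AS PARQUET AS SELECT * FROM a")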

Reading/writing with Avro schemas AND Parquet format in SparkSQL

Submitted by 落爺英雄遲暮 on 2019-12-05 11:08:26
I'm trying to write and read Parquet files from SparkSQL. For reasons of schema evolution, I would like to use Avro schemas with my writes and reads. My understanding is that this is possible outside of Spark (or manually within Spark) using e.g. AvroParquetWriter and Avro's Generic API. However, I would like to use SparkSQL's write() and read() methods (which work with DataFrameWriter and DataFrameReader) and which integrate well with SparkSQL (I will be writing and reading Datasets). I can't for the life of me figure out how to do this, and am wondering if this is possible at all. The only …
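One pattern sometimes used here (a hedged sketch, not SparkSQL's own Avro integration): keep the Avro schema as the source of truth, hand-map it to a Spark StructType, and pass that to the DataFrameReader, so reads stay pinned to the Avro definition even as individual parquet files evolve. The mapping below covers only a few primitive types, and every name in it is illustrative.

    import json
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, LongType, DoubleType)

    AVRO_TO_SPARK = {"string": StringType(), "long": LongType(), "double": DoubleType()}

    def avro_to_struct(avro_schema_json):
        # Translate a flat Avro record schema into a Spark StructType.
        avro = json.loads(avro_schema_json)
        fields = [StructField(f["name"], AVRO_TO_SPARK[f["type"]], nullable=True)
                  for f in avro["fields"]]
        return StructType(fields)

    schema = avro_to_struct('{"type":"record","name":"Event","fields":'
                            '[{"name":"id","type":"long"},'
                            '{"name":"payload","type":"string"}]}')

    # Reads (and subsequent writes) are pinned to the Avro-derived schema.
    df = spark.read.schema(schema).parquet("/data/events_parquet")
    df.write.parquet("/data/events_parquet_out")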