parquet

How to efficiently read multiple small parquet files with Spark? Is there a CombineParquetInputFormat?

余生颓废 submitted on 2019-12-04 06:38:46
Question: Spark generated multiple small parquet files. How can one efficiently handle small parquet files, on both the producer and the consumer Spark jobs? Answer 1: The most straightforward approach, IMHO, is to use repartition/coalesce (prefer coalesce unless the data is skewed and you want to create same-sized outputs) before writing the parquet files, so that you do not create small files to begin with:
df
  .map(<some transformation>)
  .filter(<some filter>)
  // ...
  .coalesce(<number of partitions>)
  .write
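
A minimal PySpark sketch of the same idea, with hypothetical input/output paths and a placeholder filter standing in for the transformations above:

```python
# Sketch only: paths and the filter condition are assumptions, not from the
# question. The point is to coalesce before the write so each remaining
# partition becomes one reasonably sized parquet file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-parquet").getOrCreate()

df = spark.read.parquet("/data/small_files/")   # hypothetical input

(df
 .filter("value IS NOT NULL")                   # placeholder transformation
 .coalesce(8)                                   # target roughly 8 output files
 .write
 .mode("overwrite")
 .parquet("/data/compacted/"))                  # hypothetical output
```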

What does MSCK REPAIR TABLE do behind the scenes, and why is it so slow?

守給你的承諾、 submitted on 2019-12-04 06:30:34
I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. To do that, you only need to run ls on the root folder of the table (given the table is partitioned by only one column) to get all its partitions, clearly a < 1 s operation. But in practice the operation can take a very long time to execute (or even time out if run on AWS Athena). So my question is: what does MSCK REPAIR TABLE actually do behind the scenes, and why? How does MSCK REPAIR TABLE find the partitions? Additional data in case it's relevant: our data is all on S3, it's both slow when
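
For context, a hedged sketch of the two usual ways partitions get registered; the database, table, and S3 path here are made up:

```python
# Sketch only: both statements are standard Hive/Spark SQL, but the table,
# partition column, and location are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Full repair: scans the table's storage location and registers any
# partitions the metastore does not yet know about.
spark.sql("MSCK REPAIR TABLE mydb.events")

# Targeted alternative: register one known partition explicitly, which
# avoids listing the entire table location.
spark.sql("""
    ALTER TABLE mydb.events
    ADD IF NOT EXISTS PARTITION (dt='2019-12-01')
    LOCATION 's3://my-bucket/events/dt=2019-12-01/'
""")
```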

Overwrite parquet files from dynamic frame in AWS Glue

人盡茶涼 submitted on 2019-12-04 04:41:27
I use dynamic frames to write a parquet file in S3, but if a file already exists my program appends a new file instead of replacing it. The statement I use is this:
glueContext.write_dynamic_frame.from_options(
    frame = table,
    connection_type = "s3",
    connection_options = {"path": output_dir, "partitionKeys": ["var1", "var2"]},
    format = "parquet")
Is there anything like "mode": "overwrite" that replaces my parquet files? Currently AWS Glue doesn't support 'overwrite' mode, but they are working on this feature. As a workaround you can convert the DynamicFrame object to a Spark DataFrame and write it using
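
A hedged sketch of that workaround inside a Glue job; the frame name, partition keys, and output_dir follow the question, the rest is assumed:

```python
# Sketch of the workaround: convert the DynamicFrame to a Spark DataFrame
# and use its writer, which does support "overwrite". 'table' and
# 'output_dir' are the names used in the question.
df = table.toDF()

(df.write
   .mode("overwrite")
   .partitionBy("var1", "var2")
   .parquet(output_dir))
```

Note that by default Spark's overwrite replaces the whole output path, not just the partitions present in the new data, so this behaves differently from a per-partition overwrite.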

Is it better for Spark to select from Hive or select from a file?

依然范特西╮ submitted on 2019-12-04 04:08:15
I was just wondering what people's thoughts were on reading from Hive vs reading from a .csv, .txt, .ORC, or .parquet file. Assuming the underlying Hive table is an external table with the same file format, would you rather read from the Hive table or from the underlying file itself, and why? Mike: tl;dr: I would read it straight from the parquet files. I am using Spark 1.5.2 and Hive 1.2.1. For a 5-million-row x 100-column table, some timings I've recorded are:
val dffile = sqlContext.read.parquet("/path/to/parquets/*.parquet")
val dfhive = sqlContext.table("db.table")
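
In PySpark terms, the comparison being timed looks roughly like this; the path and table name are placeholders, and the numbers will depend heavily on the metastore and the data:

```python
# Rough sketch of the two read paths being compared; an action (count) is
# forced so that something is actually read.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

start = time.time()
df_file = spark.read.parquet("/path/to/parquets/*.parquet")  # straight from the files
df_file.count()
print("files:", time.time() - start)

start = time.time()
df_hive = spark.table("db.table")                            # through the Hive metastore
df_hive.count()
print("hive :", time.time() - start)
```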

Does Spark maintain parquet partitioning on read?

旧城冷巷雨未停 submitted on 2019-12-04 03:07:51
I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Now later on I would like to read the parquet file, so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the dataframe partitioned by "DATE"? In other words, if a parquet file is partitioned, does Spark maintain that partitioning when reading it into a Spark dataframe? Or is it
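
One way to see what actually happens, as a self-contained sketch with a made-up DATE column and a local path: the DATE directory structure is rediscovered as a column on read, but the in-memory partitioning after the read is determined by file splits rather than carried over from the write.

```python
# Sketch: write a partitioned layout, read it back, and inspect how rows
# are distributed across the in-memory partitions. Path and column are
# placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, spark_partition_id

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).withColumn("DATE", (col("id") % 5).cast("string"))

(df.repartition(col("DATE"))
   .write.mode("overwrite")
   .partitionBy("DATE")
   .parquet("/tmp/parquet_by_date"))

df2 = spark.read.parquet("/tmp/parquet_by_date")
print(df2.rdd.getNumPartitions())                  # driven by file splits
df2.groupBy(spark_partition_id()).count().show()   # rows per in-memory partition
```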

How to write Parquet metadata with pyarrow?

丶灬走出姿态 submitted on 2019-12-04 03:07:28
Question: I use pyarrow to create and analyse Parquet tables with biological information, and I need to store some metadata, e.g. which sample the data comes from and how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but this seems like overkill, since my metadata is the same for all row groups in the file. Is there any way to write file-wide Parquet metadata
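
One commonly used approach (hedged, since the excerpt cuts off before the accepted answer) is to attach key/value pairs to the table's schema metadata before writing, which pyarrow stores at the file level:

```python
# Sketch: attach file-wide key/value metadata by replacing the schema
# metadata before the write. Keys and values are bytes; the sample fields
# here are made up.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"gene": ["BRCA1", "TP53"], "count": [12, 7]})

custom = {b"sample_id": b"S-001", b"pipeline": b"v2.3"}
merged = {**(table.schema.metadata or {}), **custom}
table = table.replace_schema_metadata(merged)

pq.write_table(table, "annotated.parquet")

# The metadata travels with the file and can be read back from the schema.
print(pq.read_schema("annotated.parquet").metadata)
```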

Comparison of the Hive storage formats ORC and PARQUET

梦想与她 submitted on 2019-12-03 21:14:19
Hive commonly uses three storage formats: TEXT, ORC, and PARQUET. TEXT is the default format; ORC and PARQUET are columnar formats, and they differ in storage footprint and query efficiency, so I ran a dedicated test and recorded the results here.
1. Differences in the CREATE TABLE statements
create table if not exists text(
    a bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\001'
location '/hdfs/text/';

create table if not exists orc(
    a bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\001'
stored as orc
location '/hdfs/orc/';

create table if not exists parquet(
    a bigint
)
partitioned by (dt string)
row format delimited fields terminated by '\001'
stored as parquet
location '/hdfs/parquet/';
The only real difference is what follows stored as.
2. HDFS storage comparison

SparkR collect() and head() error for Spark DataFrame: arguments imply differing number of rows

≡放荡痞女 submitted on 2019-12-03 20:16:59
I read a parquet file from an HDFS system:
path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)
root
 |-- app: binary (nullable = true)
 |-- category: binary (nullable = true)
 |-- date: binary (nullable = true)
 |-- user: binary (nullable = true)
class(AppDF)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"
collect(AppDF)
.....error: arguments imply differing number of rows: 46021, 39175, 62744, 27137
head(AppDF)
.....error: arguments imply differing number of rows: 36, 30, 48
I've read some threads about this problem, but that's not my case. In fact, I just read a table from
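
The excerpt stops before the resolution, but in PySpark terms the usual workarounds for string data that shows up as binary look roughly like this (the config key is a real Spark option; the path mirrors the question, everything else is assumed):

```python
# Hedged sketch: two common ways to handle parquet string columns that are
# typed as binary, so that collecting rows to the driver behaves sanely.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         # Interpret un-annotated binary columns as strings at read time.
         .config("spark.sql.parquet.binaryAsString", "true")
         .getOrCreate())

df = spark.read.parquet("hdfs://part_2015")

# Or cast explicitly after the read, before collecting.
df = df.select(*(col(c).cast("string").alias(c) for c in df.columns))
rows = df.limit(5).collect()
```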

Read Parquet file stored in S3 with AWS Lambda (Python 3)

99封情书 submitted on 2019-12-03 15:02:45
I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
1. Use https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
2. Use this procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
3. Add a test Python function to the zip, send it to S3, update the Lambda, and test it.
It seems that there are two possible approaches, which both work locally to
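
For illustration, a minimal sketch of one such approach (pyarrow plus s3fs) inside a Lambda handler; the bucket and key names are placeholders, and packaging the native wheels is exactly what the docker-lambda step above is for:

```python
# Sketch of a handler that reads a parquet object from S3, does some
# processing, and writes a result back. Bucket/key names are hypothetical
# and error handling is omitted.
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

def handler(event, context):
    table = pq.read_table("my-bucket/input/data.parquet", filesystem=fs)
    # ... process the table here ...
    pq.write_table(table, "/tmp/processed.parquet")   # /tmp is the writable area in Lambda
    fs.put("/tmp/processed.parquet", "my-bucket/output/data.parquet")
    return {"rows": table.num_rows}
```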

Spark Exception : Task failed while writing rows

风流意气都作罢 submitted on 2019-12-03 14:48:08
I am reading text files and converting them to parquet files. I am doing it using Spark code, but when I try to run the code I get the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 9, XXXX.XXX.XXX.local): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:191)
    at org.apache.spark.sql.sources
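
The excerpt ends before the root cause, but as a hedged sketch, the conversion itself with an explicit schema and a permissive parse mode (one common way to keep a single bad line from failing the write stage) looks like:

```python
# Minimal sketch of the text-to-parquet conversion; paths, delimiter, and
# schema are placeholders, not taken from the question.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

df = (spark.read
      .option("sep", "\t")
      .option("mode", "DROPMALFORMED")   # drop rows that do not match the schema
      .schema(schema)
      .csv("/data/input/*.txt"))

df.write.mode("overwrite").parquet("/data/output/")
```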