parquet

What are the pros and cons of parquet format compared to other formats?

好久不见 · Submitted on 2019-11-28 02:39:21
Characteristics of Apache Parquet are: self-describing, columnar format, language-independent. In comparison to Avro, SequenceFile, RCFile, etc., I want an overview of the formats. I have already read How Impala Works with Hadoop File Formats; it gives some insights on the formats, but I would like to know how the access to data and storage of data is done in each of these formats. How does Parquet have an advantage over the others? I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to -- text files,
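
As a rough illustration of why the column-oriented layout favors read-heavy workloads: a columnar reader only has to scan the columns a query actually touches, while a record-oriented format scans whole rows. A minimal sketch using Spark's Java API, with a hypothetical file path and column names:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ColumnPruningSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("pruning").getOrCreate();

            // With Parquet, selecting two columns reads only those column chunks from disk;
            // a record-oriented format (plain text, Avro) would have to scan entire rows.
            Dataset<Row> narrow = spark.read()
                    .parquet("/data/events.parquet")   // hypothetical path
                    .select("user_id", "event_time");  // hypothetical columns

            narrow.show(5);
        }
    }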

[In A Word] Chapter 4: Parquet, Avro, ORC

泄露秘密 · Submitted on 2019-11-28 00:57:44
Row-oriented or column-oriented storage: Parquet and ORC both store data in columnar form, while Avro stores data in a row-based format. By nature, column-oriented storage is optimized for read-heavy analytical workloads, while row-based storage is best suited to write-heavy transactional workloads. Compression ratio: the column-based formats Parquet and ORC achieve higher compression ratios than the row-based Avro format. Compatible platforms: ORC is commonly used with Hive and Presto; Parquet is commonly used with Impala, Drill, Spark, and Arrow; Avro is commonly used with Kafka and Druid. Choosing the right storage format for each use case and application scenario improves both storage and read efficiency. Source: https://www.cnblogs.com/szss/p/11384683.html
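
A hedged sketch of how such a comparison can be tried out with Spark's Java API: write the same Dataset in the three formats and compare the on-disk sizes afterwards. The paths are illustrative, and the Avro write assumes the spark-avro module (e.g. org.apache.spark:spark-avro) is on the classpath:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class FormatComparisonSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("format-comparison").getOrCreate();
            Dataset<Row> df = spark.read().json("/data/input.json");  // hypothetical source

            // Columnar formats: typically smaller on disk and faster for analytical scans.
            df.write().mode(SaveMode.Overwrite).parquet("/data/out/parquet");
            df.write().mode(SaveMode.Overwrite).orc("/data/out/orc");

            // Row-based format: typically better for write-heavy, whole-record workloads.
            // Requires the spark-avro package on the classpath.
            df.write().mode(SaveMode.Overwrite).format("avro").save("/data/out/avro");
        }
    }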

How to split parquet files into many partitions in Spark?

夙愿已清 · Submitted on 2019-11-27 22:53:52
So I have just one Parquet file I'm reading with Spark (using the SQL API) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100; we have also tried changing the compression of the Parquet file to none (from gzip). No matter what we do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100 and thereafter obviously things are much, much faster). Now, according to a few sources (like below), Parquet should be splittable (even when using gzip!), so I'm super confused and would love some advice.
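
For context, spark.default.parallelism mainly affects shuffles and RDD defaults, not how the initial file scan is split. One hedged workaround is an explicit repartition right after the read; on Spark 2.x+ the scan granularity can also be nudged with spark.sql.files.maxPartitionBytes. A sketch in the Java API (path and values are illustrative):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SplitParquetSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("split-parquet")
                    // Spark 2.x+: cap how many bytes go into one input partition (128 MB here).
                    .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
                    .getOrCreate();

            // Force 100 partitions for downstream processing; this adds a shuffle,
            // but every stage after the read then runs with the desired parallelism.
            Dataset<Row> df = spark.read()
                    .parquet("/data/single-file.parquet")  // hypothetical path
                    .repartition(100);

            df.count();
        }
    }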

Does Spark support Partition Pruning with Parquet Files

∥☆過路亽.° · Submitted on 2019-11-27 22:05:14
I am working with a large dataset that is partitioned by two columns, plant_name and tag_id. The second partition column, tag_id, has 200,000 unique values, and I mostly access the data by specific tag_id values. If I use the following Spark commands:

sqlContext.setConf("spark.sql.hive.metastorePartitionPruning", "true")
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
val df = sqlContext.sql("select * from tag_data where plant_name='PLANT01' and tag_id='1000'")

I would expect a fast response, as this resolves to a single partition. In Hive and Presto this takes seconds; however, in
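
For reference, the same query can be expressed through the DataFrame API, where filters on the partition columns should show up as pruning predicates in the plan. A sketch in the Java API, with the column values taken from the question and a hypothetical table path:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PartitionPruningSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("pruning-check").getOrCreate();

            // Read the partitioned directory layout directly; /data/tag_data is hypothetical.
            Dataset<Row> df = spark.read().parquet("/data/tag_data")
                    .filter(col("plant_name").equalTo("PLANT01")
                            .and(col("tag_id").equalTo("1000")));

            // The physical plan should list plant_name and tag_id as partition filters,
            // meaning only the matching directory is listed and scanned.
            df.explain();
        }
    }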

Is it better to have one large parquet file or lots of smaller parquet files?

↘锁芯ラ · Submitted on 2019-11-27 17:57:04
Question: I understand HDFS will split files into something like 64 MB chunks. We have data coming in as a stream, and we can store it in large files or medium-sized files. What is the optimum size for columnar file storage? If I can store files where the smallest column is 64 MB, would it save any computation time over having, say, 1 GB files? Answer 1: Aim for around 1 GB per file (Spark partition) (1). Ideally, you would use snappy compression (the default), due to snappy-compressed Parquet files being
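
A hedged sketch of how a target file size can be approximated in practice with the Java API: estimate the number of output files from the input size and repartition before writing, keeping snappy compression. The numbers and paths are illustrative:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class TargetFileSizeSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("target-file-size").getOrCreate();
            Dataset<Row> df = spark.read().parquet("/data/incoming");  // hypothetical path

            // Rough sizing: if the compressed dataset is ~50 GB and the target is ~1 GB
            // per file, write roughly 50 output partitions. This is approximate, because
            // Parquet encoding and compression change the final on-disk size.
            int targetFiles = 50;

            df.repartition(targetFiles)
              .write()
              .mode(SaveMode.Overwrite)
              .option("compression", "snappy")  // snappy is Spark's default Parquet codec
              .parquet("/data/consolidated");
        }
    }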

Spark SQL saveAsTable is not compatible with Hive when partition is specified

99封情书 · Submitted on 2019-11-27 13:14:36
Question: Kind of an edge case: when saving a Parquet table in Spark SQL with a partition,

// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);
df.coalesce(1)
  .write()
  .mode(SaveMode.Append)
  .format("parquet")
  .partitionBy("year")
  .saveAsTable("tblclick8partitioned");
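
One commonly used workaround for this kind of incompatibility (not necessarily the accepted answer to this question) is to skip saveAsTable entirely: write plain partitioned Parquet to a path, expose it to Hive as an external table, and repair the partition metadata. A sketch on the Spark 2.x+ Java API (the question itself uses the older hiveContext); the warehouse path is hypothetical, and the two columns shown come from the question's schema:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class HiveCompatiblePartitionedWriteSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hive-compatible-write")
                    .enableHiveSupport()
                    .getOrCreate();

            Dataset<Row> df = spark.read().json("/data/clicks.json");  // hypothetical source

            // Write plain partitioned Parquet to a path instead of using saveAsTable.
            df.write()
              .mode(SaveMode.Append)
              .partitionBy("year")
              .parquet("/warehouse/tblclick8partitioned");

            // Point an external Hive table at that location and pick up the partitions.
            spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS tblclick8partitioned "
                    + "(time STRING, accountId STRING) PARTITIONED BY (year STRING) "
                    + "STORED AS PARQUET LOCATION '/warehouse/tblclick8partitioned'");
            spark.sql("MSCK REPAIR TABLE tblclick8partitioned");
        }
    }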

JSON object to Parquet format using Java without converting to Avro (without using Spark, Hive, Pig, Impala)

怎甘沉沦 · Submitted on 2019-11-27 10:57:16
Question: I have a scenario where I need to convert messages present as JSON objects to Apache Parquet format using Java. Any sample code or examples would be helpful. As far as I have found, converting the messages to Parquet uses either Hive, Pig, or Spark. I need to convert to Parquet without involving these, using only Java. Answer 1: To convert JSON data files to Parquet, you need some in-memory representation. Parquet doesn't have its own set of Java objects; instead, it reuses the objects from
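
A minimal sketch of one common route: use the parquet-avro module's in-memory GenericRecord as that representation (the output files are ordinary Parquet, no Avro files are written). It assumes the parquet-avro and Hadoop client libraries are on the classpath; the schema, JSON string, and output path are illustrative:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.JsonDecoder;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class JsonToParquetSketch {
        public static void main(String[] args) throws Exception {
            // Illustrative schema describing the incoming JSON messages.
            String schemaJson = "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
                    + "{\"name\":\"id\",\"type\":\"long\"},"
                    + "{\"name\":\"body\",\"type\":\"string\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            String json = "{\"id\": 1, \"body\": \"hello\"}";

            // Decode the JSON text into an in-memory GenericRecord.
            JsonDecoder decoder = DecoderFactory.get().jsonDecoder(schema, json);
            GenericRecord record = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);

            // Write the record out as a Parquet file.
            try (ParquetWriter<GenericRecord> writer =
                    AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/messages.parquet"))
                            .withSchema(schema)
                            .build()) {
                writer.write(record);
            }
        }
    }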

Reading parquet files from multiple directories in Pyspark

狂风中的少年 · Submitted on 2019-11-27 09:17:53
I need to read Parquet files from multiple paths that are not parent or child directories. For example:

dir1
 |------- dir1_1
 |------- dir1_2
dir2
 |------- dir2_1
 |------- dir2_2

sqlContext.read.parquet(dir1) reads Parquet files from dir1_1 and dir1_2. Right now I'm reading each directory and merging the dataframes using "unionAll". Is there a way to read Parquet files from dir1_2 and dir2_1 without using unionAll, or is there any fancy way using unionAll? Thanks. A little late but I found this while I was searching and it may help someone else... You might also try unpacking the argument list to
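
For reference, the parquet reader accepts several paths in a single call, so the non-sibling directories can simply be listed explicitly (the excerpt above heads toward the same idea via argument unpacking in Python). A sketch in Spark's Java API, using the directory names from the example:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class MultiDirectoryReadSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("multi-dir-read").getOrCreate();

            // parquet(...) takes a varargs list of paths, so arbitrary, non-sibling
            // directories can be combined in one read without unionAll.
            Dataset<Row> df = spark.read().parquet("dir1/dir1_2", "dir2/dir2_1");

            df.printSchema();
        }
    }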

Why are Spark Parquet files for an aggregate larger than the original?

荒凉一梦 · Submitted on 2019-11-27 09:08:43
I am trying to create an aggregate file for end users to utilize, to avoid having them process multiple sources with much larger files. To do that I: A) iterate through all source folders, stripping out the 12 fields that are most commonly requested, and write out Parquet files to a new location where these results are co-located. B) I try to go back through the files created in step A and re-aggregate them by grouping on the 12 fields to reduce the data to a summary row for each unique combination. What I'm finding is that step A reduces the payload 5:1 (roughly 250 gigs becomes 48.5 gigs). Step B
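
One factor that often matters in cases like this (not necessarily the accepted answer here) is that the shuffle introduced by the groupBy destroys the row ordering, which makes Parquet's run-length and dictionary encodings far less effective; clustering and sorting the rows again before the write can shrink the output. A sketch in the Java API with hypothetical paths and column names (only two of the 12 grouping fields shown):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class SortedAggregateWriteSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("sorted-aggregate").getOrCreate();

            Dataset<Row> stepA = spark.read().parquet("/data/step_a");  // hypothetical path

            // Aggregate down to one summary row per unique combination.
            Dataset<Row> summary = stepA.groupBy("field1", "field2").count();

            // Cluster similar values together again before writing so that Parquet's
            // dictionary and run-length encodings stay effective after the shuffle.
            summary.repartition(summary.col("field1"))
                   .sortWithinPartitions("field1", "field2")
                   .write()
                   .mode(SaveMode.Overwrite)
                   .parquet("/data/step_b");
        }
    }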

“Failed to find data source: parquet” when making a fat jar with maven

此生再无相见时 · Submitted on 2019-11-27 07:04:59
Question: I am assembling the fat jar with the Maven assembly plugin and experience the following issue:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: parquet. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:145)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:78)
    at org.apache.spark.sql.execution.datasources.DataSource