Drop partition columns when writing parquet in pyspark


Question


I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.

Here is my approach to partitioning and writing the data:

from pyspark.sql import functions as f

df = (df
    .withColumn('year', f.year(f.col('date_col')))
    .withColumn('month', f.month(f.col('date_col')))
    .withColumn('day', f.dayofmonth(f.col('date_col'))))

df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')

This properly creates the parquet files, including the nested folder structure. However, I do not want the year, month, or day columns in the parquet files themselves.


Answer 1:


Spark/Hive won't write the year, month, day columns into your parquet files, because they are already part of the partitionBy clause.

Example:

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv files

Checking the contents of one csv file:

hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv

Output:

a

As you can see, the id value is not included in the csv file. In the same way, if you write parquet files, the partition columns are not included in the part-*.parquet files.


To check the schema of a parquet file:

parquet-tools schema <hdfs://nn:8020/parquet_file>

This lets you verify exactly which columns are included in your parquet file.
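You can also do the same check from PySpark itself. A minimal sketch, assuming an active SparkSession and the output path from the question; the concrete year=/month=/day= partition path is an assumption, so point it at a folder that actually exists:

# Read one leaf partition directory directly and inspect its schema.
part_df = spark.read.parquet('/mnt/test/test.parquet/year=2020/month=5/day=1')
part_df.printSchema()   # year, month and day will not be listed here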




Answer 2:


If you use df.write.partitionBy('year', 'month', 'day'), these columns are not physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.

For example, partitionBy('year').csv("/data") will create something like:

/data/year=2018/part1---.csv
/data/year=2019/part1---.csv

When you read the data back, Spark uses the special path segments (year=xxx) to populate these columns.

You can prove it by reading the data of a single partition directly; in that case, year will not be a column:

df = spark.read.csv("data/year=2019/")
df.printSchema()
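By contrast, reading from the root path restores the partition columns through partition discovery. A small sketch, reusing the example layout above:

df_all = spark.read.csv("/data")   # root of the partitionBy('year') output
df_all.printSchema()               # 'year' is back, inferred from the year=... directories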

@Shu's answer can also be used to investigate.

You can rest assured that these columns are not taking up storage space.


If you simply don't want to see the columns, you could put a view on top of this table that excludes them.
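For instance, a sketch of that idea; the view name and the path are assumptions, not part of the original answer:

full_df = spark.read.parquet('/mnt/test/test.parquet')
# Drop the partition columns and expose the remaining columns under a temp view.
full_df.drop('year', 'month', 'day').createOrReplaceTempView('my_data_without_partitions')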



Source: https://stackoverflow.com/questions/56743868/drop-partition-columns-when-writing-parquet-in-pyspark
