Drop partition columns when writing parquet in pyspark


Question


I have a dataframe with a date column. I have parsed it into year, month, day columns. I want to partition on these columns, but I do not want the columns to persist in the parquet files.

Here is my approach to partitioning and writing the data:

from pyspark.sql import functions as f

df = (df
    .withColumn('year', f.year(f.col('date_col')))
    .withColumn('month', f.month(f.col('date_col')))
    .withColumn('day', f.dayofmonth(f.col('date_col'))))

df.write.partitionBy('year','month', 'day').parquet('/mnt/test/test.parquet')

This properly creates the parquet files, including the nested folder structure. However, I do not want the year, month, or day columns in the parquet files themselves.


Answer 1:


Spark/Hive won't write the year, month, day columns into your parquet files, because they are already part of the partitionBy clause.

Example:

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
df.coalesce(1).write.partitionBy("id").csv("/user/shu/temporary2") // write csv files

Checking the contents of one csv file:

hadoop fs -cat /user/shu/temporary2/id=1/part-00000-dc55f08e-9143-4b60-a94e-e28b1d7d9285.c000.csv

Output:

a

As you can see, the id value is not included in the csv file. In the same way, if you write parquet files, the partition columns are not included in the part-*.parquet files.


To check the schema of a parquet file:

parquet-tools schema <hdfs://nn:8020/parquet_file>

This lets you verify exactly which columns are included in your parquet file.
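You can also do the same check from PySpark itself. A minimal sketch, assuming an active SparkSession and the output path from the question; the concrete year=/month=/day= partition path is an assumption, so point it at a folder that actually exists:

# Read one leaf partition directory directly and inspect its schema.
part_df = spark.read.parquet('/mnt/test/test.parquet/year=2020/month=5/day=1')
part_df.printSchema()   # year, month and day will not be listed here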




Answer 2:


If you use df.write.partitionBy('year', 'month', 'day'), these columns are not physically stored in the file data. They are simply rendered via the folder structure that partitionBy creates.

For example, partitionBy('year').csv("/data") will create something like:

/data/year=2018/part1---.csv
/data/year=2019/part1---.csv

When you read the data back, Spark uses the special path segments (year=xxx) to populate these columns.

You can prove it by reading the data of a single partition directly; in that case, year will not be a column:

df = spark.read.csv("data/year=2019/")
df.printSchema()
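By contrast, reading from the root path restores the partition columns through partition discovery. A small sketch, reusing the example layout above:

df_all = spark.read.csv("/data")   # root of the partitionBy('year') output
df_all.printSchema()               # 'year' is back, inferred from the year=... directories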

@Shu's answer can also be used to investigate.

You can rest assured that these columns are not taking up storage space.


If you simply don't want to see the columns, you could put a view on top of this table that excludes them.
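For instance, a sketch of that idea; the view name and the path are assumptions, not part of the original answer:

full_df = spark.read.parquet('/mnt/test/test.parquet')
# Drop the partition columns and expose the remaining columns under a temp view.
full_df.drop('year', 'month', 'day').createOrReplaceTempView('my_data_without_partitions')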



Source: https://stackoverflow.com/questions/56743868/drop-partition-columns-when-writing-parquet-in-pyspark
