Preserve dataframe partitioning when writing and re-reading to parquet file


Question


When I write a dataframe with a defined partitioning to disk as parquet file and then re-read the parquet file again, the partitioning is lost. Is there a way to preserve the original partitioning of the dataframe during writing and re-reading?

The example code

//imports needed to run the example (e.g. in spark-shell, where `spark` is in scope)
import java.io.File
import org.apache.spark.sql.SaveMode
import spark.implicits._

//create a dataframe with 100 partitions and print the number of partitions
val originalDf = spark.sparkContext.parallelize(1 to 10000).toDF().repartition(100)
println("partitions before writing to disk: " + originalDf.rdd.partitions.length)

//write the dataframe to a parquet file and count the number of files actually written to disk
originalDf.write.mode(SaveMode.Overwrite).parquet("tmp/testds")
println("files written to disk: " + new File("tmp/testds").list.size)

//re-read the parquet file into a dataframe and print the number of partitions 
val readDf = spark.read.parquet("tmp/testds")
println("partitions after reading from disk: " + readDf.rdd.partitions.length)

prints out

partitions before writing to disk: 100
files written to disk: 202
partitions after reading from disk: 4

Observations:

  • The first number is the expected result, the dataframe consists of 100 partitions
  • The second number also looks right to me: there are 100 *.parquet files, 100 *.parquet.crc files, plus the _SUCCESS marker and its .crc file, so the data on disk still consists of 100 part files, one per original partition
  • The third line shows that after re-reading the parquet file the original partitioning is lost and the number of partitions has changed; the new number of partitions appears to be related to the number of executors in my Spark cluster (see the config sketch after this list)
  • The results are the same no matter if I write the parquet file to a local disk or a Hdfs store
  • When I run an action on readDf I can see in the Spark UI that four tasks are created; when calling foreachPartition on readDf the function is executed four times
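
For context, a minimal sketch (not part of the original question) of how the read-side partition count for file sources can be influenced through Spark's file-splitting settings. The count is driven by the part-file sizes, spark.sql.files.maxPartitionBytes, spark.sql.files.openCostInBytes and the cluster's default parallelism, not by how the data was written, so the exact numbers below are illustrative only:

//hedged sketch: lowering the file-split settings typically yields more,
//smaller read partitions; the resulting count still depends on the sizes
//of the part files and on spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.files.maxPartitionBytes", "131072") //128 KB instead of the default 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", "4096")     //make small files "cheaper" to open

val tunedReadDf = spark.read.parquet("tmp/testds")
println("partitions with tuned file-split settings: " + tunedReadDf.rdd.partitions.length)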

Is there a way to preserve the original partitioning of the dataframe without calling repartition(100) again after reading the parquet file?

Background: in my actual application I write a lot of different datasets with carefully tuned partitioning, and I would like to restore those partitions without having to record separately for each dataframe what the partitioning looked like when it was written to disk.
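
For illustration only (not from the original question): one possible workaround, assuming the goal is merely to restore the partition count, is to derive that count from the number of part files on disk and repartition after reading. The helper below is hypothetical, assumes a local path as in the example above (for HDFS you would list files via the Hadoop FileSystem API instead), and restores only the number of partitions, not which rows land in which partition:

//hypothetical helper: count the parquet part files produced by the write
//and repartition the re-read dataframe back to that number
import java.io.File

def readWithOriginalPartitionCount(path: String) = {
  val partFileCount = new File(path)
    .list()
    .count(name => name.startsWith("part-") && name.endsWith(".parquet"))
  spark.read.parquet(path).repartition(partFileCount)
}

val restoredDf = readWithOriginalPartitionCount("tmp/testds")
println("partitions after restoring the count: " + restoredDf.rdd.partitions.length)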

I am using Spark 2.3.0.


Update: same result for Spark 2.4.6 and 3.0.0

Source: https://stackoverflow.com/questions/51090370/preserve-dataframe-partitioning-when-writing-and-re-reading-to-parquet-file
