Question
When I write a dataframe with a defined partitioning to disk as a parquet file and then re-read the parquet file, the partitioning is lost. Is there a way to preserve the original partitioning of the dataframe across writing and re-reading?
The example code
import java.io.File
import org.apache.spark.sql.SaveMode
import spark.implicits._

//create a dataframe with 100 partitions and print the number of partitions
val originalDf = spark.sparkContext.parallelize(1 to 10000).toDF().repartition(100)
println("partitions before writing to disk: " + originalDf.rdd.partitions.length)
//write the dataframe to a parquet file and count the number of files actually written to disk
originalDf.write.mode(SaveMode.Overwrite).parquet("tmp/testds")
println("files written to disk: " + new File("tmp/testds").list.size)
//re-read the parquet file into a dataframe and print the number of partitions
val readDf = spark.read.parquet("tmp/testds")
println("partitions after reading from disk: " + readDf.rdd.partitions.length)
prints out
partitions before writing to disk: 100
files written to disk: 202
partitions after reading from disk: 4
Observations:
- The first number is the expected result, the dataframe consists of 100 partitions
- The second number also looks good to me: I get 100 *.parquet files, 100 *.parquet.crc files and two _SUCCESS files (100 + 100 + 2 = 202), so the parquet file still consists of 100 partitions
- The third line shows that after reading the parquet file again the original partitioning is lost and the number of partitions has changed. The number of partitions is related to the number of executors of my Spark cluster
- The results are the same whether I write the parquet file to a local disk or to an HDFS store
- When I run an action on readDf I can see in the Spark UI that four tasks are created; when calling foreachPartition on readDf the function is executed four times
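To clarify what I mean by the last observation: as far as I understand, the number of partitions after reading is driven by Spark's file-split settings (spark.sql.files.maxPartitionBytes, spark.sql.files.openCostInBytes and the default parallelism), not by the 100 part files themselves. A rough sketch of what I mean (the 64 KB value below is only an illustrative assumption, not something I rely on):

//sketch: shrinking the maximum split size changes the read partition count,
//but the resulting partitions are size-based, not the original repartition(100) layout
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024)
val readDfSmallSplits = spark.read.parquet("tmp/testds")
println("partitions with small maxPartitionBytes: " + readDfSmallSplits.rdd.partitions.length)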
Is there a way to preserve the original partitioning of the dataframe without calling repartition(100) again after reading the parquet file?
Background: in my actual application I write a lot of different datasets with carefully tuned partitionings, and I would like to restore these partitionings without having to record separately, for each dataframe, what the partitioning looked like when it was written to disk.
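To make the background more concrete, this is the kind of bookkeeping I am trying to avoid (writePartitioned and readPartitioned are hypothetical helper names of my own, not existing Spark API):

//hypothetical helpers: remember the partition count next to the data and repartition after reading
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

def writePartitioned(df: DataFrame, path: String)(implicit spark: SparkSession): Unit = {
  import spark.implicits._
  df.write.mode(SaveMode.Overwrite).parquet(path)
  //store the partition count in a small side file next to the parquet data
  Seq(df.rdd.partitions.length.toString).toDS().write.mode(SaveMode.Overwrite).text(path + "_numPartitions")
}

def readPartitioned(path: String)(implicit spark: SparkSession): DataFrame = {
  val numPartitions = spark.read.textFile(path + "_numPartitions").first().toInt
  spark.read.parquet(path).repartition(numPartitions)
}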
I am using Spark 2.3.0.
Update: same result for Spark 2.4.6 and 3.0.0
Source: https://stackoverflow.com/questions/51090370/preserve-dataframe-partitioning-when-writing-and-re-reading-to-parquet-file