Preserve dataframe partitioning when writing and re-reading to parquet file


Question


When I write a dataframe with a defined partitioning to disk as parquet file and then re-read the parquet file again, the partitioning is lost. Is there a way to preserve the original partitioning of the dataframe during writing and re-reading?

The example code

//imports needed to run the example (e.g. in spark-shell, where `spark` is in scope)
import java.io.File
import org.apache.spark.sql.SaveMode
import spark.implicits._

//create a dataframe with 100 partitions and print the number of partitions
val originalDf = spark.sparkContext.parallelize(1 to 10000).toDF().repartition(100)
println("partitions before writing to disk: " + originalDf.rdd.partitions.length)

//write the dataframe to a parquet file and count the number of files actually written to disk
originalDf.write.mode(SaveMode.Overwrite).parquet("tmp/testds")
println("files written to disk: " + new File("tmp/testds").list.size)

//re-read the parquet file into a dataframe and print the number of partitions 
val readDf = spark.read.parquet("tmp/testds")
println("partitions after reading from disk: " + readDf.rdd.partitions.length)

prints out

partitions before writing to disk: 100
files written to disk: 202
partitions after reading from disk: 4

Observations:

  • The first number is the expected result, the dataframe consists of 100 partitions
  • The second number also looks right to me: there are 100 *.parquet files, 100 *.parquet.crc files, plus the _SUCCESS marker and its .crc file, so the data on disk still consists of 100 part files, one per original partition
  • The third line shows that after re-reading the parquet file the original partitioning is lost and the number of partitions has changed; the new number of partitions appears to be related to the number of executors in my Spark cluster (see the config sketch after this list)
  • The results are the same no matter if I write the parquet file to a local disk or a Hdfs store
  • When I run an action on readDf I can see in the Spark UI that four tasks are created; when calling foreachPartition on readDf the function is executed four times
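
For context, a minimal sketch (not part of the original question) of how the read-side partition count for file sources can be influenced through Spark's file-splitting settings. The count is driven by the part-file sizes, spark.sql.files.maxPartitionBytes, spark.sql.files.openCostInBytes and the cluster's default parallelism, not by how the data was written, so the exact numbers below are illustrative only:

//hedged sketch: lowering the file-split settings typically yields more,
//smaller read partitions; the resulting count still depends on the sizes
//of the part files and on spark.sparkContext.defaultParallelism
spark.conf.set("spark.sql.files.maxPartitionBytes", "131072") //128 KB instead of the default 128 MB
spark.conf.set("spark.sql.files.openCostInBytes", "4096")     //make small files "cheaper" to open

val tunedReadDf = spark.read.parquet("tmp/testds")
println("partitions with tuned file-split settings: " + tunedReadDf.rdd.partitions.length)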

Is there a way to preserve the original partitioning of the dataframe without calling repartition(100) again after reading the parquet file?

Background: in my actual application I write a lot of different datasets with carefully tuned partitioning, and I would like to restore those partitions without having to record separately for each dataframe what the partitioning looked like when it was written to disk.
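
For illustration only (not from the original question): one possible workaround, assuming the goal is merely to restore the partition count, is to derive that count from the number of part files on disk and repartition after reading. The helper below is hypothetical, assumes a local path as in the example above (for HDFS you would list files via the Hadoop FileSystem API instead), and restores only the number of partitions, not which rows land in which partition:

//hypothetical helper: count the parquet part files produced by the write
//and repartition the re-read dataframe back to that number
import java.io.File

def readWithOriginalPartitionCount(path: String) = {
  val partFileCount = new File(path)
    .list()
    .count(name => name.startsWith("part-") && name.endsWith(".parquet"))
  spark.read.parquet(path).repartition(partFileCount)
}

val restoredDf = readWithOriginalPartitionCount("tmp/testds")
println("partitions after restoring the count: " + restoredDf.rdd.partitions.length)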

I am using Spark 2.3.0.


Update: same result for Spark 2.4.6 and 3.0.0

Source: https://stackoverflow.com/questions/51090370/preserve-dataframe-partitioning-when-writing-and-re-reading-to-parquet-file
