Does DataFrameWriter partitionBy shuffle the data?

Submitted by 吃可爱长大的小学妹 on 2020-01-15 04:27:11

Question


I have data partitioned one way, and I want to partition it another way. So it is basically going to be something like this:

sqlContext.read().parquet("...").write().partitionBy("...").parquet("...")

I wonder whether this will trigger a shuffle, or whether all the data will be repartitioned locally. In this context a partition is just a directory in HDFS, and data from the same partition doesn't have to be on the same node to be written to the same directory in HDFS.


Answer 1:


Neither partitionBy nor bucketBy shuffles the data. There are cases, though, when repartitioning the data first can be a good idea:

df.repartition(...).write.partitionBy(...)

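Otherwise, every task can write its own file for each value of the partitioning column it holds, so the number of output files is bounded above by (number of DataFrame partitions) * (cardinality of the partitioning column).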
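To make the file-count bound concrete, here is a rough pure-Python simulation. It is not Spark code: it only models the assumption that each write task emits one file per distinct partition-column value it holds. The helper names count_output_files and repartition_by_key are hypothetical, and the hash-based placement only loosely mimics what repartitioning by the column would do.

```python
def count_output_files(task_partitions):
    """Assumed model: each task writes one file per distinct key it holds."""
    return sum(len(set(keys)) for keys in task_partitions)


def repartition_by_key(task_partitions, num_partitions):
    """Mimic a shuffle by key: all rows with the same key end up in one task."""
    shuffled = [[] for _ in range(num_partitions)]
    for keys in task_partitions:
        for key in keys:
            shuffled[hash(key) % num_partitions].append(key)
    return shuffled


num_tasks, cardinality = 8, 5

# Each row is represented only by its partition-column value.
# Every task holds 100 rows that cover all 5 values.
tasks = [[(i + j) % cardinality for j in range(100)] for i in range(num_tasks)]

# No shuffle before the write: every task writes into every directory.
files_without = count_output_files(tasks)

# Shuffle by the partition column first: each value lives in a single task.
files_with = count_output_files(repartition_by_key(tasks, num_tasks))

print(files_without)  # 40 = 8 tasks * 5 values
print(files_with)     # 5  = one file per value
```

In this toy setup the write without a prior shuffle hits the upper bound of 8 * 5 = 40 files, while shuffling by the partition column first brings it down to one file per value. The trade-off is that the repartition step is itself a shuffle, which the plain partitionBy write avoids.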



Source: https://stackoverflow.com/questions/39805645/does-dataframewriter-partitionby-shuffle-the-data
