Does DataFrameWriter partitionBy shuffle the data?

Question

I have data partitioned one way, and I just want to partition it another way. So it is basically going to be something like this:

sqlContext.read().parquet("...").write().partitionBy("...").parquet("...")

I wonder whether this will trigger a shuffle or whether all the data will be repartitioned locally, because in this context a partition is just a directory in HDFS, and data from the same partition doesn't have to be on the same node to be written to the same directory in HDFS.

Answer 1:

Neither partitionBy nor bucketBy shuffles the data. There are cases, though, when repartitioning the data first can be a good idea:

df.repartition(...).write.partitionBy(...)

Otherwise the number of output files is bounded by the number of partitions * the cardinality of the partitioning column, because each write task creates a separate file for every partitioning-column value it holds.
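To make the file-count bound concrete, here is a rough sketch in plain Python. It is not Spark code; it only models the assumption that each write task emits one file per distinct partitioning value it holds. The names count_output_files and hash_repartition, the country column, and the task count of 8 are all made up for illustration.

```python
def count_output_files(tasks, key):
    # Assumed model: every task writes one file per distinct key value it holds.
    return sum(len({key(row) for row in rows}) for rows in tasks)


def hash_repartition(tasks, key, num_partitions):
    # Rough stand-in for df.repartition(col): rows sharing a key value
    # are all routed to the same task.
    out = [[] for _ in range(num_partitions)]
    for rows in tasks:
        for row in rows:
            out[hash(key(row)) % num_partitions].append(row)
    return out


def by_country(row):
    return row["country"]


num_tasks = 8
countries = ["DE", "FR", "JP", "US"]  # cardinality of the partitioning column: 4

# Hypothetical input: every task holds rows for every country.
tasks = [
    [{"country": countries[i % len(countries)], "value": i} for i in range(100)]
    for _ in range(num_tasks)
]

# Writing directly: each of the 8 tasks writes a file into each of the 4 directories.
files_without = count_output_files(tasks, by_country)

# Repartitioning by the column first: each country is handled by exactly one task.
files_with = count_output_files(
    hash_repartition(tasks, by_country, num_tasks), by_country
)

print(files_without, files_with)
```

Under this model the direct write reaches the upper bound (8 tasks * 4 values), while repartitioning by the same column first leaves one file per value. The price is the shuffle that repartition itself performs, so it pays off mainly when the number of small files would otherwise be a problem.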

Source: https://stackoverflow.com/questions/39805645/does-dataframewriter-partitionby-shuffle-the-data

Tags

apache-spark

Hadoop

apache-spark-sql

HDFS

partitioning
