Question
My data is partitioned one way, and I just want to partition it another way. So it is basically going to be something like this:
sqlContext.read().parquet("...").write().partitionBy("...").parquet("...")
I wonder whether this will trigger a shuffle, or whether all the data will be repartitioned locally. In this context a partition is just a directory in HDFS, and data from the same partition doesn't have to be on the same node to be written to the same directory in HDFS.
Answer 1:
Neither partitionBy nor bucketBy shuffles the data. There are cases, though, where repartitioning the data first can be a good idea:
df.repartition(...).write.partitionBy(...)
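Otherwise, the number of output files is bounded by the number of DataFrame partitions multiplied by the cardinality of the partitioning column, because each write task can produce a separate file for every partition value it contains.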
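To make that bound concrete, here is a toy model in plain Java. It does not use Spark at all. It assumes, as a simplification, that each write task emits exactly one file per distinct partition value it holds and ignores any file-splitting settings. The class and method names (OutputFileBound, countFiles, roundRobin, byKey) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

final class OutputFileBound {

    // Model assumption: a task writes one file per distinct partition value it holds.
    static int countFiles(List<List<String>> tasks) {
        int files = 0;
        for (List<String> task : tasks) {
            files += new HashSet<>(task).size();
        }
        return files;
    }

    // Row i goes to task i % numTasks: models a layout unrelated to the partition column.
    static List<List<String>> roundRobin(List<String> keys, int numTasks) {
        List<List<String>> tasks = emptyTasks(numTasks);
        for (int i = 0; i < keys.size(); i++) {
            tasks.get(i % numTasks).add(keys.get(i));
        }
        return tasks;
    }

    // Rows with the same key go to the same task: models repartitioning by the column first.
    static List<List<String>> byKey(List<String> keys, int numTasks) {
        List<List<String>> tasks = emptyTasks(numTasks);
        for (String key : keys) {
            tasks.get(Math.floorMod(key.hashCode(), numTasks)).add(key);
        }
        return tasks;
    }

    private static List<List<String>> emptyTasks(int numTasks) {
        List<List<String>> tasks = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            tasks.add(new ArrayList<>());
        }
        return tasks;
    }

    public static void main(String[] args) {
        String[] values = {"DE", "FR", "US"}; // cardinality 3
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 120; i++) {
            keys.add(values[i % values.length]);
        }
        // 4 tasks, every task sees all 3 values: 4 * 3 = 12 files.
        System.out.println(countFiles(roundRobin(keys, 4))); // prints 12
        // Each value lives in exactly one task: 3 files.
        System.out.println(countFiles(byKey(keys, 4)));      // prints 3
    }
}
```

In this model the unrelated layout hits the upper bound (tasks times cardinality), while grouping rows by the partition value beforehand brings the count down to one file per value. That is the effect the repartition call before write.partitionBy is meant to achieve, at the cost of a shuffle.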
Source: https://stackoverflow.com/questions/39805645/does-dataframewriter-partitionby-shuffle-the-data