Spark dataframe reduceByKey

Submitted by 左心房为你撑大大i on 2019-12-05 18:44:24

To preserve the partitioning already achieved, re-use the parent RDD's partitioner in the reduceByKey invocation (the key/value column names below are illustrative):

 // DataFrame.rdd yields RDD[Row]; map it to a (key, value) pair RDD first
 val rdd = df.rdd.map(row => (row.getAs[String]("key"), row.getAs[Long]("value")))
 val parentRdd = rdd.dependencies(0).rdd // Assuming the first parent has the
                                         // desired partitioning: adjust as needed
 val parentPartitioner = parentRdd.partitioner
   .getOrElse(sys.error("parent RDD has no partitioner"))
 val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)
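One way to confirm that the shuffle was actually avoided is to inspect the lineage: when the partitioner passed to reduceByKey matches the RDD's own, Spark applies the reduce via mapPartitions instead of building a ShuffledRDD. A quick check, assuming the variables from the snippet above:

 // If the partitioner was re-used, no ShuffledRDD stage should
 // appear in the printed lineage
 println(optimizedReducedRdd.toDebugString)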

If you do not specify the partitioner, as in:

 rdd.reduceByKey(reduceFn)  // non-optimized: falls back to a full shuffle

then the behavior you noted occurs, i.e. a full shuffle. That is because Spark falls back to its default partitioner, which for an RDD with no existing partitioner means a fresh HashPartitioner and a repartitioning of all the data.
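For a self-contained illustration of the difference, here is a minimal sketch (the object name, sample data, and partition count are all illustrative, not part of the original question). It builds a pair RDD that is already hash-partitioned, then reduces it once re-using the existing partitioner and once on a derived RDD that has lost it:

 import org.apache.spark.HashPartitioner
 import org.apache.spark.sql.SparkSession

 object PartitionerReuseSketch {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession.builder()
       .appName("PartitionerReuseSketch")
       .master("local[*]")
       .getOrCreate()

     val reduceFn: (Long, Long) => Long = _ + _

     // A pair RDD that is already hash-partitioned by key
     val pairRdd = spark.sparkContext
       .parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))
       .partitionBy(new HashPartitioner(4))

     // Re-using the existing partitioner: the data is already co-located
     // by key, so reduceByKey can skip the extra shuffle stage
     val reused = pairRdd.reduceByKey(pairRdd.partitioner.get, reduceFn)

     // map(identity) drops the partitioner; reduceByKey then falls back
     // to a default HashPartitioner and performs a full shuffle
     val shuffled = pairRdd.map(identity).reduceByKey(reduceFn)

     println(reused.collect().toList)   // e.g. List((a,4), (b,2))
     println(shuffled.collect().toList)
     spark.stop()
   }
 }

Running this locally, the Spark UI (or toDebugString on each result) should show a shuffle stage only for the second reduceByKey.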
