How does the distinct() function work in Spark?
Question:

I'm a newbie to Apache Spark and am learning the basic functionality, and I have a small doubt. Suppose I have an RDD of (key, value) tuples and want to obtain the unique ones. I use the distinct() function. On what basis does the function consider tuples to be distinct? Is it based on the keys, the values, or both?

Answer 1:

.distinct() definitely performs a shuffle across partitions. To see more of what's happening, call .toDebugString on your RDD.

val hashPart = new HashPartitioner( )
val myRDDPreStep =
val myRDD =
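
To make the behaviour asked about in the question concrete, here is a minimal sketch, assuming a local Spark installation. In the RDD API, distinct() compares whole elements using their equals and hashCode, so for (key, value) tuples both the key and the value have to match for two records to count as duplicates. Internally, RDD.distinct() maps each element to a pair keyed by the element itself, runs reduceByKey, and maps back, which is where the shuffle comes from. The object name DistinctOnTuples, the helper distinctPairs, and the sample data are made up for this illustration and are not from the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctOnTuples {

  // Hypothetical helper: builds a small pair RDD, deduplicates it with
  // distinct(), and returns the result sorted so the order is stable.
  def distinctPairs(sc: SparkContext): Seq[(String, Int)] = {
    // ("a", 1) appears twice; ("a", 2) shares the key but not the value.
    val pairs = sc.parallelize(
      Seq(("a", 1), ("a", 1), ("a", 2), ("b", 1)),
      numSlices = 2
    )

    // distinct() compares the whole tuple, not just the key.
    val unique = pairs.distinct()

    // Print the lineage; it should include a shuffled stage.
    println(unique.toDebugString)

    unique.collect().sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local mode with two threads is an assumption for this sketch.
    val conf = new SparkConf()
      .setAppName("distinct-on-tuples")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // prints: (a,1), (a,2), (b,1)
      println(distinctPairs(sc).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Only the exact duplicate ("a", 1) is dropped; ("a", 2) survives because its value differs. To keep one record per key instead, distinct() is not enough on its own; something like reduceByKey((first, _) => first) would be one option, although which value is kept per key is then not guaranteed.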