When creating two different Spark pair RDDs with the same key set, will Spark distribute partitions with the same keys to the same machine?

Question

I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce data shuffling, I would like to add a pre-distribution phase so that partitions with the same keys are placed on the same machine. Hopefully this would reduce some of the shuffle time.

I want to know whether Spark is smart enough to do this for me, or whether I have to implement this logic myself.

I know that when I join two RDDs, one of which has been preprocessed with partitionBy, Spark is smart enough to use this information and shuffle only the other RDD. But I don't know what will happen if I use partitionBy on both RDDs and then do the join.

Answer 1:

If you use the same partitioner for both RDDs, you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same node.

Nevertheless, performance should be better than if the two RDDs had different partitioners, because the join can reuse the existing partitioning instead of shuffling either RDD again.
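To make the co-partitioning idea concrete, here is a minimal sketch, assuming a local master and tiny in-memory sample data. The object name, method name, and the partition count of 4 are made up for illustration. It partitions both RDDs with one shared HashPartitioner and then joins them.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; names and sample data are illustrative only.
object CoPartitionedJoin {

  // Partition both inputs with the same partitioner, then join them.
  def coPartitionedJoin(
      left: RDD[(Int, String)],
      right: RDD[(Int, Double)],
      numPartitions: Int): RDD[(Int, (String, Double))] = {
    val partitioner = new HashPartitioner(numPartitions)

    // Each partitionBy shuffles its RDD once; persist keeps the
    // partitioned data around so later actions do not redo that shuffle.
    val leftPartitioned = left.partitionBy(partitioner).persist()
    val rightPartitioned = right.partitionBy(partitioner).persist()

    // Both sides now report the same partitioner, so the join can
    // reuse it instead of repartitioning either side.
    leftPartitioned.join(rightPartitioned)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running the sketch on one machine.
    val conf = new SparkConf().setAppName("co-partitioned-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val left = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
      val right = sc.parallelize(Seq(1 -> 10.0, 2 -> 20.0, 3 -> 30.0))
      val joined = coPartitionedJoin(left, right, numPartitions = 4)

      println(joined.partitioner)   // the partitioner carried by the join result
      println(joined.toDebugString) // lineage, useful for spotting shuffle boundaries
      joined.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The sketch only guarantees that matching keys land in the same partition index on both sides. Whether the matching partitions end up on the same executor is still decided by the scheduler, which is the co-partitioned versus co-located distinction made above.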

Answer 2:

I came across this passage, "Speeding Up Joins by Assigning a Known Partitioner", which helps explain the effect of using the same partitioner for both RDDs:

Speeding Up Joins by Assigning a Known Partitioner
If you have to do an operation before the join that requires a shuffle, such as aggregateByKey or reduceByKey, you can prevent the shuffle by adding a hash partitioner with the same number of partitions as an explicit argument to the first operation and persisting the RDD before the join.
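The quoted advice could look roughly like the following sketch. The object and method names, the sample data, and the local master are assumptions for illustration, not taken from the quoted text. The aggregation receives an explicit HashPartitioner, its result is persisted, and the join is then given the same partitioner.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; names and sample data are illustrative only.
object KnownPartitionerJoin {

  // Sum the scores per key with an explicit partitioner, then join with labels.
  def totalsWithLabels(
      scores: RDD[(String, Int)],
      labels: RDD[(String, String)]): RDD[(String, (Int, String))] = {
    // Same number of partitions as the RDD we are going to join with.
    val partitioner = new HashPartitioner(labels.getNumPartitions)

    // reduceByKey has to shuffle anyway; passing the partitioner here means
    // the result is already laid out the way the join needs it.
    val totals = scores
      .reduceByKey(partitioner, (x: Int, y: Int) => x + y)
      .persist()

    // totals already uses this partitioner, so only labels is shuffled.
    totals.join(labels, partitioner)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running the sketch on one machine.
    val conf = new SparkConf().setAppName("known-partitioner-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val scores = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      val labels = sc.parallelize(Seq(("a", "x"), ("b", "y")))
      totalsWithLabels(scores, labels).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If labels were also partitioned with the same partitioner beforehand, neither side would need a further shuffle at join time.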

Source: https://stackoverflow.com/questions/34368202/when-create-two-different-spark-pair-rdd-with-same-key-set-will-spark-distribut

Tags

scala

join

apache-spark

rdd