How does Spark execute a join + filter? Is it scalable?
Question

Say I have two large RDDs, A and B, containing key-value pairs. I want to join A and B using the key, but of the pairs (a,b) that match, I only want a tiny fraction of "good" ones. So I do the join and apply a filter afterwards:

    A.join(B).filter(isGoodPair)

where isGoodPair is a boolean function that tells me whether a pair (a,b) is good or not.

For this to scale well, Spark's scheduler would ideally avoid materializing all pairs in A.join(B) explicitly. Even on a massively distributed basis, this could be prohibitively expensive, since the intermediate joined dataset may be vastly larger than the filtered result.
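For concreteness, here is a minimal runnable sketch of the pattern being asked about. The RDD contents, the body of isGoodPair, and the local master setting are invented for illustration; only the join-then-filter shape comes from the question:

    import org.apache.spark.{SparkConf, SparkContext}

    object JoinFilterSketch {
      // Hypothetical predicate: a matched pair is "good" if its values are close.
      def isGoodPair(pair: (Int, Int)): Boolean =
        math.abs(pair._1 - pair._2) < 5

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("join-filter-sketch").setMaster("local[*]"))

        // Small stand-ins for the two large key-value RDDs A and B.
        val a = sc.parallelize(Seq((1, 10), (2, 20), (3, 30)))
        val b = sc.parallelize(Seq((1, 12), (2, 95), (3, 28)))

        // join emits one (key, (aValue, bValue)) record per matching pair;
        // filter then discards the pairs that are not "good".
        val good = a.join(b).filter { case (_, pair) => isGoodPair(pair) }

        good.collect().foreach(println)  // (1,(10,12)) and (3,(30,28)) survive
        sc.stop()
      }
    }

Whether Spark materializes every joined pair before the filter runs, or streams each pair through the predicate as it is produced, is exactly what the question is asking.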