Spark RDD groupByKey + join vs join performance
Question
I am using Spark on a cluster that I share with other users, so running time alone is not a reliable way to tell which version of my code is more efficient. While I am running the more efficient code, someone else may be running a huge job that makes my code take longer to execute. So I would like to ask 2 questions here: I was using the join function to join 2 RDDs, and I am now trying to call groupByKey() before the join, like this: rdd1.groupByKey().join(rdd2). It seems that it