Ways to replace groupByKey in Apache Spark
Question: I would like to know the best way to replace the groupByKey operation with another one. Basically, I would like to obtain an RDD[(Int, List[Measure])]. My situation:

```scala
// consider measures as an RDD of Measure objects
measures.keyBy(_.getId)
  .groupByKey
```

My idea is to use reduceByKey instead, because it causes less shuffle:

```scala
measures.keyBy(_.getId)
  .mapValues(List(_))
  .reduceByKey(_ ++ _)
```

But I think this is very inefficient, because it forces me to instantiate tons of unnecessary List objects. Can anyone suggest another idea to avoid this?
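To make the two variants concrete, here is a minimal, self-contained sketch of both approaches described above. The Measure case class, its getId field, the sample data, and the local SparkContext setup are all assumptions for illustration; the original post does not show them:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical Measure type; the original post does not define it.
case class Measure(getId: Int, value: Double)

object GroupVsReduce {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("demo").setMaster("local[*]"))

    val measures = sc.parallelize(
      Seq(Measure(1, 1.0), Measure(1, 2.0), Measure(2, 3.0)))

    // Original approach: groupByKey shuffles every value across the network.
    val grouped = measures.keyBy(_.getId).groupByKey().mapValues(_.toList)

    // Proposed replacement: reduceByKey combines values map-side before the
    // shuffle, at the cost of allocating one single-element List per record.
    val reduced = measures.keyBy(_.getId).mapValues(List(_)).reduceByKey(_ ++ _)

    grouped.collect().foreach(println)
    reduced.collect().foreach(println)
    sc.stop()
  }
}
```

Both pipelines produce an RDD[(Int, List[Measure])] with the same contents; the difference is where the combining happens and how many intermediate objects are allocated along the way.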