Spark aggregate on multiple columns within partition without shuffle
I'm trying to aggregate a dataframe on multiple columns. I know that everything I need for the aggregation is within the partition; that is, there's no need for a shuffle because all of the data for the aggregation is local to the partition. Taking an example, if I have something like:

```scala
val sales = sc.parallelize(List(
  ("West",  "Apple",  2.0, 10),
  ("West",  "Apple",  3.0, 15),
  ("West",  "Orange", 5.0, 15),
  ("South", "Orange", 3.0, 9),
  ("South", "Orange", 6.0, 18),
  ("East",  "Milk",   5.0, 5)))
```
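One way to aggregate without a shuffle is `mapPartitions`, which lets each partition be reduced independently. The sketch below is a hypothetical illustration (the helper name `aggregatePartition` and the assumption that all rows of a `(region, product)` group live in the same partition are mine, not from the question); it folds a partition's rows into local sums of price and quantity:

```scala
// Hypothetical sketch: reduce one partition's rows locally, assuming all rows
// of a given (region, product) key reside in the same partition.
def aggregatePartition(
    rows: Iterator[(String, String, Double, Int)]
): Iterator[((String, String), (Double, Int))] = {
  rows.foldLeft(Map.empty[(String, String), (Double, Int)]) {
    case (acc, (region, product, price, qty)) =>
      // Accumulate the running (price sum, quantity sum) for this key.
      val (p, q) = acc.getOrElse((region, product), (0.0, 0))
      acc + ((region, product) -> (p + price, q + qty))
  }.iterator
}

// Applied to the RDD above, this runs entirely within each partition,
// producing per-partition partial aggregates with no shuffle:
// val perPartition = sales.mapPartitions(aggregatePartition)
```

Note that if a key's rows can span partitions, this yields partial results per partition and a final merge (and hence some shuffle) would still be needed.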