Reduce a key-value pair into a key-list pair with Apache Spark


I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
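
For concreteness, here is a minimal sketch of the shape of the data involved; the keys and values below are illustrative placeholders, not taken from the question:

    # Input: a collection of key-value pairs with repeated keys.
    pairs = [("k1", 1), ("k1", 2), ("k2", 3), ("k1", 4)]

    # Desired output: one key-multivalue pair per distinct key.
    expected = [("k1", [1, 2, 4]), ("k2", [3])]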

9 Answers

    tl;dr If you really require an operation like this, use groupByKey as suggested by @MariusIon. Every other solution proposed here is either bluntly inefficient or at least suboptimal compared to direct grouping.
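
    A minimal PySpark sketch of the direct grouping recommended above; the SparkContext setup and RDD contents are illustrative assumptions rather than part of the original answer:

        from pyspark import SparkContext

        sc = SparkContext(appName="group-by-key-demo")

        pairs = sc.parallelize([("k1", 1), ("k1", 2), ("k2", 3), ("k1", 4)])

        # groupByKey yields an iterable of values per key;
        # mapValues(list) materializes it into a plain Python list.
        grouped = pairs.groupByKey().mapValues(list)

        print(sorted(grouped.collect()))
        # e.g. [('k1', [1, 2, 4]), ('k2', [3])] (value order within a key is not guaranteed)

    The mapValues(list) step is only needed if downstream code expects a real list rather than the ResultIterable that groupByKey returns.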

    reduceByKey with list concatenation (sketched after this list) is not an acceptable solution because:

    • Requires initialization of O(N) lists.
    • Each application of + to a pair of lists requires a full copy of both lists (O(N)), effectively increasing the overall complexity to O(N²).
    • Doesn't address any of the problems introduced by groupByKey. The amount of data that has to be shuffled, as well as the size of the final structure, is the same.
    • Contrary to what one of the answers suggests, there is no difference in the level of parallelism between implementations using reduceByKey and groupByKey.
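
    For reference, the concatenation pattern criticized above looks roughly like this (a sketch only, reusing the pairs RDD from the groupByKey snippet above; not a recommendation):

        # Anti-pattern: list concatenation inside reduceByKey.
        # Every value is first wrapped in a one-element list, and each `+`
        # copies both operand lists, so the per-key work grows to O(N^2).
        concatenated = (pairs
                        .map(lambda kv: (kv[0], [kv[1]]))
                        .reduceByKey(lambda a, b: a + b))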

    combineByKey with list.extend (sketched after this list) is a suboptimal solution because:

    • Creates O(N) list objects in mergeValue (this could be optimized by using list.append directly on the new item).
    • If optimized with list.append, it is exactly equivalent to the old (Spark <= 1.3) implementation of groupByKey and ignores all the optimizations introduced by SPARK-3074, which enabled external (on-disk) grouping of larger-than-memory structures.
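
    A rough sketch of the combineByKey pattern discussed above, together with the list.append variant (again reusing the pairs RDD from the groupByKey snippet; the function names are illustrative):

        # combineByKey with list.extend: mergeValue allocates a throwaway
        # one-element list for every single value.
        def create_combiner(v):
            return [v]

        def merge_value_extend(acc, v):
            acc.extend([v])   # temporary one-element list per value
            return acc

        def merge_value_append(acc, v):
            acc.append(v)     # avoids the temporary list, but see the caveat above
            return acc

        def merge_combiners(a, b):
            a.extend(b)
            return a

        combined = pairs.combineByKey(create_combiner, merge_value_extend, merge_combiners)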
