Reduce a key-value pair into a key-list pair with Apache Spark

Asked by 生来不讨喜 on 2020-11-27 14:21

I am writing a Spark application and want to combine a set of key-value pairs (K, V1), (K, V2), ..., (K, Vn) into one key-multivalue pair (K, [V1, V2, ..., Vn]).

9 Answers
  •  情书的邮戳
    2020-11-27 15:00

    Map and ReduceByKey

    The input and output types of reduce must be the same; therefore, if you want to aggregate values into a list, you first have to map each input value to a list. Afterwards you combine those lists into one.
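
    The same constraint shows up in plain Python's functools.reduce; a minimal sketch (standard library only, no Spark) of the wrap-then-concatenate pattern:

    from functools import reduce

    values = [1, 2, 3]
    # wrap each value in a one-element list so the accumulator and
    # each element share the same type (list)
    wrapped = [[v] for v in values]
    combined = reduce(lambda a, b: a + b, wrapped)
    # combined is [1, 2, 3]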

    Combining lists

    You'll need a way to combine two lists into one. Python provides several options.

    append modifies the first list in place and always returns None; it also adds its argument as a single element, so appending a list nests it:

    x = [1, 2, 3]
    x.append([4, 5])
    # x is [1, 2, 3, [4, 5]]
    

    extend also modifies the list in place and returns None, but it appends each element of its argument individually:

    x = [1, 2, 3]
    x.extend([4, 5])
    # x is [1, 2, 3, 4, 5]
    

    Both methods return None, but reduceByKey needs a function that returns the combined list, so just use the plus operator:

    x = [1, 2, 3] + [4, 5]
    # x is [1, 2, 3, 4, 5]
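
    The same concatenation is available as operator.add in the standard library, which can be passed anywhere a two-argument function is expected:

    import operator

    x = operator.add([1, 2, 3], [4, 5])
    # x is [1, 2, 3, 4, 5]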
    

    Spark

    # "spark" is assumed to be an existing SparkSession; textFile lives
    # on its SparkContext, and "hdfs://..." is a placeholder path
    file = spark.sparkContext.textFile("hdfs://...")
    counts = (file.flatMap(lambda line: line.split(" "))
              # key each record by its first comma-separated field
              .map(lambda actor: (actor.split(",")[0], actor))
              # transform each value into a one-element list
              .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
              # combine lists: [1, 2, 3] + [4, 5] becomes [1, 2, 3, 4, 5]
              .reduceByKey(lambda a, b: a + b))
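
    For a quick end-to-end check without HDFS, the same pipeline can be fed from an in-memory RDD; a minimal sketch (the sample data and app name are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kv-to-klist").getOrCreate()
    pairs = spark.sparkContext.parallelize([("K", "V1"), ("K", "V2"), ("K", "V3")])
    result = (pairs.map(lambda kv: (kv[0], [kv[1]]))   # (K, V) -> (K, [V])
                   .reduceByKey(lambda a, b: a + b))   # concatenate per key
    print(result.collect())  # [('K', ['V1', 'V2', 'V3'])]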
    

    CombineByKey

    It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex, and "using one of the specialized per-key combiners in Spark can be much faster". Your use case is simple enough for the solution above.
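
    For comparison, a sketch of the same aggregation with combineByKey, assuming the pairs RDD from the sketch above:

    result = pairs.combineByKey(
        lambda v: [v],             # createCombiner: start a list for a new key
        lambda acc, v: acc + [v],  # mergeValue: append within a partition
        lambda a, b: a + b,        # mergeCombiners: concatenate across partitions
    )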

    GroupByKey

    It's also possible to solve this with groupByKey, but it ships every value for a key across the network before grouping (there is no map-side combining), so it could be much slower for big data sets.
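
    For reference, the groupByKey form; it yields an iterable per key, so mapValues(list) materializes the groups:

    grouped = pairs.groupByKey().mapValues(list)
    # e.g. [('K', ['V1', 'V2', 'V3'])]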
