Reduce a key-value pair into a key-list pair with Apache Spark


I am writing a Spark application and want to combine a set of key-value pairs (K, V1), (K, V2), ..., (K, Vn) into one key-multivalue pair (K, [V1, V2, ..., Vn]).

9 Answers
  • 2020-11-27 14:55

    The error message stems from the type of 'a' in your closure.

     My_KMV = My_KV.reduce(lambda a, b: a.append([b]))
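
    list.append mutates the list in place and returns None, so the result of that lambda is None rather than a combined list. A quick way to see this outside Spark (plain Python, no Spark objects involved):

    a = [1]
    print(a.append([2]))  # None -- append mutates in place and returns None
    print(a)              # [1, [2]] -- the new value is nested, not concatenated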
    

    Instead, make each value a list explicitly and let reduceByKey concatenate the lists. For instance,

    My_KMV = My_KV.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda a, b: a + b)
    

    In many cases, reduceByKey will be preferable to groupByKey, refer to: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

  • 2020-11-27 14:57

    I hit this page while looking for a Java example of the same problem. (If your case is similar, here is my example.)

    The trick is: you need to group by key.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.StreamSupport;
    
    public class SparkMRExample {
    
        public static void main(String[] args) {
            // spark context initialisation
            SparkConf conf = new SparkConf()
                    .setAppName("WordCount")
                    .setMaster("local");
            JavaSparkContext context = new JavaSparkContext(conf);
    
            //input for testing;
            List<String> input = Arrays.asList("Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
                    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
                    "It has survived not only for centuries, but also the leap into electronic typesetting, remaining essentially unchanged.",
                    "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing");
            JavaRDD<String> inputRDD = context.parallelize(input);
    
    
            // the map phase of word count example
            JavaPairRDD<String, Integer> mappedRDD =
                    inputRDD.flatMapToPair( line ->                      // for this input, each string is a line
                            Arrays.stream(line.split("\\s+"))            // splitting into words, converting into stream
                                    .map(word -> new Tuple2<>(word, 1))  // each word is assigned with count 1
                                    .collect(Collectors.toList()));      // stream to iterable
    
            // group the tuples by key
            // (String,Integer) -> (String, Iterable<Integer>)
            JavaPairRDD<String, Iterable<Integer>> groupedRDD = mappedRDD.groupByKey();
    
            // the reduce phase of word count example
            //(String, Iterable<Integer>) -> (String,Integer)
            JavaRDD<Tuple2<String, Integer>> resultRDD =
                    groupedRDD.map(group ->                                      //input is a tuple (String, Iterable<Integer>)
                            new Tuple2<>(group._1,                              // the output key is same as input key
                            StreamSupport.stream(group._2.spliterator(), true)  // converting to stream
                                    .reduce(0, (f, s) -> f + s)));              // the sum of counts
            //collecting the RRD so that we can print
            List<Tuple2<String, Integer>> result = resultRDD.collect();
            // print each tuple
            result.forEach(System.out::println);
        }
    }
    
  • 2020-11-27 15:00

    Map and ReduceByKey

    The input type and output type of reduce must be the same; therefore, if you want to aggregate values into a list, you have to map the input to lists first. Afterwards you combine the lists into one list.

    Combining lists

    You'll need a method to combine lists into one list. Python provides some methods to combine lists.

    append modifies the first list and will always return None.

    x = [1, 2, 3]
    x.append([4, 5])
    # x is [1, 2, 3, [4, 5]]
    

    extend also modifies the first list in place and returns None, but it adds the elements individually instead of nesting:

    x = [1, 2, 3]
    x.extend([4, 5])
    # x is [1, 2, 3, 4, 5]
    

    Both methods return None, but you need an operation that returns the combined list, so use the plus operator instead.

    x = [1, 2, 3] + [4, 5]
    # x is [1, 2, 3, 4, 5]
    

    Spark

    file = spark.textFile("hdfs://...")
    counts = (file.flatMap(lambda line: line.split(" "))
                  .map(lambda actor: (actor.split(",")[0], actor))
                  # transform each value into a single-element list
                  .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
                  # combine lists: [1, 2, 3] + [4, 5] becomes [1, 2, 3, 4, 5]
                  .reduceByKey(lambda a, b: a + b))
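
    To check the pattern quickly without an HDFS file, the same map + reduceByKey steps can be run on a small in-memory RDD (the data below is made up for illustration, and sc is assumed to be a SparkContext):

    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    pairs.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda a, b: a + b).collect()
    # e.g. [('a', [1, 2]), ('b', [3])]  (key order in collect() may vary)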
    

    CombineByKey

    It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex and "using one of the specialized per-key combiners in Spark can be much faster". Your use case is simple enough for the solution above.
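
    For reference, a minimal combineByKey sketch of the same aggregation (assuming pairs is an RDD of (key, value) tuples; the three arguments are createCombiner, mergeValue, and mergeCombiners):

    def to_list(v):          # createCombiner: start a list for the first value of a key
        return [v]

    def append(acc, v):      # mergeValue: add another value seen in the same partition
        acc.append(v)
        return acc

    def extend(a, b):        # mergeCombiners: merge lists built on different partitions
        a.extend(b)
        return a

    key_lists = pairs.combineByKey(to_list, append, extend)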

    GroupByKey

    It's also possible to solve this with groupByKey, but it reduces parallelization and therefore could be much slower for big data sets.
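
    For completeness, a sketch of that variant (again assuming pairs is an RDD of (key, value) tuples); groupByKey yields an iterable per key, so mapValues(list) turns it into a plain list:

    key_lists = pairs.groupByKey().mapValues(list)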

  • 2020-11-27 15:00

    I'm kind of late to the conversation, but here's my suggestion:

    >>> foo = sc.parallelize([(1, ('a','b')), (2, ('c','d')), (1, ('x','y'))])
    >>> foo.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda p, q: p + q).collect()
    [(1, [('a', 'b'), ('x', 'y')]), (2, [('c', 'd')])]
    
  • 2020-11-27 15:01

    tl;dr If you really require an operation like this, use groupByKey as suggested by @MariusIon. Every other solution proposed here is either bluntly inefficient or at least suboptimal compared to direct grouping.

    reduceByKey with list concatenation is not an acceptable solution because:

    • Requires initialization of O(N) lists.
    • Each application of + to a pair of lists requires a full copy of both lists (O(N)), effectively increasing the overall complexity to O(N²).
    • Doesn't address any of the problems introduced by groupByKey. The amount of data that has to be shuffled, as well as the size of the final structure, is the same.
    • Contrary to what one of the answers suggests, there is no difference in the level of parallelism between implementations using reduceByKey and groupByKey.

    combineByKey with list.extend is a suboptimal solution because:

    • Creates O(N) list objects in mergeValue (this could be optimized by using list.append directly on the new item).
    • If optimized with list.append, it is exactly equivalent to the old (Spark <= 1.3) implementation of groupByKey and ignores all the optimizations introduced by SPARK-3074, which enables external (on-disk) grouping of larger-than-memory structures.
  • 2020-11-27 15:03

    OK, I hope I got this right. Your input is something like this:

    kv_input = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 5)]
    

    and you want to get something like this:

    kmv_output = [("a", [1, 2, 3]), ("b", [1, 5])]
    

    Then this might do the job:

    d = dict()
    for k, v in kv_input:
        d.setdefault(k, list()).append(v)
    kmv_output = list(d.items())
    

    If I got this wrong, please tell me, so I might adjust this to your needs.

    P.S.: a.append([b]) always returns None. You might want to observe either [b] or a, but not the result of append.
