Reduce a key-value pair into a key-list pair with Apache Spark

生来不讨喜 2020-11-27 14:21

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ...

9 answers
  •  孤城傲影
    2020-11-27 14:57

    I landed on this page while looking for a Java example of the same problem. If your case is similar, here is my example.

    The trick is that you need to group by key.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;
    
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.StreamSupport;
    
    public class SparkMRExample {
    
        public static void main(String[] args) {
            // spark context initialisation
            SparkConf conf = new SparkConf()
                    .setAppName("WordCount")
                    .setMaster("local");
            JavaSparkContext context = new JavaSparkContext(conf);
    
            // input for testing
            List<String> input = Arrays.asList("Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
                    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
                    "It has survived not only for centuries, but also the leap into electronic typesetting, remaining essentially unchanged.",
                    "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing");
            JavaRDD<String> inputRDD = context.parallelize(input);
    
    
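            // the map phase of the word count example
            // note: Spark 2.x and later expect an Iterator from flatMapToPair;
            // on Spark 1.x, which expected an Iterable, drop the final .iterator() call
            JavaPairRDD<String, Integer> mappedRDD =
                    inputRDD.flatMapToPair( line ->                      // for this input, each string is a line
                            Arrays.stream(line.split("\\s+"))            // splitting into words, converting into stream
                                    .map(word -> new Tuple2<>(word, 1))  // each word is assigned a count of 1
                                    .collect(Collectors.toList())        // stream to list
                                    .iterator());                        // list to iterator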
    
            // group the tuples by key
            // (String, Integer) -> (String, Iterable<Integer>)
            JavaPairRDD<String, Iterable<Integer>> groupedRDD = mappedRDD.groupByKey();
    
            // the reduce phase of the word count example
            // (String, Iterable<Integer>) -> (String, Integer)
            JavaRDD<Tuple2<String, Integer>> resultRDD =
                    groupedRDD.map(group ->                                      // input is a tuple (String, Iterable<Integer>)
                            new Tuple2<>(group._1,                              // the output key is same as input key
                            StreamSupport.stream(group._2.spliterator(), true)  // converting to stream
                                    .reduce(0, (f, s) -> f + s)));              // the sum of counts
            // collecting the RDD so that we can print
            List<Tuple2<String, Integer>> result = resultRDD.collect();
            // print each tuple
            result.forEach(System.out::println);
        }
    }
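
For readers who only want the key-to-list shape the question asks about, the grouping step can be pictured without a cluster. The following is a minimal plain-Java sketch, not Spark code: `GroupByKeySketch` and its `groupByKey` helper are hypothetical names introduced here for illustration, and the sketch assumes the pairs fit in memory. It shows the result that `groupByKey()` in the example above produces per key, except that Spark hands back the values as an `Iterable` and does not guarantee their order.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupByKeySketch {

    // Hypothetical helper (not part of Spark): collects (K, V) pairs into
    // K -> [V1, V2, ...], keeping first-seen key order and input value order.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> grouped = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // create the list on first sight of a key, then append the value
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("a", 1));
        pairs.add(new SimpleEntry<>("b", 2));
        pairs.add(new SimpleEntry<>("a", 3));

        System.out.println(groupByKey(pairs)); // prints {a=[1, 3], b=[2]}
    }
}
```

In the Spark version the equivalent intermediate value is `groupedRDD`; if only the grouped lists are needed, the reduce phase can simply be skipped.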
    
