I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
I hit this page while looking for a Java example of the same problem. (If your case is similar, here is my example.)
The trick is that you need to group by key.
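In a nutshell, groupByKey does exactly the (K, V) -> (K, [V1, V2, ..., Vn]) transformation asked about. A minimal sketch, where pairs is a placeholder for a JavaPairRDD<String, Integer> you already have:

JavaPairRDD<String, Iterable<Integer>> grouped = pairs.groupByKey(); // collects all values sharing a key into one Iterable

The complete word-count example below shows the pattern end to end.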
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;
public class SparkMRExample {

    public static void main(String[] args) {
        // Spark context initialisation
        SparkConf conf = new SparkConf()
                .setAppName("WordCount")
                .setMaster("local");
        JavaSparkContext context = new JavaSparkContext(conf);

        // input for testing
        List<String> input = Arrays.asList(
                "Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
                "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
                "It has survived not only for centuries, but also the leap into electronic typesetting, remaining essentially unchanged.",
                "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing");
        JavaRDD<String> inputRDD = context.parallelize(input);

        // the map phase of the word count example
        JavaPairRDD<String, Integer> mappedRDD =
                inputRDD.flatMapToPair(line ->                       // for this input, each string is a line
                        Arrays.stream(line.split("\\s+"))            // split the line into words
                                .map(word -> new Tuple2<>(word, 1))  // pair each word with a count of 1
                                .collect(Collectors.toList())        // collect the stream into a list
                                .iterator());                        // Spark 2.x expects an Iterator; on Spark 1.x, return the list itself

        // group the tuples by key
        // (String, Integer) -> (String, Iterable<Integer>)
        JavaPairRDD<String, Iterable<Integer>> groupedRDD = mappedRDD.groupByKey();

        // the reduce phase of the word count example
        // (String, Iterable<Integer>) -> (String, Integer)
        JavaRDD<Tuple2<String, Integer>> resultRDD =
                groupedRDD.map(group ->              // the input is a tuple (String, Iterable<Integer>)
                        new Tuple2<>(group._1,       // the output key is the same as the input key
                                StreamSupport.stream(group._2.spliterator(), true)  // convert the Iterable into a parallel stream
                                        .reduce(0, (f, s) -> f + s)));               // sum up the counts

        // collect the RDD so that we can print it
        List<Tuple2<String, Integer>> result = resultRDD.collect();
        // print each tuple
        result.forEach(System.out::println);

        context.close();
    }
}
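One note on the design choice: groupByKey shuffles every single (word, 1) pair across the network before anything is summed. If you only need the aggregate, not the intermediate (K, [V1, V2, ..., Vn]) form, reduceByKey merges values on each partition before the shuffle and is usually much cheaper. A minimal sketch, reusing mappedRDD from the example above:

// sums the counts per key without materialising the grouped Iterables
JavaPairRDD<String, Integer> counts = mappedRDD.reduceByKey((a, b) -> a + b);
counts.collect().forEach(System.out::println);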