I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn)
into one Key-Multivalue pair (K, [V1, V2, ..., Vn]).
The error message stems from the type of 'a' in your closure.
My_KMV = My_KV.reduce(lambda a, b: a.append([b]))
Have pySpark evaluate a explicitly as a list: map each value into a one-element list first, then concatenate the lists per key. For instance,
My_KMV = My_KV.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda a, b: a + b)
In many cases, reduceByKey will be preferable to groupByKey; see: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
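For example, applying the fix above to a tiny made-up dataset (a minimal sketch; it assumes an existing SparkContext called sc, and the sample data is invented for illustration):

My_KV = sc.parallelize([("K", "V1"), ("K", "V2"), ("K", "V3")])
# wrap each value in a one-element list, then concatenate the lists per key
My_KMV = My_KV.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda a, b: a + b)
print(My_KMV.collect())  # e.g. [('K', ['V1', 'V2', 'V3'])]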
I hit this page while looking for a Java example of the same problem. (If your case is similar, here is my example.)
The trick is that you need to group by key.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

public class SparkMRExample {

    public static void main(String[] args) {
        // Spark context initialisation
        SparkConf conf = new SparkConf()
                .setAppName("WordCount")
                .setMaster("local");
        JavaSparkContext context = new JavaSparkContext(conf);

        // input for testing
        List<String> input = Arrays.asList(
                "Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
                "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.",
                "It has survived not only for centuries, but also the leap into electronic typesetting, remaining essentially unchanged.",
                "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing");
        JavaRDD<String> inputRDD = context.parallelize(input);

        // the map phase of the word count example
        JavaPairRDD<String, Integer> mappedRDD =
                inputRDD.flatMapToPair(line ->                          // for this input, each string is a line
                        Arrays.stream(line.split("\\s+"))               // split into words, convert to a stream
                                .map(word -> new Tuple2<>(word, 1))     // each word is assigned count 1
                                .collect(Collectors.toList()));         // stream to iterable

        // group the tuples by key
        // (String, Integer) -> (String, Iterable<Integer>)
        JavaPairRDD<String, Iterable<Integer>> groupedRDD = mappedRDD.groupByKey();

        // the reduce phase of the word count example
        // (String, Iterable<Integer>) -> (String, Integer)
        JavaRDD<Tuple2<String, Integer>> resultRDD =
                groupedRDD.map(group ->                                             // input is a tuple (String, Iterable<Integer>)
                        new Tuple2<>(group._1,                                       // the output key is the same as the input key
                                StreamSupport.stream(group._2.spliterator(), true)   // convert to a stream
                                        .reduce(0, (f, s) -> f + s)));               // the sum of counts

        // collect the RDD so that we can print
        List<Tuple2<String, Integer>> result = resultRDD.collect();

        // print each tuple
        result.forEach(System.out::println);
    }
}
Map and ReduceByKey
The input type and output type of reduce must be the same; therefore, if you want to aggregate into a list, you have to map
the input to lists. Afterwards, you combine the lists into one list.
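To see why the types have to match, here is a small plain-Python sketch (it uses functools.reduce outside of Spark, purely to illustrate the constraint):

from functools import reduce

values = [1, 2, 3]
# reduce(lambda a, b: a + [b], values) would fail on the first step: int + list
print(reduce(lambda a, b: a + b, [[v] for v in values]))  # map each value to a list first -> [1, 2, 3]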
Combining lists
You'll need a way to combine several lists into one. Python provides a few options.
append
modifies the first list and will always return None.
x = [1, 2, 3]
x.append([4, 5])
# x is [1, 2, 3, [4, 5]]
extend
also modifies the first list in place, but adds the elements of the other list instead of nesting it:
x = [1, 2, 3]
x.extend([4, 5])
# x is [1, 2, 3, 4, 5]
Both methods return None, but you need an expression that returns the combined list, so just use the plus operator.
x = [1, 2, 3] + [4, 5]
# x is [1, 2, 3, 4, 5]
Spark
file = spark.textFile("hdfs://...")
counts = (file.flatMap(lambda line: line.split(" "))
              .map(lambda actor: (actor.split(",")[0], actor))
              # transform each value into a one-element list
              .map(lambda nameTuple: (nameTuple[0], [nameTuple[1]]))
              # combine lists: [1, 2, 3] + [4, 5] becomes [1, 2, 3, 4, 5]
              .reduceByKey(lambda a, b: a + b))
CombineByKey
It's also possible to solve this with combineByKey, which is used internally to implement reduceByKey, but it's more complex and "using one of the specialized per-key combiners in Spark can be much faster". Your use case is simple enough for the solution above.
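For reference, a combineByKey version might look roughly like this (only a sketch; it assumes an existing SparkContext sc and made-up sample data):

pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 5)])
grouped = pairs.combineByKey(
    lambda v: [v],                   # createCombiner: start a new list from the first value
    lambda acc, v: acc + [v],        # mergeValue: add a value to the list for this key
    lambda acc1, acc2: acc1 + acc2)  # mergeCombiners: merge lists built on different partitions
print(grouped.collect())             # e.g. [('a', [1, 2]), ('b', [5])]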
GroupByKey
It's also possible to solve this with groupByKey, but it reduces parallelization and therefore could be much slower for big data sets.
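The groupByKey variant could look like this (again only a sketch, reusing the made-up pairs RDD from the combineByKey example above):

grouped = pairs.groupByKey().mapValues(list)  # groupByKey yields an iterable per key; mapValues(list) materialises it
print(grouped.collect())                      # e.g. [('a', [1, 2]), ('b', [5])]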
I'm kind of late to the conversation, but here's my suggestion:
>>> foo = sc.parallelize([(1, ('a','b')), (2, ('c','d')), (1, ('x','y'))])
>>> foo.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda p, q: p + q).collect()
[(1, [('a', 'b'), ('x', 'y')]), (2, [('c', 'd')])]
tl;dr If you really require an operation like this, use groupByKey as suggested by @MariusIon. Every other solution proposed here is either bluntly inefficient or at least suboptimal compared to direct grouping.

reduceByKey with list concatenation is not an acceptable solution because:
- It requires initialization of O(N) lists.
- Each application of + to a pair of lists requires a full copy of both lists (O(N)), effectively increasing the overall complexity to O(N²).
- It doesn't address any of the problems introduced by groupByKey. The amount of data that has to be shuffled, as well as the size of the final structure, is the same.
- Unlike what one of the answers suggests, there is no difference in the level of parallelism between implementations using reduceByKey and groupByKey.

combineByKey with list.extend is a suboptimal solution because:
- It creates O(N) list objects in MergeValue (this could be optimized by using list.append directly on the new item).
- If used with list.append, it is exactly equivalent to the old (Spark <= 1.3) implementation of groupByKey and ignores all the optimizations introduced by SPARK-3074, which enables external (on-disk) grouping of larger-than-memory structures.

OK, I hope I got this right. Your input is something like this:
kv_input = [("a", 1), ("a", 2), ("a", 3), ("b", 1), ("b", 5)]
and you want to get something like this:
kmv_output = [("a", [1, 2, 3]), ("b", [1, 5])]
Then this might do the job:
d = dict()
for k, v in kv_input:
    d.setdefault(k, list()).append(v)
kmv_output = list(d.items())
If I got this wrong, please tell me, so I might adjust this to your needs.
P.S.: a.append([b]) always returns None. You might want to observe either [b] or a, but not the result of append.