I am relatively new to Apache Spark and Python and was wondering whether something like what I am about to describe is doable.
I have an RDD of the form [m1, m2, m3, ..., mn] and I would like to turn it into an RDD of consecutive groups of k elements, i.e. [(m1, ..., mk), (mk+1, ..., m2k), ...]. Is this possible?
I assume you are using the PySpark API. I don't know if it's the best possible solution for this, but I think it can be done with zipWithIndex, groupBy, and a simple map.
# 3 - your grouping size k
# kv[1] - iterable of (char, idx) tuples belonging to one group
rdd = sc.parallelize(["a", "b", "c", "d", "e"]).zipWithIndex()\
    .groupBy(lambda pair: pair[1] // 3)\
    .map(lambda kv: tuple(char for char, idx in kv[1]))\
    .collect()
print(rdd)
outputs:
[('a', 'b', 'c'), ('d', 'e')]
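If you need this for an arbitrary chunk size, the same idea can be wrapped in a small helper. This is just a sketch; the name group_into_tuples is mine, not part of the Spark API, and the extra sorts are only there to keep the original element order within and across groups:

def group_into_tuples(rdd, k):
    # pair each element with its index, bucket consecutive indices by // k,
    # then rebuild each bucket as a tuple in the original order
    return (rdd.zipWithIndex()
               .groupBy(lambda pair: pair[1] // k)
               .sortByKey()
               .map(lambda kv: tuple(e for e, _ in sorted(kv[1], key=lambda p: p[1]))))

# usage:
# group_into_tuples(sc.parallelize(["a", "b", "c", "d", "e"]), 2).collect()
# -> [('a', 'b'), ('c', 'd'), ('e',)]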
UPD: Thanks to @Rohan Aletty, who corrected me.