I am relatively new to Apache Spark and Python and was wondering whether something like what I am about to describe is doable.
I have an RDD of the form [m1, m2, m3, ..., mn] and I would like to turn it into an RDD of consecutive groups of k elements, i.e. [(m1, ..., mk), (mk+1, ..., m2k), ...]. Is this possible?
I assume you are using the PySpark API. I don't know if it's the best possible solution for this, but I think it can be done with zipWithIndex, groupBy, and a simple map.
# 3 - your grouping size k
# kv[1] - iterable of (char, idx) tuples belonging to one group
rdd = sc.parallelize(["a", "b", "c", "d", "e"]).zipWithIndex()\
    .groupBy(lambda pair: pair[1] // 3)\
    .map(lambda kv: tuple(char for char, idx in kv[1]))\
    .collect()
print(rdd)
outputs:
[('a', 'b', 'c'), ('d', 'e')]
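If you need this for an arbitrary chunk size, the same idea can be wrapped in a small helper. This is just a sketch; the name group_into_tuples is mine, not part of the Spark API, and the extra sorts are only there to keep the original element order within and across groups:

def group_into_tuples(rdd, k):
    # pair each element with its index, bucket consecutive indices by // k,
    # then rebuild each bucket as a tuple in the original order
    return (rdd.zipWithIndex()
               .groupBy(lambda pair: pair[1] // k)
               .sortByKey()
               .map(lambda kv: tuple(e for e, _ in sorted(kv[1], key=lambda p: p[1]))))

# usage:
# group_into_tuples(sc.parallelize(["a", "b", "c", "d", "e"]), 2).collect()
# -> [('a', 'b'), ('c', 'd'), ('e',)]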
UPD: Thanks to @Rohan Aletty, who corrected me.