Partition RDD into tuples of length n

Backend · Open · 3 answers · 1924 views
离开以前 2020-12-04 00:07

I am relatively new to Apache Spark and Python, and I was wondering if something like what I am going to describe is doable?

I have an RDD of the form [m1, m2, …] and would like to partition it into tuples of length n.

3 Answers
  •  再見小時候
    2020-12-04 00:43

    I assume you are using the PySpark API. I don't know if this is the best possible solution, but I think it can be done with zipWithIndex, groupBy, and a simple map:

    # 3 - your grouping size k
    # kv[1] - iterable of (char, idx) tuples belonging to one group
    rdd = sc.parallelize(["a", "b", "c", "d", "e"]).zipWithIndex()\
            .groupBy(lambda pair: pair[1] // 3)\
            .map(lambda kv: tuple(char for char, idx in kv[1]))\
            .collect()
    print(rdd)
    

    outputs:

    [('a', 'b', 'c'), ('d', 'e')]
    

    UPD: Thanks to @Rohan Aletty, who corrected me.
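
    For a general group size, here is a minimal sketch wrapping the same idea in a reusable helper; group_into_tuples and its parameter n are names introduced for illustration, not part of the Spark API. Sorting each group by the original index preserves element order, which groupBy alone does not guarantee:

    from pyspark import SparkContext

    def group_into_tuples(rdd, n):
        # Pair each element with its position, bucket positions into
        # blocks of n, then re-sort each block by the original index so
        # the elements inside every tuple keep their source order.
        return (rdd.zipWithIndex()
                   .groupBy(lambda pair: pair[1] // n)
                   .map(lambda kv: tuple(x for x, _ in
                                         sorted(kv[1], key=lambda p: p[1]))))

    sc = SparkContext.getOrCreate()
    print(group_into_tuples(sc.parallelize(["a", "b", "c", "d", "e"]), 3).collect())
    # e.g. [('a', 'b', 'c'), ('d', 'e')] - note that the order of the
    # groups themselves in collect() is not guaranteed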
