Partition RDD into tuples of length n

离开以前 2020-12-04 00:07

I am relatively new to Apache Spark and Python, and was wondering whether something like what I am going to describe is doable.

I have an RDD of the form [m1, m

3 Answers
  •  再見小時候 2020-12-04 00:31

    Olologin's answer almost has it, but I believe what you are trying to do is group your RDD into 3-tuples, not into 3 groups of tuples. To do the former, try the following:

    rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
    transformed = (rdd.zipWithIndex()                         # (element, index) pairs
                      .groupBy(lambda pair: pair[1] // 3)     # bucket indices 0-2, 3-5, ...
                      .map(lambda grp: tuple(elem[0] for elem in grp[1])))
    

    When run in pyspark, I get the following:

    >>> from __future__ import print_function    
    >>> rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
    >>> transformed = rdd.zipWithIndex().groupBy(lambda pair: pair[1] // 3).map(lambda grp: tuple(elem[0] for elem in grp[1]))
    >>> transformed.foreach(print)
    ...
    ('e4', 'e5', 'e6')
    ('e10',)
    ('e7', 'e8', 'e9')
    ('e1', 'e2', 'e3')
    
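    One caveat: foreach(print) runs on the executors, so the tuples come back in arbitrary order, and groupBy does not guarantee element order within a group either. A minimal sketch (assuming the same sc and rdd as above) of one way to get deterministically ordered tuples, by sorting on the indices that zipWithIndex already provides:

    ordered = (rdd.zipWithIndex()
                  .groupBy(lambda pair: pair[1] // 3)
                  .sortByKey()    # put the 3-tuples themselves in order
                  .map(lambda grp: tuple(e[0] for e in sorted(grp[1], key=lambda p: p[1]))))
    print(ordered.collect())
    # [('e1', 'e2', 'e3'), ('e4', 'e5', 'e6'), ('e7', 'e8', 'e9'), ('e10',)]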
