Spark FlatMap function for huge lists

心在旅途 2020-12-06 08:45

I have a very basic question. Spark's flatMap function lets you emit 0, 1, or more outputs per input. So the (lambda) function you feed to flatMap should return a list?
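
For reference, a minimal PySpark sketch of that 0-or-more behavior (assuming an existing SparkContext named sc, as in the answer below):

    rdd = sc.parallelize([1, 2, 3])
    # Each input element may produce zero, one, or many outputs;
    # flatMap concatenates them all into a single RDD.
    rdd.flatMap(lambda x: [x] * x).collect()
    # [1, 2, 2, 3, 3, 3]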

1 Answer
  • 2020-12-06 09:21

    So the (lambda) function you feed to flatMap should return a list.

    No, it doesn't have to return a list. In practice you can easily use a lazy sequence. It is probably easier to see if you take a look at the Scala RDD.flatMap signature:

    flatMap[U](f: (T) ⇒ TraversableOnce[U])
    

    Since subclasses of TraversableOnce include SeqView and Stream, you can use a lazy sequence instead of a List. For example:

    // A view is lazy: the billion-element range is never
    // materialized, each pair is produced on demand.
    val rdd = sc.parallelize("foo" :: "bar" :: Nil)
    rdd.flatMap { x => (1 to 1000000000).view.map {
        _ => (x, scala.util.Random.nextLong)
    }}
    

    Since you've mentioned a lambda function, I assume you're using PySpark. The simplest thing you can do is to return a generator instead of a list:

    import numpy as np

    rdd = sc.parallelize(["foo", "bar"])
    # A generator expression is evaluated lazily; range is lazy
    # in Python 3 (on Python 2, use xrange).
    rdd.flatMap(lambda x: ((x, np.random.randint(1000)) for _ in range(100000000)))
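
    If the lambda gets unwieldy, the same idea can be written as a named generator function, since flatMap accepts any function that returns an iterable. A minimal sketch under the same assumptions (an existing sc, Python 3; the helper name random_pairs is just for illustration):

    import numpy as np

    def random_pairs(x):
        # Illustrative helper: lazily yield one (word, random int)
        # pair at a time; nothing is materialized up front.
        for _ in range(100000000):
            yield (x, np.random.randint(1000))

    rdd = sc.parallelize(["foo", "bar"])
    rdd.flatMap(random_pairs)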
    

    Since RDDs are lazily evaluated, it is even possible to return an infinite sequence from flatMap. Using a little bit of toolz power:

    from toolz.itertoolz import iterate

    def inc(x):
        return x + 1

    # iterate(inc, 0) lazily yields 0, 1, 2, ... forever;
    # take(1) stops after the first element, so this terminates.
    rdd.flatMap(lambda x: ((i, x) for i in iterate(inc, 0))).take(1)
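
    The same infinite-sequence trick works without the toolz dependency, using the standard library's itertools.count (a sketch under the same assumptions):

    from itertools import count

    # count() lazily yields 0, 1, 2, ... forever; take(1) pulls
    # a single element, so the action still terminates.
    rdd.flatMap(lambda x: ((i, x) for i in count())).take(1)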
    