Scala Spark: Split collection into several RDD?

轻奢々  2020-12-06 07:14

Is there any Spark function that allows splitting a collection into several RDDs according to some criteria? Such a function would allow avoiding excessive iteration. For example…
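
The original example is cut off above; purely for context, here is a hedged sketch of the kind of two-pass pattern being described (the names rdd, f1, and f2 are placeholders, not from the post), where each filter scans the data independently:

    import org.apache.spark.rdd.RDD

    // Hypothetical naive approach: each filter is its own full pass over the data.
    def naiveSplit[T](rdd: RDD[T], f1: T => Boolean, f2: T => Boolean): (RDD[T], RDD[T]) = {
      val rdd1 = rdd.filter(f1)   // first scan when an action runs on rdd1
      val rdd2 = rdd.filter(f2)   // second, independent scan for rdd2
      (rdd1, rdd2)
    }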

2 Answers
  •  轻奢々 (OP)  2020-12-06 08:03

    Maybe something like this would work:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel
    import scala.collection.mutable.ArrayBuffer
    import scala.reflect.ClassTag

    def singlePassMultiFilter[T: ClassTag](
        rdd: RDD[T],
        f1: T => Boolean,
        f2: T => Boolean,
        level: StorageLevel = StorageLevel.MEMORY_ONLY
    ): (RDD[T], RDD[T], Boolean => Unit) = {
      // Single pass per partition: bucket each element under every predicate it satisfies.
      val tempRDD = rdd mapPartitions { iter =>
        val abuf1 = ArrayBuffer.empty[T]
        val abuf2 = ArrayBuffer.empty[T]
        for (x <- iter) {
          if (f1(x)) abuf1 += x
          if (f2(x)) abuf2 += x
        }
        Iterator.single((abuf1, abuf2))
      }
      // Persist the bucketed partitions so both derived RDDs reuse the same single pass.
      tempRDD.persist(level)
      val rdd1 = tempRDD.flatMap(_._1)
      val rdd2 = tempRDD.flatMap(_._2)
      // Return both RDDs plus a callback to unpersist tempRDD once it is no longer needed.
      (rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
    }
    

    Note that an action called on rdd1 (resp. rdd2) will cause tempRDD to be computed and persisted. This is practically equivalent to computing rdd2 (resp. rdd1), since the overhead of the flatMap in the definitions of rdd1 and rdd2 is, I believe, pretty negligible.

    You would use singlePassMultiFilter like so:

    val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
    rdd1.persist()    //I'm going to need `rdd1` more later...
    println(rdd1.count)  
    println(rdd2.count) 
    cleanUp(true)     //I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
    println(rdd1.distinct.count)
    

    Clearly this could be extended to an arbitrary number of filters, collections of filters, etc.
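
    For instance, a hedged sketch of one such extension taking an arbitrary sequence of predicates (same imports as above; illustrative, not from the original answer):

    def singlePassMultiFilterN[T: ClassTag](
        rdd: RDD[T],
        filters: Seq[T => Boolean],
        level: StorageLevel = StorageLevel.MEMORY_ONLY
    ): (Seq[RDD[T]], Boolean => Unit) = {
      val n = filters.length
      // One pass per partition, bucketing each element under every predicate it satisfies.
      val tempRDD = rdd mapPartitions { iter =>
        val bufs = Array.fill(n)(ArrayBuffer.empty[T])
        for (x <- iter; i <- 0 until n if filters(i)(x)) bufs(i) += x
        Iterator.single(bufs)
      }
      tempRDD.persist(level)
      // One derived RDD per predicate, all backed by the same persisted pass.
      val rdds = (0 until n).map(i => tempRDD.flatMap(_(i)))
      (rdds, (blocking: Boolean) => tempRDD.unpersist(blocking))
    }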
