Is there any Spark function that allows to split a collection into several RDDs according to some creteria? Such function would allow to avoid excessive itteration. For exam
Maybe something like this would work:
def singlePassMultiFilter[T](
rdd: RDD[T],
f1: T => Boolean,
f2: T => Boolean,
level: StorageLevel = StorageLevel.MEMORY_ONLY
): (RDD[T], RDD[T], Boolean => Unit) = {
val tempRDD = rdd mapPartitions { iter =>
val abuf1 = ArrayBuffer.empty[T]
val abuf2 = ArrayBuffer.empty[T]
for (x <- iter) {
if (f1(x)) abuf1 += x
if (f2(x)) abuf2 += x
}
Iterator.single((abuf1, abuf2))
}
tempRDD.persist(level)
val rdd1 = tempRDD.flatMap(_._1)
val rdd2 = tempRDD.flatMap(_._2)
(rdd1, rdd2, (blocking: Boolean) => tempRDD.unpersist(blocking))
}
Note that an action called on rdd1
(resp. rdd2
) will cause tempRDD to be computed and persisted. This is practically equivalent to computing rdd2
(resp. rdd1
) since the overhead of the flatMap
in the definitions of rdd1
and rdd2
are, I believe, going to be pretty negligible.
You would use singlePassMultiFitler
like so:
val (rdd1, rdd2, cleanUp) = singlePassMultiFilter(rdd, f1, f2)
rdd1.persist() //I'm going to need `rdd1` more later...
println(rdd1.count)
println(rdd2.count)
cleanUp(true) //I'm done with `rdd2` and `rdd1` has been persisted so free stuff up...
println(rdd1.distinct.count)
Clearly this could extended to an arbitrary number of filters, collections of filters, etc.