Apache Spark RDD filter into two RDDs

Submitted by 大憨熊 on 2019-11-27 23:32:01
Marius Soutier

Spark doesn't support this by default. Filtering on the same data twice isn't that bad if you cache it beforehand, and the filtering itself is quick.
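A minimal sketch of that cache-then-filter-twice approach (the even/odd predicate is just an example, and it assumes an existing SparkContext sc):

import org.apache.spark.rdd.RDD

// Cache once so the two filter passes don't recompute the source.
val data: RDD[Int] = sc.parallelize(1 to 100).cache()

val evens = data.filter(_ % 2 == 0)
val odds  = data.filter(_ % 2 != 0)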

If it's really just two different types, you can use a helper method:

import org.apache.spark.rdd.RDD

implicit class RDDOps[T](rdd: RDD[T]) {
  // Splits an RDD into the elements that pass the predicate and those that don't.
  def partitionBy(f: T => Boolean): (RDD[T], RDD[T]) = {
    val passes = rdd.filter(f)
    val fails = rdd.filter(e => !f(e)) // Spark doesn't have filterNot
    (passes, fails)
  }
}

val (matches, matchesNot) = sc.parallelize(1 to 100).cache().partitionBy(_ % 2 == 0)

But as soon as you have multiple types of data, just assign each filtered RDD to a new val.

Spark's RDD API does not have such a method.

Here is a version based on a pull request for rdd.span that should work:

import scala.reflect.ClassTag
import org.apache.spark.rdd._

def split[T: ClassTag](rdd: RDD[T], p: T => Boolean): (RDD[T], RDD[T]) = {

    // Each partition is mapped to exactly two elements:
    // the iterator of passing elements, then the iterator of failing ones.
    val splits = rdd.mapPartitions { iter =>
        val (left, right) = iter.partition(p)
        Seq(left, right).iterator
    }

    // First element of each partition: the passing side.
    val left = splits.mapPartitions { iter => iter.next() }

    // Skip the passing side, keep the failing side.
    val right = splits.mapPartitions { iter =>
        iter.next()
        iter.next()
    }
    (left, right)
}

val rdd = sc.parallelize(0 to 10, 2)

val (first, second) = split[Int](rdd, _ % 2 == 0 )

first.collect
// Array[Int] = Array(0, 2, 4, 6, 8, 10)
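For completeness (not shown in the original), the complement should hold the odd values:

second.collect
// Array[Int] = Array(1, 3, 5, 7, 9)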

The point is, you do not want to do a filter, but a map.

(T) -> (Boolean, T)

Sorry, I am not fluent in Scala syntax. But the idea is that you split your answer set by mapping it to key/value pairs. The key can be a boolean indicating whether or not it passed the 'filter' predicate.

You can control output to different targets by doing partition-wise processing. Just make sure that you don't restrict parallel processing to just two partitions downstream.
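A minimal Scala sketch of that idea (the even/odd predicate and the variable names are illustrative, not from the original answer):

import org.apache.spark.rdd.RDD

val data: RDD[Int] = sc.parallelize(1 to 100)

// Tag every element with the predicate result instead of dropping anything.
val keyed: RDD[(Boolean, Int)] = data.map(x => (x % 2 == 0, x))

// Downstream, each side can be selected by its key (or routed per partition).
val passes = keyed.filter { case (hit, _) => hit }.values
val fails  = keyed.filter { case (hit, _) => !hit }.values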

See also How do I split an RDD into two or more RDDs?

Justin Pihony

If you are ok with a T instead of an RDD[T], then you can do this. Otherwise, you could maybe do something like this:

val data = sc.parallelize(1 to 100)

// Each partition becomes a single (passes, fails) pair of lists.
val splitData = data.mapPartitions { iter =>
  val splitList = iter.toList.partition(_ % 2 == 0)
  Tuple1(splitList).productIterator // wrap so mapPartitions yields one element
}.map(_.asInstanceOf[(List[Int], List[Int])])

Then you will probably need to reduce this down to merge the lists when you go to perform an action.
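A hedged sketch of that merge step (not part of the original answer; it simply concatenates the per-partition lists):

// Merge the per-partition (passes, fails) pairs into a single pair of lists.
val (evens, odds) = splitData.reduce { (a, b) =>
  (a._1 ++ b._1, a._2 ++ b._2)
}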

Priyanshu Ranjan

You can use the subtract function (if the filter operation is too expensive).

PySpark code:

rdd1 = data.filter(filterFunction)

rdd2 = data.subtract(rdd1)
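The same idea as a Scala sketch (the predicate is illustrative; note that subtract shuffles the data to compute the difference):

import org.apache.spark.rdd.RDD

val data: RDD[Int] = sc.parallelize(1 to 100)
val filterFunction = (x: Int) => x % 2 == 0 // stand-in predicate

val rdd1 = data.filter(filterFunction)
// Everything in data that is not in rdd1, i.e. the elements that failed the filter.
val rdd2 = data.subtract(rdd1)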