Dropping the first and last row of an RDD with Spark


One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:

// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()

// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
val result = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}

Do note that this might be rather costly in terms of performance (if you cache the RDD, you use up memory; if you don't, you read the RDD twice). So, if you have any way of identifying these records based on their contents (e.g. if you know that all records except these contain a certain pattern), using filter would probably be faster.
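As a rough illustration of that filter idea, suppose the RDD holds text lines and the first and last rows are marker lines starting with "#" (both of these are my own assumptions, not something stated in the question) -- then a single content-based pass needs neither count() nor caching:

// Hypothetical: the first/last rows are marker lines starting with "#".
// Adapt the predicate to whatever actually distinguishes your rows.
val cleaned = rdd.filter(line => !line.startsWith("#"))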

This might be a lighter version:

val rdd = sc.parallelize(Array(1,2,3,4,5,6), 3)
val partitions = rdd.getNumPartitions
val rddFirstLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // First partition: drop its first element
  if (idx == 0) iter.drop(1)
  // Last partition: sliding(2).map(_.head) emits every element except the last
  else if (idx == partitions - 1) iter.sliding(2).map(_.head)
  else iter
}

scala> rddFirstLast.collect()
res3: Array[Int] = Array(2, 3, 4, 5)
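One caveat with the mapPartitionsWithIndex approach: when the RDD has a single partition, the idx == 0 branch wins and the last row is never dropped, and when the last partition happens to contain only one element, sliding(2) still emits it as a partial group. A slightly more defensive sketch that covers both cases (the toSeq buffering and the variable names are just illustrative, not part of the original answer):

val trimmed = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Drop the first element of the first partition
  val noFirst = if (idx == 0) iter.drop(1) else iter
  // Drop the last element of the last partition; buffering this one
  // partition in memory keeps the edge cases simple
  if (idx == partitions - 1) noFirst.toSeq.dropRight(1).iterator
  else noFirst
}

Buffering only the last partition keeps the extra memory use bounded by that partition's size.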