Dropping the first and last row of an RDD with Spark


One way of doing this would be to zipWithIndex, and then filter out the records with indices 0 and count - 1:

// We're going to perform multiple actions on this RDD,
// so it's usually better to cache it so we don't read the file twice
rdd.cache()

// Unfortunately, we have to count() to be able to identify the last index
val count = rdd.count()
val result = rdd.zipWithIndex().collect {
  case (v, index) if index != 0 && index != count - 1 => v
}

Do note that this might be rather costly in terms of performance (if you cache the RDD, you use up memory; if you don't, you read the RDD twice). So, if you have any way of identifying these records based on their contents (e.g. if you know that all records except these contain a certain pattern), using filter would probably be faster.
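As a rough illustration of that filter idea, suppose the RDD holds text lines and the first and last rows are marker lines starting with "#" (both of these are my own assumptions, not something stated in the question) -- then a single content-based pass needs neither count() nor caching:

// Hypothetical: the first/last rows are marker lines starting with "#".
// Adapt the predicate to whatever actually distinguishes your rows.
val cleaned = rdd.filter(line => !line.startsWith("#"))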

This might be a lighter version:

val rdd = sc.parallelize(Array(1,2,3,4,5,6), 3)
val partitions = rdd.getNumPartitions
val rddFirstLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // First partition: drop its first element
  if (idx == 0) iter.drop(1)
  // Last partition: sliding(2).map(_.head) emits every element except the last
  else if (idx == partitions - 1) iter.sliding(2).map(_.head)
  else iter
}

scala> rddFirstLast.collect()
res3: Array[Int] = Array(2, 3, 4, 5)
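One caveat with the mapPartitionsWithIndex approach: when the RDD has a single partition, the idx == 0 branch wins and the last row is never dropped, and when the last partition happens to contain only one element, sliding(2) still emits it as a partial group. A slightly more defensive sketch that covers both cases (the toSeq buffering and the variable names are just illustrative, not part of the original answer):

val trimmed = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Drop the first element of the first partition
  val noFirst = if (idx == 0) iter.drop(1) else iter
  // Drop the last element of the last partition; buffering this one
  // partition in memory keeps the edge cases simple
  if (idx == partitions - 1) noFirst.toSeq.dropRight(1).iterator
  else noFirst
}

Buffering only the last partition keeps the extra memory use bounded by that partition's size.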