How Can I Obtain an Element Position in Spark's RDD?


Essentially, RDD's zipWithIndex() method seems to do this, but it won't preserve the original ordering of the data the RDD was created from. You will at least get a stable ordering:

import org.apache.spark.rdd.RDD

val orig: RDD[String] = ...  // however the original RDD is created
val indexed: RDD[(String, Long)] = orig.zipWithIndex()

The reason you're unlikely to find something that preserves the order in the original data is buried in the API doc for zipWithIndex():

"Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions."

So it looks like the original order is discarded. If preserving the original order is important to you, you need to add the index before you create the RDD, as sketched below.
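A minimal sketch of that approach, assuming the data is available as a local collection on the driver (lines here is a hypothetical stand-in for however your data actually arrives):

import org.apache.spark.rdd.RDD

val lines: Seq[String] = Seq("x", "y", "z")  // hypothetical source data
val preIndexed: RDD[(String, Long)] =
  sc.parallelize(lines.zipWithIndex.map { case (s, i) => (s, i.toLong) })

Because each element carries its index from the start, any later repartitioning or shuffling can be undone with sortBy(_._2).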

I believe that in most cases zipWithIndex() will do the trick, and that it will preserve the order. Read the quoted documentation again: my understanding is that it means exactly that the order within the RDD is kept.

scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3)
scala> val r2 = r1.zipWithIndex
scala> r2.foreach(println)
(c,2)
(d,3)
(e,4)
(f,5)
(g,6)
(a,0)
(b,1)

The example above confirms it: the RDD has 3 partitions, and a gets index 0, b gets index 1, and so on.
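The scrambled print order above is just an artifact of foreach running on the executors in parallel; it says nothing about the indices themselves. Collecting to the driver shows they follow the original order:

scala> r2.collect().foreach(println)
(a,0)
(b,1)
(c,2)
(d,3)
(e,4)
(f,5)
(g,6)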
