I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RDD.zipWithIndex.
Starting in Spark 1.5, Window expressions were added to Spark. Instead of having to convert the DataFrame to an RDD, you can now use org.apache.spark.sql.functions.row_number together with org.apache.spark.sql.expressions.Window.
Note that I found performance for the above dfZipWithIndex to be significantly faster than the algorithm below, but I am posting it anyway.
At any rate, here's what works for me:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

df.withColumn("row_num", row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1))))
Note that I use lit(1) for both the partitioning and the ordering. This forces everything into a single partition, and it seems to preserve the original ordering of the DataFrame, but I suspect it is also what slows it down so much.
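To make the behavior concrete, here is a small, hypothetical example of what the generated column looks like on a toy DataFrame (the column names and values are made up for illustration):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
import spark.implicits._  // assumes a spark-shell session where `spark` is in scope

val toyDf = Seq(("a", 10), ("b", 20), ("c", 30)).toDF("key", "value")

toyDf.withColumn("row_num", row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1)))).show()
// Expected output along these lines (row order with a constant sort key is not strictly guaranteed):
// +---+-----+-------+
// |key|value|row_num|
// +---+-----+-------+
// |  a|   10|      1|
// |  b|   20|      2|
// |  c|   30|      3|
// +---+-----+-------+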
I tested it on a 4-column DataFrame with 7,000,000 rows, and the speed difference between this and the above dfZipWithIndex is significant (as I said, the RDD-based function is much, much faster).
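For context, here is a minimal sketch of what an RDD-based zipWithIndex helper could look like. This is an assumption for illustration only; the actual dfZipWithIndex referenced above may differ in its signature and in where it places the index column:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: appends a 1-based Long index column using RDD.zipWithIndex.
def dfZipWithIndexSketch(df: DataFrame, colName: String = "row_num"): DataFrame = {
  // Pair every Row with its 0-based index, then append it as a 1-based value.
  val rowsWithIndex = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(row.toSeq :+ (idx + 1L))
  }
  // Extend the original schema with the new index column.
  val newSchema = StructType(df.schema.fields :+ StructField(colName, LongType, nullable = false))
  df.sqlContext.createDataFrame(rowsWithIndex, newSchema)
}

Because zipWithIndex works directly on the underlying RDD partitions, it avoids shuffling everything into a single partition, which is presumably why it is so much faster than the Window approach above.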