DataFrame-ified zipWithIndex

悲哀的现实
2020-11-27 04:23

I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RDD.zipWithIndex.
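
The dfZipWithIndex helper mentioned in the answer below is an RDD-based helper from elsewhere in this thread and is not reproduced here. A minimal sketch of what such a helper could look like, assuming the Scala API (the default column name "id" is my own choice for illustration):

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Sketch of an RDD-based index helper: zip each row with its index via
    // RDD.zipWithIndex, then rebuild a DataFrame with an extra long column.
    def dfZipWithIndex(df: DataFrame, colName: String = "id"): DataFrame = {
      val rowsWithIndex = df.rdd.zipWithIndex.map { case (row, idx) =>
        Row.fromSeq(row.toSeq :+ idx)
      }
      val schemaWithIndex =
        StructType(df.schema.fields :+ StructField(colName, LongType, nullable = false))
      df.sqlContext.createDataFrame(rowsWithIndex, schemaWithIndex)
    }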

8 Answers
  •  攒了一身酷
    2020-11-27 04:32

    Starting in Spark 1.5, Window expressions were added to Spark. Instead of having to convert the DataFrame to an RDD, you can now use org.apache.spark.sql.functions.row_number over an org.apache.spark.sql.expressions.Window specification. Note that I found the above dfZipWithIndex to be significantly faster than the algorithm below. But I am posting it because:

    1. Someone else is going to be tempted to try this
    2. Maybe someone can optimize the expressions below

    At any rate, here's what works for me:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lit, row_number}
    
    df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
    

    Note that I use lit(1) for both the partitioning and the ordering -- this puts every row into a single partition, and it seems to preserve the original ordering of the DataFrame, but I suppose it is also what slows it way down.
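
    As a quick toy illustration (my own example, not from the original post, assuming a Spark 2.x+ session named spark), the expression assigns 1 through 3 across a three-row DataFrame:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lit, row_number}
    
    // All three rows land in the single constant partition, so row_num
    // takes the values 1, 2, 3.
    val small = spark.createDataFrame(Seq(("a", 10), ("b", 20), ("c", 30))).toDF("key", "value")
    small.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1)))).show()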

    I tested it on a 4-column DataFrame with 7,000,000 rows, and the speed difference between this and the above dfZipWithIndex is significant (like I said, the RDD-based function is much, much faster).
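
    To reproduce that kind of comparison yourself, a rough timing sketch might look like the following; it assumes an existing DataFrame df and the dfZipWithIndex helper sketched under the question, and uses count() to force each plan to run:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{lit, row_number}
    
    // Crude wall-clock timing; good enough for an order-of-magnitude comparison.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e9}%.1f s")
      result
    }
    
    time("RDD zipWithIndex")(dfZipWithIndex(df).count())
    time("row_number over constant window") {
      df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1)))).count()
    }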
