DataFrame-ified zipWithIndex

悲哀的现实 2020-11-27 04:23

I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RD

8 Answers
  •  不知归路
    2020-11-27 04:38

    The following was posted on behalf of David Griffin (edited out of the question).

    The all-singing, all-dancing dfZipWithIndex method. You can set the starting offset (which defaults to 1), the index column name (defaults to "id"), and place the column in the front or the back:

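    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.{LongType, StructField, StructType}
    import org.apache.spark.sql.Row

    // Adds a sequential Long index column to a DataFrame via RDD.zipWithIndex.
    def dfZipWithIndex(
      df: DataFrame,
      offset: Int = 1,          // value assigned to the first row
      colName: String = "id",   // name of the new index column
      inFront: Boolean = true   // true: index is the first column; false: the last
    ) : DataFrame = {
      df.sqlContext.createDataFrame(
        // Pair each row with its 0-based position, then splice the shifted index into the row.
        df.rdd.zipWithIndex.map(ln =>
          Row.fromSeq(
            (if (inFront) Seq(ln._2 + offset) else Seq())
              ++ ln._1.toSeq ++
            (if (inFront) Seq() else Seq(ln._2 + offset))
          )
        ),
        // Extend the original schema with a non-nullable LongType field on the same side.
        StructType(
          (if (inFront) Array(StructField(colName,LongType,false)) else Array[StructField]())
            ++ df.schema.fields ++
          (if (inFront) Array[StructField]() else Array(StructField(colName,LongType,false)))
        )
      )
    }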
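
    A usage sketch may help show how the three optional parameters interact. It is not part of the original answer. It assumes Spark 2.x or later and a spark-shell session (paste with :paste). The session settings, the "letter" data, and the "row_num" column name are made up for illustration. The function is restated in compact form only so the snippet runs on its own, and it uses df.sparkSession where the answer uses df.sqlContext.

    ```scala
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Compact restatement of dfZipWithIndex so this snippet is self-contained.
    def dfZipWithIndex(
      df: DataFrame,
      offset: Int = 1,
      colName: String = "id",
      inFront: Boolean = true
    ): DataFrame = {
      val idField = StructField(colName, LongType, nullable = false)
      // zipWithIndex yields (row, 0-based position); shift the position by offset.
      val rows = df.rdd.zipWithIndex.map { case (row, idx) =>
        val id = idx + offset
        Row.fromSeq(if (inFront) id +: row.toSeq else row.toSeq :+ id)
      }
      val fields =
        if (inFront) idField +: df.schema.fields else df.schema.fields :+ idField
      df.sparkSession.createDataFrame(rows, StructType(fields))
    }

    // Hypothetical local session; in spark-shell this returns the existing one.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("zip-with-index-demo")
      .getOrCreate()
    import spark.implicits._

    // Illustrative three-row DataFrame with a single string column.
    val letters = Seq("a", "b", "c").toDF("letter")

    // Start numbering at 0, name the column "row_num", and append it at the end.
    val indexed = dfZipWithIndex(letters, offset = 0, colName = "row_num", inFront = false)
    indexed.printSchema()
    indexed.show()
    ```

    Calling dfZipWithIndex(letters) with no extra arguments would instead prepend an "id" column that starts at 1. Note that the function round-trips through the RDD API, and RDD.zipWithIndex triggers a Spark job of its own when the RDD has more than one partition.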
    
