DataFrame-ified zipWithIndex

悲哀的现实 2020-11-27 04:23

I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RD

8 Answers
  •  一整个雨季
    2020-11-27 04:47

    I modified @Tagar's version to run on Python 3.7 and wanted to share it:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import LongType, StructField, StructType

    def dfZipWithIndex(df, offset=1, colName="rowId"):
        '''
        Enumerates dataframe rows in native order, like rdd.zipWithIndex(), but on a dataframe,
        and preserves the schema.

        :param df: source dataframe
        :param offset: adjustment to zipWithIndex()'s index
        :param colName: name of the index column
        '''
        # Put the new index field in front of the existing schema fields.
        new_schema = StructType(
            [StructField(colName, LongType(), True)]
            + df.schema.fields
        )

        zipped_rdd = df.rdd.zipWithIndex()

        # Python 3 passes each (row, index) pair as a single tuple argument,
        # so read its elements by position instead of unpacking in the lambda signature.
        new_rdd = zipped_rdd.map(lambda args: [args[1] + offset] + list(args[0]))

        # The original relied on a global `spark`; fetching the active session is an assumption.
        spark = SparkSession.builder.getOrCreate()
        return spark.createDataFrame(new_rdd, new_schema)
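
To see what the map step does to each (row, index) pair without a Spark session, the transformation can be sketched in plain Python. This is only an illustration under assumptions: a list comprehension over `enumerate` stands in for `zipWithIndex()`, and `add_index` and the sample rows are hypothetical names that do not appear in the answer above.

```python
def add_index(rows, offset=1):
    """Prepend a sequence number to each row, mimicking the map step."""
    # zipWithIndex() yields (row, index) pairs; emulate that ordering here.
    zipped = [(row, idx) for idx, row in enumerate(rows)]
    # Same shape as the Spark lambda: one tuple argument, read by position.
    return [[args[1] + offset] + list(args[0]) for args in zipped]


rows = [("a", 10), ("b", 20), ("c", 30)]
print(add_index(rows))  # [[1, 'a', 10], [2, 'b', 20], [3, 'c', 30]]
```

The index lands in the first position of every row, which is why the new `StructField` is placed in front of `df.schema.fields` when the schema is built.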
    
