How to select the last row, and how to access a PySpark dataframe by index?

眼角桃花 2020-12-10 12:27

From a PySpark SQL dataframe like

name age city
abc   20  A
def   30  B

How to get the last row? (Like by df.limit(1) I can get the first row of the dataframe into a new dataframe.)

4 Answers
  •  旧巷少年郎
    2020-12-10 13:14

    Use the following to add an index column that contains monotonically increasing, unique, and consecutive integers, which is not how monotonically_increasing_id() works. The indexes ascend in the same order as the colName column of your DataFrame.

    import pyspark.sql.functions as F
    from pyspark.sql.window import Window as W
    
    # A running frame over rows ordered by colName, from the first row up to the current one.
    window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)
    
    # Summing a constant 1 over this window yields a consecutive, 1-based index.
    df = df \
        .withColumn('int', F.lit(1)) \
        .withColumn('index', F.sum('int').over(window)) \
        .drop('int')
    

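    With this index in place, the last row asked about in the question can be pulled out directly; a minimal sketch reusing the index column built above:

    df.orderBy(F.desc('index')).limit(1).show()
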
    Use the following code to look at the tail, i.e. the last rownums rows of the DataFrame.

    rownums = 10
    df.where(F.col('index') > df.count() - rownums).show()
    

    Use the following code to look at the rows from start_row to end_row of the DataFrame.

    start_row = 20
    end_row = start_row + 10
    df.where((F.col('index') > start_row) & (F.col('index') <= end_row)).show()
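
    A single row, say row number 12, can be fetched the same way; a usage sketch against the index column built above:

    df.where(F.col('index') == 12).show()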

    zipWithIndex() is an RDD method that does produce monotonically increasing, unique, and consecutive integers, but it appears to be much slower to use in a way that gets you back to your original DataFrame amended with an index column.
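
    A minimal sketch of that zipWithIndex() approach (assuming an active SparkSession named spark; the extended schema and the 'index' column name are illustrative, not part of the original answer):

    from pyspark.sql.types import LongType, StructField, StructType

    # Extend the original schema with a LongType 'index' field.
    schema = StructType(df.schema.fields + [StructField('index', LongType(), False)])

    # zipWithIndex() pairs each Row with its 0-based position; flatten each
    # (Row, index) pair back into a single tuple so it matches the extended schema.
    indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
    df_indexed = spark.createDataFrame(indexed_rdd, schema)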
