How to select the last row, and how to access a PySpark dataframe by index?

眼角桃花 2020-12-10 12:27

From a PySpark SQL dataframe like

name age city
abc   20  A
def   30  B

How to get the last row? (Like by df.limit(1) I can get the first row of the dataframe into a new dataframe.)

4 Answers
  •  旧巷少年郎
    2020-12-10 13:14

    Use the following to add an index column that contains monotonically increasing, unique, and consecutive integers, which is not how monotonically_increasing_id() works. The indexes ascend in the same order as the colName column of your DataFrame.

    import pyspark.sql.functions as F
    from pyspark.sql.window import Window as W
    
    # A running frame over rows ordered by colName, from the first row up to the current one.
    window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)
    
    # Summing a constant 1 over this window yields a consecutive, 1-based index.
    df = df \
        .withColumn('int', F.lit(1)) \
        .withColumn('index', F.sum('int').over(window)) \
        .drop('int')
    

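    With this index in place, the last row asked about in the question can be pulled out directly; a minimal sketch reusing the index column built above:

    df.orderBy(F.desc('index')).limit(1).show()
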
    Use the following code to look at the tail, i.e. the last rownums rows of the DataFrame.

    rownums = 10
    df.where(F.col('index') > df.count() - rownums).show()
    

    Use the following code to look at the rows from start_row to end_row of the DataFrame.

    start_row = 20
    end_row = start_row + 10
    df.where((F.col('index') > start_row) & (F.col('index') <= end_row)).show()
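
    A single row, say row number 12, can be fetched the same way; a usage sketch against the index column built above:

    df.where(F.col('index') == 12).show()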

    zipWithIndex() is an RDD method that does produce monotonically increasing, unique, and consecutive integers, but it appears to be much slower to use in a way that gets you back to your original DataFrame amended with an index column.
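
    A minimal sketch of that zipWithIndex() approach (assuming an active SparkSession named spark; the extended schema and the 'index' column name are illustrative, not part of the original answer):

    from pyspark.sql.types import LongType, StructField, StructType

    # Extend the original schema with a LongType 'index' field.
    schema = StructType(df.schema.fields + [StructField('index', LongType(), False)])

    # zipWithIndex() pairs each Row with its 0-based position; flatten each
    # (Row, index) pair back into a single tuple so it matches the extended schema.
    indexed_rdd = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
    df_indexed = spark.createDataFrame(indexed_rdd, schema)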
