How to select the last row, and how to access a PySpark dataframe by index?

眼角桃花 2020-12-10 12:27

From a PySpark SQL dataframe like

name  age  city
abc   20   A
def   30   B

How can I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access dataframe rows by index?
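
For reference, a minimal sketch that builds the sample dataframe above (the SparkSession setup is an assumption, not shown in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("abc", 20, "A"), ("def", 30, "B")],
    ["name", "age", "city"],
)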

4 Answers
  •  春和景丽
    2020-12-10 12:59

    How can I get the last row?

    A long and ugly way, which assumes that all columns are orderable:

    from pyspark.sql.functions import (
        col, max as max_, struct, monotonically_increasing_id
    )

    # Attach a monotonically increasing id, pick the struct with the
    # largest id (structs compare field by field, so _id decides),
    # then unpack the struct and drop the helper column.
    last_row = (df
        .withColumn("_id", monotonically_increasing_id())
        .select(max_(struct("_id", *df.columns)).alias("tmp"))
        .select(col("tmp.*"))
        .drop("_id"))
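
    On the sample data this picks the def row; a quick check (assuming the df built in the question):

    last_row.show()
    # +----+---+----+
    # |name|age|city|
    # +----+---+----+
    # | def| 30|   B|
    # +----+---+----+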
    

    If not all columns can be ordered, you can try:

    # Compute the largest generated id first, then filter for the
    # row that carries it.
    with_id = df.withColumn("_id", monotonically_increasing_id())
    i = with_id.select(max_("_id")).first()[0]

    with_id.where(col("_id") == i).drop("_id")
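
    Note that this takes two passes: first() triggers a job to find the maximum id, and the filtered result is only computed when the next action runs. Also, monotonically_increasing_id only guarantees ids that increase with the dataframe's current partition order, so "last" here means last in that order.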
    

    Note: there is a last function in pyspark.sql.functions / o.a.s.sql.functions, but considering the description of the corresponding expressions, it is not a good choice here.
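    As an aside (not part of the original answer): on Spark 3.0+ there is also DataFrame.tail, which collects the last n rows to the driver:

    # Assumes Spark >= 3.0; tail() returns a list of Row objects
    # collected to the driver, so keep n small.
    last = df.tail(1)[0]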

    how can I access the dataframe rows by index?

    You cannot. A Spark DataFrame is not indexed and cannot be accessed by index. You can add indices using zipWithIndex and filter later; just keep in mind this is an O(N) operation (a sketch follows below).
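
    A minimal sketch of that zipWithIndex approach (the helper column name _idx and the target index 12 are illustrative, not from the answer):

    from pyspark.sql import Row
    from pyspark.sql.functions import col

    # Zip each row with its position, promote the position to a column,
    # then filter for the row at the desired index.
    indexed = (df.rdd
        .zipWithIndex()
        .map(lambda pair: Row(_idx=pair[1], **pair[0].asDict()))
        .toDF())

    row_12 = indexed.where(col("_idx") == 12).drop("_idx")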
