How to select last row and also how to access PySpark dataframe by index?

眼角桃花 2020-12-10 12:27

From a PySpark SQL dataframe like

name age city
abc   20  A
def   30  B

How do I get the last row? (By analogy: with df.limit(1) I can get the first row of the dataframe.)

4 Answers
  •  無奈伤痛
    2020-12-10 12:56

    How to get the last row.

    If you have a column you can use to order the dataframe, for example "index", then one easy way to get the last record is with SQL: 1) order the table in descending order and 2) take the first value from that order:

    df.createOrReplaceTempView("table_df")
    query_latest_rec = """SELECT * FROM table_df ORDER BY index DESC LIMIT 1"""
    latest_rec = spark.sql(query_latest_rec)  # "spark" is the active SparkSession
    latest_rec.show()
    

    And how can I access dataframe rows by index, like row no. 12 or 200?

    In a similar way you can get the record at any row number:

    row_number = 12
    df.createOrReplaceTempView("table_df")
    query_latest_rec = """SELECT * FROM (SELECT * FROM table_df ORDER BY index ASC LIMIT {0}) ord_lim ORDER BY index DESC LIMIT 1"""
    latest_rec = spark.sql(query_latest_rec.format(row_number))
    latest_rec.show()
    

    If you do not have an "index" column, you can create one using `monotonically_increasing_id` (note that the generated ids are guaranteed to be increasing, but not necessarily consecutive):

    from pyspark.sql.functions import monotonically_increasing_id
    
    df = df.withColumn("index", monotonically_increasing_id())
    
