How to select last row and also how to access PySpark dataframe by index?

眼角桃花 2020-12-10 12:27

From a PySpark SQL dataframe like

name age city
abc   20  A
def   30  B

How to get the last row? (For example, with df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access the dataframe rows by index, like row no. 12 or 200?

4 Answers
  • 2020-12-10 12:56

    How to get the last row.

    If you have a column that you can use to order the dataframe, for example "index", then one easy way to get the last record is to use SQL: 1) order your table in descending order and 2) take the first value from that order:

    df.createOrReplaceTempView("table_df")
    query_latest_rec = """SELECT * FROM table_df ORDER BY index DESC LIMIT 1"""
    latest_rec = spark.sql(query_latest_rec)  # assumes an active SparkSession named `spark`
    latest_rec.show()
    

    And how can I access the dataframe rows by index, like row no. 12 or 200?

    In a similar way you can get the record at any row number:

    row_number = 12
    df.createOrReplaceTempView("table_df")
    query_latest_rec = """SELECT * FROM (SELECT * FROM table_df ORDER BY index ASC LIMIT {0}) ord_lim ORDER BY index DESC LIMIT 1"""
    latest_rec = spark.sql(query_latest_rec.format(row_number))  # assumes an active SparkSession named `spark`
    latest_rec.show()
    

    If you do not have an "index" column, you can create one using

    from pyspark.sql.functions import monotonically_increasing_id
    
    df = df.withColumn("index", monotonically_increasing_id())
    
  • 2020-12-10 12:59

    How to get the last row.

    A long and ugly way, which assumes that all columns are orderable:

    from pyspark.sql.functions import (
        col, max as max_, struct, monotonically_increasing_id
    )
    
    last_row = (df
        .withColumn("_id", monotonically_increasing_id())
        .select(max_(struct("_id", *df.columns)).alias("tmp"))
        .select(col("tmp.*"))
        .drop("_id"))
    

    If not all columns can be ordered, you can try:

    with_id = df.withColumn("_id", monotonically_increasing_id())
    i = with_id.select(max_("_id")).first()[0]
    
    with_id.where(col("_id") == i).drop("_id")
    

    Note: there is a last function in pyspark.sql.functions / o.a.s.sql.functions, but considering the description of the corresponding expressions it is not a good choice here.

    how can I access the dataframe rows by index, like row no. 12 or 200

    You cannot. A Spark DataFrame is not accessible by index. You can add indices using zipWithIndex and filter later, as in the sketch below. Just keep in mind that this is an O(N) operation.
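
    A minimal sketch of that idea, assuming the DataFrame df from the question; n is a hypothetical 0-based target position:

    n = 12  # hypothetical target row position (0-based)

    matching = (df.rdd
        .zipWithIndex()                      # pair each Row with its position, O(N)
        .filter(lambda pair: pair[1] == n)   # keep only the requested position
        .map(lambda pair: pair[0])           # drop the index again
        .collect())                          # empty list if n is out of range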

  • 2020-12-10 13:14

    Use the following to get an index column that contains monotonically increasing, unique, and consecutive integers, which is not how monotonically_increasing_id() works. The indexes will be ascending in the same order as colName of your DataFrame.

    import pyspark.sql.functions as F
    from pyspark.sql.window import Window as W
    
    window = W.orderBy('colName').rowsBetween(W.unboundedPreceding, W.currentRow)
    
    df = df\
        .withColumn('int', F.lit(1))\
        .withColumn('index', F.sum('int').over(window))\
        .drop('int')
    

    Use the following code to look at the tail, i.e. the last rownums rows, of the DataFrame.

    rownums = 10
    df.where(F.col('index')>df.count()-rownums).show()
    

    Use the following code to look at the rows from start_row to end_row of the DataFrame.

    start_row = 20
    end_row = start_row + 10
    df.where((F.col('index')>start_row) & (F.col('index')<end_row)).show()
    

    zipWithIndex() is an RDD method that does return monotonically increasing, unique, and consecutive integers, but it appears to be much slower to implement in a way where you can get back to your original DataFrame amended with an id column, as sketched below.
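
    A rough sketch of that zipWithIndex() round trip, assuming an active SparkSession named spark and the df from the question; the extra RDD conversion and explicit schema handling is what tends to make it slower:

    from pyspark.sql.types import LongType, StructField, StructType

    # append the generated index as a new `id` column and rebuild a DataFrame
    rdd_with_id = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    schema_with_id = StructType(df.schema.fields + [StructField("id", LongType(), False)])
    df_with_id = spark.createDataFrame(rdd_with_id, schema_with_id)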

  • 2020-12-10 13:18
    from pyspark.sql import functions as F
    
    # take the last value of every column in a single aggregation
    expr = [F.last(col).alias(col) for col in df.columns]
    
    df.agg(*expr)
    

    Just a tip: it looks like you still have the mindset of someone who works with pandas or R. Spark is a different paradigm in the way we work with data: you don't access data inside individual cells anymore, you work with whole chunks of it. If you keep collecting stuff and performing actions, like you just did, you lose the whole concept of parallelism that Spark provides. Take a look at the concept of transformations vs actions in Spark, illustrated briefly below.
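
    A tiny illustration of that distinction, assuming the example DataFrame df from the question:

    adults = df.where(df.age > 25)   # transformation: lazy, only builds a plan
    names = adults.select("name")    # still a transformation, nothing runs yet

    names.show()                     # action: the plan actually executes
    total = names.count()            # another action: triggers a second job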
