Date difference between consecutive rows - Pyspark Dataframe

無奈伤痛 2020-12-11 02:40

I have a table with the following structure:

USER_ID     Tweet_ID                 Date
  1           1001       Thu Aug 05 19:11:39 +0000 2010
  1           6022

How do I compute, for each user, the time elapsed between consecutive tweets (ordered by Date)?
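
For reproducibility, here is a minimal setup sketch. The column names and the to_timestamp pattern are illustrative assumptions; the Date values are Twitter-style strings and must be parsed into timestamps before any arithmetic:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_timestamp

    spark = SparkSession.builder.getOrCreate()

    # Sample rows mirroring the table above (the second row's Date is missing).
    df = spark.createDataFrame(
        [(1, 1001, "Thu Aug 05 19:11:39 +0000 2010"),
         (1, 6022, None)],
        ["user_id", "tweet_id", "date_str"])

    # Parse the Twitter-style string into a timestamp column. Pattern letters
    # follow Java's DateTimeFormatter; if Spark 3+ rejects the pattern, setting
    # spark.sql.legacy.timeParserPolicy to LEGACY may be needed.
    df = df.withColumn("date", to_timestamp("date_str", "EEE MMM dd HH:mm:ss Z yyyy"))
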
3 Answers
  • 2020-12-11 03:12

    Another way could be:

    from pyspark.sql.functions import lag
    from pyspark.sql.window import Window

    # Assumes date is a timestamp column: casting a timestamp to bigint
    # yields Unix epoch seconds.
    df.withColumn(
        "time_intertweet",
        (df.date.cast("bigint")
         - lag(df.date.cast("bigint"), 1)
           .over(Window.partitionBy("user_id").orderBy("date"))
        ).cast("bigint"))
    
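    Since the cast gives epoch seconds, time_intertweet here is the gap in seconds between each tweet and the same user's previous tweet, and it is null for a user's first tweet because lag has nothing to look back at.
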
  • 2020-12-11 03:13

    EDITED thanks to @cool_kid

    @Joesemy's answer is really good, but it didn't work for me because cast("bigint") threw an error. So I used the datediff function from pyspark.sql.functions instead, and it worked:

    from pyspark.sql.functions import datediff, lag
    from pyspark.sql.window import Window

    # datediff returns the difference in whole days between the two dates.
    df.withColumn(
        "time_intertweet",
        datediff(df.date,
                 lag(df.date, 1).over(Window.partitionBy("user_id").orderBy("date"))))
    
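    Note that datediff counts whole days, so two tweets on the same day come out as 0; if you need second-level granularity, use the bigint-cast approach from the other answers on a timestamp column.
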
  • 2020-12-11 03:19

    Like this:

    df.registerTempTable("df")

    sqlContext.sql("""
        SELECT *,
               CAST(date AS bigint) - CAST(lag(date, 1) OVER (
                   PARTITION BY user_id ORDER BY date) AS bigint) AS time_intertweet
        FROM df""")
    
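    registerTempTable and sqlContext are the Spark 1.x API. On Spark 2+, a rough equivalent (a sketch, assuming a SparkSession named spark) would be:

    # Spark 2+ style: temp views and spark.sql replace the old SQLContext entry point.
    df.createOrReplaceTempView("df")

    spark.sql("""
        SELECT *,
               CAST(date AS bigint) - CAST(lag(date, 1) OVER (
                   PARTITION BY user_id ORDER BY date) AS bigint) AS time_intertweet
        FROM df""")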