Date difference between consecutive rows - Pyspark Dataframe

無奈伤痛 · 2020-12-11 02:40

I have a table with the following structure:

USER_ID     Tweet_ID                 Date
  1           1001       Thu Aug 05 19:11:39 +0000 2010
  1           6022          
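
For reference, a minimal PySpark sketch of this DataFrame (the second row's Date is cut off in the excerpt above, so it is left as null here):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample rows from the table above; the second Date value is not
    # shown in the excerpt, so it is left as None.
    df = spark.createDataFrame(
        [(1, 1001, "Thu Aug 05 19:11:39 +0000 2010"),
         (1, 6022, None)],
        ["USER_ID", "Tweet_ID", "Date"],
    )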


        
3 Answers
  •  北海茫月 · 2020-12-11 03:12

    Another way could be:

    from pyspark.sql.functions import lag
    from pyspark.sql.window import Window

    # Seconds between consecutive tweets per user: subtract the previous
    # tweet's epoch seconds (lag over a per-user window ordered by date)
    # from the current tweet's. Assumes "date" is a timestamp column.
    df.withColumn(
        "time_intertweet",
        df.date.cast("bigint")
        - lag(df.date.cast("bigint"), 1).over(
            Window.partitionBy("user_id").orderBy("date")
        ),
    )
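
    If the Date column is still the raw string shown in the question (e.g. "Thu Aug 05 19:11:39 +0000 2010"), it has to be parsed into a timestamp before the cast to bigint. A minimal sketch, assuming that Twitter-style format; note that on Spark 3.x this pattern requires spark.sql.legacy.timeParserPolicy=LEGACY:

    from pyspark.sql.functions import to_timestamp, col

    # Parse the raw string into a timestamp; the format string is an
    # assumption based on the sample value shown in the question.
    parsed = df.withColumn(
        "Date", to_timestamp(col("Date"), "EEE MMM dd HH:mm:ss Z yyyy")
    )

    The window expression above can then be applied to parsed in place of df.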
    
