I have a table with the following structure:

    USER_ID  Tweet_ID  Date
    1        1001      Thu Aug 05 19:11:39 +0000 2010
    1        6022
Another way could be:
    from pyspark.sql.functions import lag
    from pyspark.sql.window import Window

    df.withColumn(
        "time_intertweet",
        (df.date.cast("bigint")
         - lag(df.date.cast("bigint"), 1).over(
             Window.partitionBy("user_id").orderBy("date"))
        ).cast("bigint")
    )
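For intuition, the window expression above computes, per user, the difference in seconds between each tweet's timestamp and the previous one (ordered by date). The same logic can be sketched in plain Python without Spark; the rows below are made-up sample data in the question's table format, and `intertweet_times` is a hypothetical helper, not part of any library:

```python
from datetime import datetime

# Hypothetical sample rows: (user_id, tweet_id, date), matching the table format.
rows = [
    (1, 1001, "Thu Aug 05 19:11:39 +0000 2010"),
    (1, 1002, "Thu Aug 05 19:21:39 +0000 2010"),
    (2, 2001, "Thu Aug 05 19:11:39 +0000 2010"),
]

FMT = "%a %b %d %H:%M:%S %z %Y"  # Twitter's created_at timestamp format

def intertweet_times(rows):
    """For each row, seconds since the same user's previous tweet
    (None for a user's first tweet), mirroring lag().over(window)."""
    out = []
    last_seen = {}  # user_id -> previous tweet's epoch seconds
    # Sort by user then date, like partitionBy("user_id").orderBy("date").
    ordered = sorted(rows, key=lambda r: (r[0], datetime.strptime(r[2], FMT)))
    for user_id, tweet_id, date in ordered:
        ts = int(datetime.strptime(date, FMT).timestamp())
        prev = last_seen.get(user_id)
        out.append((user_id, tweet_id, None if prev is None else ts - prev))
        last_seen[user_id] = ts
    return out
```

Here `intertweet_times(rows)` yields `None` for each user's first tweet, just as `lag` returns null when there is no preceding row in the partition.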