I have data like this:

df = sqlContext.createDataFrame([
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'z', 'null'),
    ('1986/10/15', 'z', 'null')])
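The dates here are plain strings in yyyy/MM/dd format, so they need to be converted before any date arithmetic. A minimal sketch, assuming Spark 2.2+ (where to_date() accepts a format string) and the default _1/_2/_3 column names that createDataFrame assigns when no schema is given:

import pyspark.sql.functions as funcs

# Parse the yyyy/MM/dd strings into a real DateType column; any value
# that doesn't match the pattern becomes NULL
df = df.withColumn('date', funcs.to_date(funcs.col('_1'), 'yyyy/MM/dd'))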
Alternatively, here is how to find the number of days that have passed between two consecutive actions by the same user in PySpark:
import pyspark.sql.functions as funcs
from pyspark.sql.window import Window

# One window per user, ordered by action date
window = Window.partitionBy('user_id').orderBy('action_date')
# Days between each action and that user's previous one (NULL for the first)
df = df.withColumn("days_passed",
                   funcs.datediff(df.action_date,
                                  funcs.lag(df.action_date, 1).over(window)))
+-------+-----------+-----------+
|user_id|action_date|days_passed|
+-------+-----------+-----------+
|    623| 2015-10-21|       null|
|    623| 2015-11-19|         29|
|    623| 2016-01-13|         59|
|    623| 2016-01-21|          8|
|    623| 2016-03-24|         63|
+-------+-----------+-----------+
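For reference, a self-contained sketch that reproduces the output above. The user_id and action_date values are taken from the table; the SparkSession setup is an assumption for modern PySpark, where sqlContext is no longer required:

from pyspark.sql import SparkSession
import pyspark.sql.functions as funcs
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Sample rows from the table above; the dates are yyyy-MM-dd strings,
# which to_date() parses without an explicit format
df = spark.createDataFrame(
    [('623', '2015-10-21'), ('623', '2015-11-19'), ('623', '2016-01-13'),
     ('623', '2016-01-21'), ('623', '2016-03-24')],
    ['user_id', 'action_date'])
df = df.withColumn('action_date', funcs.to_date('action_date'))

window = Window.partitionBy('user_id').orderBy('action_date')
df = df.withColumn('days_passed',
                   funcs.datediff(df.action_date,
                                  funcs.lag(df.action_date, 1).over(window)))
df.show()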