Problem: A time series of clickstream user-activity data is stored in Hive; the ask is to enrich it with a session id using Spark.
Session Definition (as implied by the code and the expected output below):
1. A session expires after 30 minutes of inactivity (no click from the user).
2. A session remains active for a total duration of at most 2 hours; beyond that, a new session starts.
----- Solution without using explode -----
In my view, explode is a heavy operation: to apply it you would first have to do a groupBy with collect_list, which pulls each user's entire click history onto a single row (a sketch of that alternative is shown after the result, for comparison).
`
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

streaming_data = [("U1", "2019-01-01T11:00:00Z"),
                  ("U1", "2019-01-01T11:15:00Z"),
                  ("U1", "2019-01-01T12:00:00Z"),
                  ("U1", "2019-01-01T12:20:00Z"),
                  ("U1", "2019-01-01T15:00:00Z"),
                  ("U2", "2019-01-01T11:00:00Z"),
                  ("U2", "2019-01-02T11:00:00Z"),
                  ("U2", "2019-01-02T11:25:00Z"),
                  ("U2", "2019-01-02T11:50:00Z"),
                  ("U2", "2019-01-02T12:15:00Z"),
                  ("U2", "2019-01-02T12:40:00Z"),
                  ("U2", "2019-01-02T13:05:00Z"),
                  ("U2", "2019-01-02T13:20:00Z")]
schema = ["UserId", "Click_Time"]

df_stream = spark.createDataFrame(streaming_data, schema)
df_stream = df_stream.withColumn("Click_Time", df_stream["Click_Time"].cast("timestamp"))

window_spec = Window.partitionBy("UserId").orderBy("Click_Time")

# Gap in hours between each click and the same user's previous click; the
# first click of a user has no predecessor, so its null gap is filled with 0.
df_stream = df_stream.withColumn(
    "time_diff",
    (f.unix_timestamp("Click_Time")
     - f.unix_timestamp(f.lag(f.col("Click_Time"), 1).over(window_spec))) / (60 * 60)
).na.fill(0)

# Flag clicks arriving after more than 30 minutes (0.5 h) of inactivity...
df_stream = df_stream.withColumn("cond_", f.when(f.col("time_diff") > 0.5, 1).otherwise(0))

# ...and turn the flags into a running counter, so every run of clicks less
# than 30 minutes apart shares one temp_session value.
df_stream = df_stream.withColumn("temp_session", f.sum(f.col("cond_")).over(window_spec))

# Within each temp_session, flag the clicks that fall more than 2 hours
# after the first click of that group (the 2-hour activity cap).
new_spec = Window.partitionBy("UserId", "temp_session").orderBy("Click_Time")
df_stream = df_stream \
    .withColumn("first_time_click", f.first(f.col("Click_Time")).over(new_spec)) \
    .withColumn("final_session_groups",
                f.when((f.unix_timestamp(f.col("Click_Time"))
                        - f.unix_timestamp(f.col("first_time_click"))) / (2 * 60 * 60) > 1, 1)
                .otherwise(0)) \
    .drop("first_time_click", "cond_")

# Combine both counters; the +1 makes session numbering start at 1.
df_stream = df_stream \
    .withColumn("final_session",
                df_stream["temp_session"] + df_stream["final_session_groups"] + 1) \
    .drop("temp_session", "final_session_groups", "time_diff")

df_stream = df_stream.withColumn(
    "session_id",
    f.concat(f.col("UserId"), f.lit(" session_val----->"), f.col("final_session")))
df_stream.show(20, 0)
`
--- Steps taken to solve ---
1. First, find the clicks that are less than 30 minutes apart and mark the continuous groups they form (the temp_session counter).
2. Then, within each of those groups, apply the 2-hour condition and flag the clicks that fall more than 2 hours after the group's first click (final_session_groups). This flag allows a single 2-hour split per group, which is enough for the sample data.
3. Sum these two counters and add 1 to populate the final_session column at the end of the algorithm, then concat as per your requirement to produce the session_id. (A snippet to inspect the intermediate counters follows.)
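To watch steps 1 and 2 in action, re-run the pipeline above without the two .drop(...) calls so the intermediate columns survive, then display them (a debugging sketch only):
`
# Debugging sketch: assumes the pipeline above was re-run WITHOUT the two
# .drop(...) calls, so time_diff, temp_session and final_session_groups still exist.
df_stream.select("UserId", "Click_Time", "time_diff",
                 "temp_session", "final_session_groups").show(20, 0)
`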
The result will look like this:
`
+------+---------------------+-------------+---------------------+
|UserId|Click_Time |final_session|session_id |
+------+---------------------+-------------+---------------------+
|U2 |2019-01-01 11:00:00.0|1 |U2 session_val----->1|
|U2 |2019-01-02 11:00:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 11:25:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 11:50:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 12:15:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 12:40:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 13:05:00.0|3 |U2 session_val----->3|
|U2 |2019-01-02 13:20:00.0|3 |U2 session_val----->3|
|U1 |2019-01-01 11:00:00.0|1 |U1 session_val----->1|
|U1 |2019-01-01 11:15:00.0|1 |U1 session_val----->1|
|U1 |2019-01-01 12:00:00.0|2 |U1 session_val----->2|
|U1 |2019-01-01 12:20:00.0|2 |U1 session_val----->2|
|U1 |2019-01-01 15:00:00.0|3 |U1 session_val----->3|
+------+---------------------+-------------+---------------------+
`
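For comparison, this is roughly the explode route avoided above: collect_list gathers each user's full history onto a single row (the heavy part), a Python UDF tags the sessions, and explode expands the rows back out. This is only a sketch under my assumptions: the tag_sessions helper is hypothetical, it reuses df_stream after the timestamp cast, and arrays_zip needs Spark 2.4+.
`
from pyspark.sql.types import ArrayType, IntegerType

# Hypothetical helper, not part of the solution above.
@f.udf(ArrayType(IntegerType()))
def tag_sessions(clicks):
    # clicks: ascending list of datetimes; a new session starts after a gap
    # of more than 30 minutes, or once the session spans more than 2 hours.
    sessions, session_no, prev, start = [], 1, None, None
    for t in clicks:
        if prev is None:
            start = t
        elif (t - prev).total_seconds() > 1800 or (t - start).total_seconds() > 7200:
            session_no += 1
            start = t
        sessions.append(session_no)
        prev = t
    return sessions

df_exploded = (df_stream
    .groupBy("UserId")
    .agg(f.sort_array(f.collect_list("Click_Time")).alias("clicks"))
    .withColumn("sessions", tag_sessions("clicks"))
    .select("UserId", f.explode(f.arrays_zip("clicks", "sessions")).alias("z"))
    .select("UserId",
            f.col("z.clicks").alias("Click_Time"),
            f.col("z.sessions").alias("final_session")))
df_exploded.show(20, 0)
`
Note how everything for one user funnels through a single task here, whereas the window-based solution keeps the work distributed within each partition.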
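Finally, since the clickstream lives in Hive, swap the inline test data for the table and write the enriched result back. A minimal sketch with hypothetical database/table names (assumes a Hive-enabled SparkSession):
`
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table and column names -- substitute your own.
df_stream = spark.table("clicks_db.user_clickstream").select("UserId", "Click_Time")

# ...apply the same transformations as above, then persist the enriched data:
df_stream.write.mode("overwrite").saveAsTable("clicks_db.user_clickstream_sessionized")
`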