Problem: A time series of clickstream user-activity data is stored in Hive; the ask is to enrich it with a session id using Spark.
Session Definition (as implied by the code and the expected output below):
1. A session expires after 30 minutes of inactivity (no click from the user).
2. A session remains active for a total duration of at most 2 hours; beyond that, a new session starts.
----- Solution without using explode -----
In my view, explode is a heavy operation: to apply it you would first have to do a groupBy with collect_list, which pulls each user's entire click history onto a single row (a sketch of that alternative is shown after the result, for comparison).
`
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

streaming_data = [("U1", "2019-01-01T11:00:00Z"),
                  ("U1", "2019-01-01T11:15:00Z"),
                  ("U1", "2019-01-01T12:00:00Z"),
                  ("U1", "2019-01-01T12:20:00Z"),
                  ("U1", "2019-01-01T15:00:00Z"),
                  ("U2", "2019-01-01T11:00:00Z"),
                  ("U2", "2019-01-02T11:00:00Z"),
                  ("U2", "2019-01-02T11:25:00Z"),
                  ("U2", "2019-01-02T11:50:00Z"),
                  ("U2", "2019-01-02T12:15:00Z"),
                  ("U2", "2019-01-02T12:40:00Z"),
                  ("U2", "2019-01-02T13:05:00Z"),
                  ("U2", "2019-01-02T13:20:00Z")]
schema = ["UserId", "Click_Time"]

df_stream = spark.createDataFrame(streaming_data, schema)
df_stream = df_stream.withColumn("Click_Time", df_stream["Click_Time"].cast("timestamp"))

window_spec = Window.partitionBy("UserId").orderBy("Click_Time")

# Gap in hours between each click and the same user's previous click; the
# first click of a user has no predecessor, so its null gap is filled with 0.
df_stream = df_stream.withColumn(
    "time_diff",
    (f.unix_timestamp("Click_Time")
     - f.unix_timestamp(f.lag(f.col("Click_Time"), 1).over(window_spec))) / (60 * 60)
).na.fill(0)

# Flag clicks arriving after more than 30 minutes (0.5 h) of inactivity...
df_stream = df_stream.withColumn("cond_", f.when(f.col("time_diff") > 0.5, 1).otherwise(0))

# ...and turn the flags into a running counter, so every run of clicks less
# than 30 minutes apart shares one temp_session value.
df_stream = df_stream.withColumn("temp_session", f.sum(f.col("cond_")).over(window_spec))

# Within each temp_session, flag the clicks that fall more than 2 hours
# after the first click of that group (the 2-hour activity cap).
new_spec = Window.partitionBy("UserId", "temp_session").orderBy("Click_Time")
df_stream = df_stream \
    .withColumn("first_time_click", f.first(f.col("Click_Time")).over(new_spec)) \
    .withColumn("final_session_groups",
                f.when((f.unix_timestamp(f.col("Click_Time"))
                        - f.unix_timestamp(f.col("first_time_click"))) / (2 * 60 * 60) > 1, 1)
                .otherwise(0)) \
    .drop("first_time_click", "cond_")

# Combine both counters; the +1 makes session numbering start at 1.
df_stream = df_stream \
    .withColumn("final_session",
                df_stream["temp_session"] + df_stream["final_session_groups"] + 1) \
    .drop("temp_session", "final_session_groups", "time_diff")

df_stream = df_stream.withColumn(
    "session_id",
    f.concat(f.col("UserId"), f.lit(" session_val----->"), f.col("final_session")))
df_stream.show(20, 0)
`
--- Steps taken to solve ---
1. First, find the clicks that are less than 30 minutes apart and mark the continuous groups they form (the temp_session counter).
2. Then, within each of those groups, apply the 2-hour condition and flag the clicks that fall more than 2 hours after the group's first click (final_session_groups). This flag allows a single 2-hour split per group, which is enough for the sample data.
3. Sum these two counters and add 1 to populate the final_session column at the end of the algorithm, then concat as per your requirement to produce the session_id. (A snippet to inspect the intermediate counters follows.)
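To watch steps 1 and 2 in action, re-run the pipeline above without the two .drop(...) calls so the intermediate columns survive, then display them (a debugging sketch only):
`
# Debugging sketch: assumes the pipeline above was re-run WITHOUT the two
# .drop(...) calls, so time_diff, temp_session and final_session_groups still exist.
df_stream.select("UserId", "Click_Time", "time_diff",
                 "temp_session", "final_session_groups").show(20, 0)
`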
The result will look like this:
`
+------+---------------------+-------------+---------------------+
|UserId|Click_Time |final_session|session_id |
+------+---------------------+-------------+---------------------+
|U2 |2019-01-01 11:00:00.0|1 |U2 session_val----->1|
|U2 |2019-01-02 11:00:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 11:25:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 11:50:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 12:15:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 12:40:00.0|2 |U2 session_val----->2|
|U2 |2019-01-02 13:05:00.0|3 |U2 session_val----->3|
|U2 |2019-01-02 13:20:00.0|3 |U2 session_val----->3|
|U1 |2019-01-01 11:00:00.0|1 |U1 session_val----->1|
|U1 |2019-01-01 11:15:00.0|1 |U1 session_val----->1|
|U1 |2019-01-01 12:00:00.0|2 |U1 session_val----->2|
|U1 |2019-01-01 12:20:00.0|2 |U1 session_val----->2|
|U1 |2019-01-01 15:00:00.0|3 |U1 session_val----->3|
+------+---------------------+-------------+---------------------+
`
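For comparison, this is roughly the explode route avoided above: collect_list gathers each user's full history onto a single row (the heavy part), a Python UDF tags the sessions, and explode expands the rows back out. This is only a sketch under my assumptions: the tag_sessions helper is hypothetical, it reuses df_stream after the timestamp cast, and arrays_zip needs Spark 2.4+.
`
from pyspark.sql.types import ArrayType, IntegerType

# Hypothetical helper, not part of the solution above.
@f.udf(ArrayType(IntegerType()))
def tag_sessions(clicks):
    # clicks: ascending list of datetimes; a new session starts after a gap
    # of more than 30 minutes, or once the session spans more than 2 hours.
    sessions, session_no, prev, start = [], 1, None, None
    for t in clicks:
        if prev is None:
            start = t
        elif (t - prev).total_seconds() > 1800 or (t - start).total_seconds() > 7200:
            session_no += 1
            start = t
        sessions.append(session_no)
        prev = t
    return sessions

df_exploded = (df_stream
    .groupBy("UserId")
    .agg(f.sort_array(f.collect_list("Click_Time")).alias("clicks"))
    .withColumn("sessions", tag_sessions("clicks"))
    .select("UserId", f.explode(f.arrays_zip("clicks", "sessions")).alias("z"))
    .select("UserId",
            f.col("z.clicks").alias("Click_Time"),
            f.col("z.sessions").alias("final_session")))
df_exploded.show(20, 0)
`
Note how everything for one user funnels through a single task here, whereas the window-based solution keeps the work distributed within each partition.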
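Finally, since the clickstream lives in Hive, swap the inline test data for the table and write the enriched result back. A minimal sketch with hypothetical database/table names (assumes a Hive-enabled SparkSession):
`
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical table and column names -- substitute your own.
df_stream = spark.table("clicks_db.user_clickstream").select("UserId", "Click_Time")

# ...apply the same transformations as above, then persist the enriched data:
df_stream.write.mode("overwrite").saveAsTable("clicks_db.user_clickstream_sessionized")
`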