Spark Advanced Window with dynamic last

Asked by 星月不相逢 · 2021-02-09 11:14

Problem: Given time-series data (a clickstream of user activity) stored in Hive, the task is to enrich the data with a session id using Spark.

Session Definition

- A session expires after one hour of inactivity (a gap of more than one hour between two clicks starts a new session).
- A session remains active for a total duration of at most two hours.
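Before the Spark answers, a minimal pure-Python sketch of these two rules on one user's sorted clicks may help make the expected grouping concrete (illustrative only; `sessionize` is a made-up helper, and it simplifies the cap by restarting the two-hour clock whenever a new session opens):

```python
from datetime import datetime, timedelta

def sessionize(clicks, gap=timedelta(hours=1), cap=timedelta(hours=2)):
    """Assign session numbers to one user's chronologically sorted clicks."""
    ids, session, start = [], 1, clicks[0]
    for prev, cur in zip([clicks[0]] + clicks, clicks):
        if cur - prev > gap or cur - start > cap:
            session += 1          # inactivity gap or two-hour cap exceeded
            start = cur
        ids.append(session)
    return ids

# U2's clicks on 2019-01-02: 25-minute gaps, so only the 2-hour cap splits them.
u2 = [datetime(2019, 1, 2, h, m) for h, m in
      [(11, 0), (11, 25), (11, 50), (12, 15), (12, 40), (13, 5), (13, 20)]]
print(sessionize(u2))  # -> [1, 1, 1, 1, 1, 2, 2]  (new session at 13:05)
```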

4 Answers
  •  陌清茗 (OP) · 2021-02-09 11:57

--- Solution without using explode ---

In my view, explode is a heavy operation here: to apply it you would first need a groupBy with collect_list, which pulls every click of a user onto a single row. A rough sketch of that avoided route is shown below for comparison.
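The sketch assumes the `df_stream` DataFrame built in the solution further down; the `sessionize` UDF and its result schema are made-up names, not part of the original answer:

```python
import pyspark.sql.functions as f
from pyspark.sql.types import (ArrayType, IntegerType, StructField,
                               StructType, TimestampType)
from datetime import timedelta

# Hypothetical explode-based route: collect_list gathers every click of a
# user onto one row, a Python UDF walks the sorted list applying the two
# session rules, and explode fans the rows back out.
out_type = ArrayType(StructType([StructField("click", TimestampType()),
                                 StructField("session", IntegerType())]))

@f.udf(out_type)
def sessionize(clicks):
    clicks = sorted(clicks)
    out, session, start = [], 1, clicks[0]
    for prev, cur in zip([clicks[0]] + clicks, clicks):
        if cur - prev > timedelta(hours=1) or cur - start > timedelta(hours=2):
            session, start = session + 1, cur
        out.append((cur, session))
    return out

df_sessions = (df_stream.groupBy("UserId")
               .agg(f.collect_list("Click_Time").alias("clicks"))
               .select("UserId", f.explode(sessionize("clicks")).alias("s"))
               .select("UserId", "s.click", "s.session"))
```

The groupBy/collect_list shuffle plus Python serialization is exactly the cost the window-based solution below avoids.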
    
    
    
The window-only solution:

```python
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

streaming_data = [("U1", "2019-01-01T11:00:00Z"),
                  ("U1", "2019-01-01T11:15:00Z"),
                  ("U1", "2019-01-01T12:00:00Z"),
                  ("U1", "2019-01-01T12:20:00Z"),
                  ("U1", "2019-01-01T15:00:00Z"),
                  ("U2", "2019-01-01T11:00:00Z"),
                  ("U2", "2019-01-02T11:00:00Z"),
                  ("U2", "2019-01-02T11:25:00Z"),
                  ("U2", "2019-01-02T11:50:00Z"),
                  ("U2", "2019-01-02T12:15:00Z"),
                  ("U2", "2019-01-02T12:40:00Z"),
                  ("U2", "2019-01-02T13:05:00Z"),
                  ("U2", "2019-01-02T13:20:00Z")]
schema = ("UserId", "Click_Time")

df_stream = spark.createDataFrame(streaming_data, schema)
df_stream = df_stream.withColumn("Click_Time", df_stream["Click_Time"].cast("timestamp"))

window_spec = Window.partitionBy("UserId").orderBy("Click_Time")

# Gap in hours between each click and the user's previous click; the first
# click has no predecessor, so the resulting null is filled with 0.
df_stream = df_stream.withColumn(
    "time_diff",
    (f.unix_timestamp("Click_Time")
     - f.unix_timestamp(f.lag(f.col("Click_Time"), 1).over(window_spec))) / (60 * 60)
).na.fill(0)

# Rule 1: a gap of more than one hour opens a new session. A running sum of
# the flag gives a per-user group id for the inactivity rule.
df_stream = df_stream.withColumn("cond_", f.when(f.col("time_diff") > 1, 1).otherwise(0))
df_stream = df_stream.withColumn("temp_session", f.sum(f.col("cond_")).over(window_spec))

# Rule 2: within each inactivity group, flag clicks that fall more than two
# hours after the group's first click (the session-duration cap).
new_spec = Window.partitionBy("UserId", "temp_session").orderBy("Click_Time")
df_stream = (df_stream
             .withColumn("first_time_click", f.first(f.col("Click_Time")).over(new_spec))
             .withColumn("final_session_groups",
                         f.when((f.unix_timestamp(f.col("Click_Time"))
                                 - f.unix_timestamp(f.col("first_time_click"))) / (2 * 60 * 60) > 1, 1)
                         .otherwise(0))
             .drop("first_time_click", "cond_"))

# Combine both group counters (+1 so session numbers start at 1).
df_stream = (df_stream
             .withColumn("final_session",
                         df_stream["temp_session"] + df_stream["final_session_groups"] + 1)
             .drop("temp_session", "final_session_groups", "time_diff"))
df_stream = df_stream.withColumn(
    "session_id",
    f.concat(f.col("UserId"), f.lit(" session_val----->"), f.col("final_session")))

df_stream.show(20, False)
```
    

--- Steps taken to solve ---

1. First, find the clicks that are less than one hour apart and build continuous groups from them (the running sum over `cond_` gives `temp_session`).
2. Then split those groups further based on the two-hour condition (`final_session_groups`).
3. Sum the two group counters and add 1 to populate the `final_session` column, then concatenate as required to produce `session_id`. For example, U2's click at 2019-01-02 13:05 has `temp_session` = 1 (from the 24-hour gap before 2019-01-02 11:00) and `final_session_groups` = 1 (it falls 2 h 05 m after the group's first click), so `final_session` = 1 + 1 + 1 = 3. A debugging sketch for inspecting these columns follows this list.
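To see steps 1 and 2 in action, a hypothetical debugging step (not in the original answer) is to rerun the pipeline without its two `.drop(...)` calls and inspect the helper columns for one user:

```python
# Assumes the pipeline above was rerun with the two .drop(...) calls removed,
# so time_diff, temp_session and final_session_groups are still present.
(df_stream.filter(f.col("UserId") == "U2")
          .select("Click_Time", "time_diff", "temp_session",
                  "final_session_groups", "final_session")
          .orderBy("Click_Time")
          .show(truncate=False))
```

Note that `final_session_groups` comes from `when(...).otherwise(0)`, so it is a 0/1 flag: each inactivity group is split at most once by the two-hour cap, which is enough for this sample data.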

The result looks like this:

```
+------+---------------------+-------------+---------------------+
|UserId|Click_Time           |final_session|session_id           |
+------+---------------------+-------------+---------------------+
|U2    |2019-01-01 11:00:00.0|1            |U2 session_val----->1|
|U2    |2019-01-02 11:00:00.0|2            |U2 session_val----->2|
|U2    |2019-01-02 11:25:00.0|2            |U2 session_val----->2|
|U2    |2019-01-02 11:50:00.0|2            |U2 session_val----->2|
|U2    |2019-01-02 12:15:00.0|2            |U2 session_val----->2|
|U2    |2019-01-02 12:40:00.0|2            |U2 session_val----->2|
|U2    |2019-01-02 13:05:00.0|3            |U2 session_val----->3|
|U2    |2019-01-02 13:20:00.0|3            |U2 session_val----->3|
|U1    |2019-01-01 11:00:00.0|1            |U1 session_val----->1|
|U1    |2019-01-01 11:15:00.0|1            |U1 session_val----->1|
|U1    |2019-01-01 12:00:00.0|2            |U1 session_val----->2|
|U1    |2019-01-01 12:20:00.0|2            |U1 session_val----->2|
|U1    |2019-01-01 15:00:00.0|3            |U1 session_val----->3|
+------+---------------------+-------------+---------------------+
```
