Hive query generating identifiers for a sequence of rows matching a condition

执笔经年 2021-01-23 11:54

Let's say I have the following Hive table as input, let's call it connections:

userid  | timestamp   
--------|-------------
1       | 1433258019          


        
3 answers
  •  野性不改
    2021-01-23 11:57

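    Interesting question. Per your comment to @Madhu, I added the line 2 1433258172 to your example. What you need is a per-user counter that increments every time timediff > 60 holds, where timediff is the gap between a row's timestamp and the same user's previous timestamp. The easiest way to do this is to flag those rows with a 1 and then take a cumulative sum of the flag over a window partitioned by userid and ordered by timestamp. Because lag defaults to 0 for each user's first row, that row's timediff is the timestamp itself, so it is always flagged and session numbering starts at 1.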

    Query:

    select userid
      , timestamp
      , concat('user', userid, '-session-', s_sum) sessionid
    from (
      select *
        , sum( counter ) over (partition by userid
                               order by timestamp asc
                               rows between unbounded preceding and current row) s_sum
      from (
        select *
          , case when timediff > 60 then 1 else 0 end as counter
        from (
          select userid
            , timestamp
            , timestamp - lag(timestamp, 1, 0) over (partition by userid
                                                     order by timestamp asc) timediff
          from connections ) x ) y ) z
    

    Output:

    1   1433258019  user1-session-1
    1   1433258020  user1-session-1
    2   1433258080  user2-session-1
    2   1433258083  user2-session-1
    2   1433258088  user2-session-1
    2   1433258170  user2-session-2
    2   1433258172  user2-session-2
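
To make the flag-and-cumulative-sum idea easier to check outside Hive, here is a minimal Python sketch of the same logic. It is not part of the original answer: the function name assign_sessions and the gap parameter are hypothetical, and the input rows are simply the (userid, timestamp) pairs from the Output listing above.

```python
from itertools import groupby
from operator import itemgetter


def assign_sessions(rows, gap=60):
    """Label (userid, timestamp) rows with a per-user session id.

    Hypothetical helper mirroring the Hive query: a row opens a new
    session when its gap from the user's previous row exceeds `gap`.
    """
    labelled = []
    # Sorting by (userid, timestamp) plays the role of
    # "partition by userid order by timestamp asc".
    for userid, group in groupby(sorted(rows), key=itemgetter(0)):
        previous = 0   # same default as lag(timestamp, 1, 0)
        session = 0    # running sum of the 0/1 flags
        for _, timestamp in group:
            flag = 1 if timestamp - previous > gap else 0
            session += flag
            labelled.append((userid, timestamp, f"user{userid}-session-{session}"))
            previous = timestamp
    return labelled


# Rows taken from the Output listing of the answer.
rows = [
    (1, 1433258019), (1, 1433258020),
    (2, 1433258080), (2, 1433258083), (2, 1433258088),
    (2, 1433258170), (2, 1433258172),
]

for userid, timestamp, sessionid in assign_sessions(rows):
    print(userid, timestamp, sessionid)
```

The running session variable is the cumulative sum of the flags, so each printed row should carry the same session id as the corresponding row in the Hive output. The sketch relies on the same assumption as the query: every timestamp is larger than the gap, so a user's first row always opens session 1.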
    
