Definition and usage of Hive's lead and lag functions
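The lag and lead analytic functions let a single query return, as separate columns, the value of a field from the row n rows before the current row (lag) and from the row n rows after it (lead).
They can replace a self-join on the table and are generally more efficient than one. The over() clause defines the window, that is, the set of rows from the query result that the function operates on; the clauses inside the parentheses (such as partition by and order by) specify how that result set is partitioned and ordered.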
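Function overview
lag
lag(col, n, default) returns the value of col from the n-th row above the current row within the window.
Argument 1 is the column name; argument 2 is the number of rows to look back (optional, default 1); argument 3 is the default value, returned when the n-th row above does not exist because it falls outside the window (if omitted, NULL is returned).
lead
The opposite of lag.
lead(col, n, default) returns the value of col from the n-th row below the current row within the window.
Argument 1 is the column name; argument 2 is the number of rows to look ahead (optional, default 1); argument 3 is the default value, returned when the n-th row below does not exist because it falls outside the window (if omitted, NULL is returned).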
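To make the row-offset semantics concrete, here is a minimal sketch in plain Python rather than Hive. The helper names lag and lead are illustrative only, the input list stands for one window that is already ordered, and n is assumed to be non-negative.

```python
def lag(values, n=1, default=None):
    """For each row i, return the value n rows above it, or default if there is no such row."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """For each row i, return the value n rows below it, or default if there is no such row."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]


# One window, already ordered (None plays the role of NULL).
urls = ["url1", "url2", "url3", "url4", "url5"]

print(lag(urls))           # [None, 'url1', 'url2', 'url3', 'url4']
print(lead(urls, 2, "-"))  # ['url3', 'url4', 'url5', '-', '-']
```

Note that the default only fills positions where the offset runs past the edge of the window, which is why it appears at the start for lag and at the end for lead.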
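Using lead
User peter is browsing the web. At some moment peter opens a page, a while later opens another page, and so on. How do we compute how long peter stayed on a particular page, or the total time that users spent on a given page?
Data preparation
Suppose the users' behavior has already been collected, processed, and loaded into a Hive table with the following structure: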
create table test.user_log(
userid string,
time string,
url string
) row format delimited fields terminated by ',';
Sample records:
+------------------+----------------------+---------------+--+
| user_log.userid  | user_log.time        | user_log.url  |
+------------------+----------------------+---------------+--+
| peter            | 2015-10-12 01:10:00  | url1          |
| peter            | 2015-10-12 01:15:10  | url2          |
| peter            | 2015-10-12 01:16:40  | url3          |
| peter            | 2015-10-12 02:13:00  | url4          |
| peter            | 2015-10-12 03:14:30  | url5          |
| marry            | 2015-11-12 01:10:00  | url1          |
| marry            | 2015-11-12 01:15:10  | url2          |
| marry            | 2015-11-12 01:16:40  | url3          |
| marry            | 2015-11-12 02:13:00  | url4          |
| marry            | 2015-11-12 03:14:30  | url5          |
+------------------+----------------------+---------------+--+
Analysis steps
Get the start and end time of each user's stay on a page
select userid,
time stime,
lead(time) over(partition by userid order by time) etime,
url
from test.user_log;
Result:
+---------+----------------------+----------------------+-------+--+
| userid  | stime                | etime                | url   |
+---------+----------------------+----------------------+-------+--+
| marry   | 2015-11-12 01:10:00  | 2015-11-12 01:15:10  | url1  |
| marry   | 2015-11-12 01:15:10  | 2015-11-12 01:16:40  | url2  |
| marry   | 2015-11-12 01:16:40  | 2015-11-12 02:13:00  | url3  |
| marry   | 2015-11-12 02:13:00  | 2015-11-12 03:14:30  | url4  |
| marry   | 2015-11-12 03:14:30  | NULL                 | url5  |
| peter   | 2015-10-12 01:10:00  | 2015-10-12 01:15:10  | url1  |
| peter   | 2015-10-12 01:15:10  | 2015-10-12 01:16:40  | url2  |
| peter   | 2015-10-12 01:16:40  | 2015-10-12 02:13:00  | url3  |
| peter   | 2015-10-12 02:13:00  | 2015-10-12 03:14:30  | url4  |
| peter   | 2015-10-12 03:14:30  | NULL                 | url5  |
+---------+----------------------+----------------------+-------+--+
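The effect of partition by userid order by time can be traced by hand. The following plain-Python sketch (not Hive; the function name lead_over_partition is made up for illustration) sorts the sample records, splits them per user, and pairs each row with the time of the next row in the same partition.

```python
from itertools import groupby
from operator import itemgetter

# (userid, time, url) rows copied from the sample records.
rows = [
    ("peter", "2015-10-12 01:10:00", "url1"),
    ("peter", "2015-10-12 01:15:10", "url2"),
    ("peter", "2015-10-12 01:16:40", "url3"),
    ("peter", "2015-10-12 02:13:00", "url4"),
    ("peter", "2015-10-12 03:14:30", "url5"),
    ("marry", "2015-11-12 01:10:00", "url1"),
    ("marry", "2015-11-12 01:15:10", "url2"),
    ("marry", "2015-11-12 01:16:40", "url3"),
    ("marry", "2015-11-12 02:13:00", "url4"),
    ("marry", "2015-11-12 03:14:30", "url5"),
]


def lead_over_partition(rows):
    """Emulate lead(time) over(partition by userid order by time)."""
    result = []
    # Timestamps in this fixed-width format sort chronologically as strings.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for userid, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        for i, (_, stime, url) in enumerate(group):
            # The last row of each partition has no following row, hence None (NULL).
            etime = group[i + 1][1] if i + 1 < len(group) else None
            result.append((userid, stime, etime, url))
    return result


for row in lead_over_partition(rows):
    print(row)
```

Because the lookup never crosses a partition boundary, marry's last page gets no end time even though peter's rows exist in the same table.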
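Compute how long the user stayed on each page (in a real analysis, data cleaning is needed at this point: if a record shows a user staying on a page for 4 or 5 hours, that record is certainly unusable).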
select userid,
time stime,
lead(time) over(partition by userid order by time) etime,
-- MM is the month and HH the 24-hour clock; lowercase mm means minutes and hh the 12-hour clock
unix_timestamp(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss') - unix_timestamp(time,'yyyy-MM-dd HH:mm:ss') period,
url
from test.user_log;
Result (period is in seconds):
+---------+----------------------+----------------------+---------+-------+--+
| userid  | stime                | etime                | period  | url   |
+---------+----------------------+----------------------+---------+-------+--+
| marry   | 2015-11-12 01:10:00  | 2015-11-12 01:15:10  | 310     | url1  |
| marry   | 2015-11-12 01:15:10  | 2015-11-12 01:16:40  | 90      | url2  |
| marry   | 2015-11-12 01:16:40  | 2015-11-12 02:13:00  | 3380    | url3  |
| marry   | 2015-11-12 02:13:00  | 2015-11-12 03:14:30  | 3690    | url4  |
| marry   | 2015-11-12 03:14:30  | NULL                 | NULL    | url5  |
| peter   | 2015-10-12 01:10:00  | 2015-10-12 01:15:10  | 310     | url1  |
| peter   | 2015-10-12 01:15:10  | 2015-10-12 01:16:40  | 90      | url2  |
| peter   | 2015-10-12 01:16:40  | 2015-10-12 02:13:00  | 3380    | url3  |
| peter   | 2015-10-12 02:13:00  | 2015-10-12 03:14:30  | 3690    | url4  |
| peter   | 2015-10-12 03:14:30  | NULL                 | NULL    | url5  |
+---------+----------------------+----------------------+---------+-------+--+
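The period arithmetic and the cleaning step mentioned above can be checked with a small plain-Python sketch (not Hive). The 4-hour cutoff MAX_STAY_SECONDS and the helper names are assumptions chosen for illustration; the source only says that stays of 4 to 5 hours should be discarded.

```python
from datetime import datetime

# Python counterpart of the pattern 'yyyy-MM-dd HH:mm:ss'.
FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical cleaning threshold: drop stays of 4 hours or longer.
MAX_STAY_SECONDS = 4 * 3600


def period_seconds(stime, etime):
    """Seconds between stime and etime, or None when etime is missing (NULL)."""
    if etime is None:
        return None
    delta = datetime.strptime(etime, FMT) - datetime.strptime(stime, FMT)
    return int(delta.total_seconds())


def is_plausible(period):
    """Keep only stays with a known duration below the threshold."""
    return period is not None and period < MAX_STAY_SECONDS


# (stime, etime) pairs for peter, taken from the previous step's result.
visits = [
    ("2015-10-12 01:10:00", "2015-10-12 01:15:10"),
    ("2015-10-12 01:15:10", "2015-10-12 01:16:40"),
    ("2015-10-12 01:16:40", "2015-10-12 02:13:00"),
    ("2015-10-12 02:13:00", "2015-10-12 03:14:30"),
    ("2015-10-12 03:14:30", None),
]

periods = [period_seconds(s, e) for s, e in visits]
kept = [p for p in periods if is_plausible(p)]

print(periods)  # [310, 90, 3380, 3690, None]
print(kept)     # [310, 90, 3380, 3690]
```

In Hive the same filter would typically be a where clause on period in an outer query, since a window function result cannot be filtered in the same select that computes it.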
Compute the total time spent on each page, and the total time each user spent on each page
select nvl(url,'-1') url,
nvl(userid,'-1') userid,
sum(period) total_period from (
select userid,
time stime,
lead(time) over(partition by userid order by time) etime,
unix_timestamp(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss') - unix_timestamp(time,'yyyy-MM-dd HH:mm:ss') period,
url
from test.user_log
) a group by url, userid with rollup;
Result (rows where userid or url is -1 are the per-page subtotals and the grand total produced by with rollup):
+-------+---------+---------------+--+
| url   | userid  | total_period  |
+-------+---------+---------------+--+
| -1    | -1      | 14940         |
| url1  | -1      | 620           |
| url1  | marry   | 310           |
| url1  | peter   | 310           |
| url2  | -1      | 180           |
| url2  | marry   | 90            |
| url2  | peter   | 90            |
| url3  | -1      | 6760          |
| url3  | marry   | 3380          |
| url3  | peter   | 3380          |
| url4  | -1      | 7380          |
| url4  | marry   | 3690          |
| url4  | peter   | 3690          |
| url5  | -1      | NULL          |
| url5  | marry   | NULL          |
| url5  | peter   | NULL          |
+-------+---------+---------------+--+
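As a cross-check of the rollup totals, the sketch below reproduces the three grouping levels of group by url, userid with rollup in plain Python (not Hive). The function name rollup_sum is made up for illustration, and the '-1' labels mirror what nvl substitutes for the NULL keys of the subtotal rows.

```python
from collections import defaultdict

# Per-page periods in seconds from the previous step; both users have the same values.
page_periods = {"url1": 310, "url2": 90, "url3": 3380, "url4": 3690, "url5": None}
records = [
    (url, userid, period)
    for userid in ("marry", "peter")
    for url, period in page_periods.items()
]


def rollup_sum(records):
    """Sum period per (url, userid), per url, and overall, like group by ... with rollup."""
    buckets = defaultdict(list)
    for url, userid, period in records:
        # Each row contributes to its detail group, its url subtotal, and the grand total.
        for key in ((url, userid), (url, "-1"), ("-1", "-1")):
            buckets[key].append(period)
    totals = {}
    for key, periods in buckets.items():
        known = [p for p in periods if p is not None]
        # Like SQL sum(): NULLs are skipped, and an all-NULL group yields NULL.
        totals[key] = sum(known) if known else None
    return totals


totals = rollup_sum(records)
print(totals[("-1", "-1")])     # 14940
print(totals[("url1", "-1")])   # 620
print(totals[("url5", "-1")])   # None
```

The url5 groups stay empty because the last page of each session has no end time, so its duration is unknown rather than zero.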
Source: CSDN
Author: 北京小峻
Link: https://blog.csdn.net/weixin_45896475/article/details/104065803