Definition and usage of Hive's lead and lag functions

别等时光非礼了梦想. Submitted on 2020-01-22 05:38:29

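The lag and lead analytic functions return, within a single query, the value of a column from the row n rows before (lag) or n rows after (lead) the current row, as a separate column.
They can replace a self-join on the table and are usually more efficient than one. The over() clause defines the window of the result set that the function operates on; the expressions inside its parentheses (partition by, order by) specify how the rows are grouped and ordered.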

Function overview

lag
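lag(col,n,default) returns the value of col from the n-th row above the current row within the window
Argument 1 is the column name; argument 2 is the offset n (optional, defaults to 1); argument 3 is the default value, returned when the n-th row above falls outside the window (if omitted, NULL is returned)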

lead
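The opposite of lag
lead(col,n,default) returns the value of col from the n-th row below the current row within the window
Argument 1 is the column name; argument 2 is the offset n (optional, defaults to 1); argument 3 is the default value, returned when the n-th row below falls outside the window (if omitted, NULL is returned)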
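
The offset and default arguments are easiest to see on a tiny example. The following is a minimal Python sketch that emulates the behaviour described above on one already-ordered partition; it is not Hive's implementation, and the helper names lag and lead as well as the sample list are chosen here purely for illustration.

```python
def lag(values, n=1, default=None):
    """For each row i, return values[i - n], or default if that row is before the partition start."""
    if n < 0:
        raise ValueError("offset n must be non-negative")
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """For each row i, return values[i + n], or default if that row is past the partition end."""
    if n < 0:
        raise ValueError("offset n must be non-negative")
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]


# One partition, already sorted; None plays the role of a NULL column value.
urls = ["url1", "url2", None, "url4"]

print(lag(urls))             # [None, 'url1', 'url2', None]
print(lead(urls, 2, "end"))  # [None, 'url4', 'end', 'end']
```

Note that in the lead call the first result is None because the row two positions ahead exists and holds a NULL; the default "end" only appears where no such row exists.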

Using lead

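Suppose user peter is browsing a website: at some moment peter opens one page, a while later moves on to another page, and so on. How do we compute how long peter stayed on a particular page, or the total time all users spent on a given page?
Data preparation
The users' actions have been collected, processed and loaded into a Hive table with the following structure: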

create table test.user_log(
    userid string,
    time string,
    url string
) row format delimited fields terminated by ',';

Sample records:

+------------------+----------------------+---------------+--+
| user_log.userid  |    user_log.time     | user_log.url  |
+------------------+----------------------+---------------+--+
| peter            | 2015-10-12 01:10:00  | url1          |
| peter            | 2015-10-12 01:15:10  | url2          |
| peter            | 2015-10-12 01:16:40  | url3          |
| peter            | 2015-10-12 02:13:00  | url4          |
| peter            | 2015-10-12 03:14:30  | url5          |
| marry            | 2015-11-12 01:10:00  | url1          |
| marry            | 2015-11-12 01:15:10  | url2          |
| marry            | 2015-11-12 01:16:40  | url3          |
| marry            | 2015-11-12 02:13:00  | url4          |
| marry            | 2015-11-12 03:14:30  | url5          |
+------------------+----------------------+---------------+--+

Analysis steps
Get the start and end time of each user's stay on a page

select userid,
       time stime,
       lead(time) over(partition by userid order by time) etime,
       url 
  from test.user_log;

Result:

+---------+----------------------+----------------------+-------+--+
| userid  |        stime         |        etime         |  url  |
+---------+----------------------+----------------------+-------+--+
| marry   | 2015-11-12 01:10:00  | 2015-11-12 01:15:10  | url1  |
| marry   | 2015-11-12 01:15:10  | 2015-11-12 01:16:40  | url2  |
| marry   | 2015-11-12 01:16:40  | 2015-11-12 02:13:00  | url3  |
| marry   | 2015-11-12 02:13:00  | 2015-11-12 03:14:30  | url4  |
| marry   | 2015-11-12 03:14:30  | null                 | url5  |
| peter   | 2015-10-12 01:10:00  | 2015-10-12 01:15:10  | url1  |
| peter   | 2015-10-12 01:15:10  | 2015-10-12 01:16:40  | url2  |
| peter   | 2015-10-12 01:16:40  | 2015-10-12 02:13:00  | url3  |
| peter   | 2015-10-12 02:13:00  | 2015-10-12 03:14:30  | url4  |
| peter   | 2015-10-12 03:14:30  | null                 | url5  |
+---------+----------------------+----------------------+-------+--+
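
To make the roles of partition by and order by concrete, here is a small Python sketch that reproduces the table above from the sample records. It only imitates what lead(time) over(partition by userid order by time) does; the function name with_next_time is an assumption made for this illustration.

```python
from itertools import groupby
from operator import itemgetter

# Rebuild the ten sample records (userid, time, url) shown earlier.
visits = [("01:10:00", "url1"), ("01:15:10", "url2"), ("01:16:40", "url3"),
          ("02:13:00", "url4"), ("03:14:30", "url5")]
log = [(user, f"{day} {clock}", url)
       for user, day in (("peter", "2015-10-12"), ("marry", "2015-11-12"))
       for clock, url in visits]


def with_next_time(rows):
    """Pair every row with the time of the same user's next row (None for the last one)."""
    out = []
    # Sorting by (userid, time) groups each user's rows and orders them by time;
    # the fixed-width timestamp strings sort chronologically.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for userid, group in groupby(ordered, key=itemgetter(0)):
        part = list(group)  # one partition
        for i, (_, stime, url) in enumerate(part):
            etime = part[i + 1][1] if i + 1 < len(part) else None
            out.append((userid, stime, etime, url))
    return out


result = with_next_time(log)
for row in result:
    print(row)
```

The last row of each user has no following row inside its partition, which is why etime is null there.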

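Compute how long the user stayed on each page, in seconds (in a real analysis the data needs cleansing at this point: a record showing that a user stayed on a page for 4 or 5 hours is certainly not usable).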

select userid,
       time stime,
       lead(time) over(partition by userid order by time) etime,
       -- Java date pattern: MM = month, HH = 24-hour clock (lowercase mm is minutes, hh is the 12-hour clock)
       unix_timestamp(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss')- unix_timestamp(time,'yyyy-MM-dd HH:mm:ss') period,
       url
  from test.user_log;

Result:

+---------+----------------------+----------------------+---------+-------+--+
| userid  |        stime         |        etime         | period  |  url  |
+---------+----------------------+----------------------+---------+-------+--+
| marry   | 2015-11-12 01:10:00  | 2015-11-12 01:15:10  | 310     | url1  |
| marry   | 2015-11-12 01:15:10  | 2015-11-12 01:16:40  | 90      | url2  |
| marry   | 2015-11-12 01:16:40  | 2015-11-12 02:13:00  | 3380    | url3  |
| marry   | 2015-11-12 02:13:00  | 2015-11-12 03:14:30  | 3690    | url4  |
| marry   | 2015-11-12 03:14:30  | null                 | null    | url5  |
| peter   | 2015-10-12 01:10:00  | 2015-10-12 01:15:10  | 310     | url1  |
| peter   | 2015-10-12 01:15:10  | 2015-10-12 01:16:40  | 90      | url2  |
| peter   | 2015-10-12 01:16:40  | 2015-10-12 02:13:00  | 3380    | url3  |
| peter   | 2015-10-12 02:13:00  | 2015-10-12 03:14:30  | 3690    | url4  |
| peter   | 2015-10-12 03:14:30  | null                 | null    | url5  |
+---------+----------------------+----------------------+---------+-------+--+
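
The period values can be checked by hand, and the cleansing step mentioned above can be sketched at the same time. The Python snippet below recomputes peter's intervals from the start and end times and then drops overly long stays. The one-hour cutoff MAX_STAY is a hypothetical value picked so that the filter has a visible effect on this small sample; a real threshold depends on the site.

```python
from datetime import datetime

# Python counterpart of the Java pattern yyyy-MM-dd HH:mm:ss
FMT = "%Y-%m-%d %H:%M:%S"

# peter's (stime, etime, url) rows from the result table; None stands for null.
rows = [
    ("2015-10-12 01:10:00", "2015-10-12 01:15:10", "url1"),
    ("2015-10-12 01:15:10", "2015-10-12 01:16:40", "url2"),
    ("2015-10-12 01:16:40", "2015-10-12 02:13:00", "url3"),
    ("2015-10-12 02:13:00", "2015-10-12 03:14:30", "url4"),
    ("2015-10-12 03:14:30", None, "url5"),
]


def period_seconds(stime, etime):
    """Seconds between two timestamps, or None when the end time is unknown."""
    if etime is None:
        return None
    delta = datetime.strptime(etime, FMT) - datetime.strptime(stime, FMT)
    return int(delta.total_seconds())


periods = [(url, period_seconds(s, e)) for s, e, url in rows]
print(periods)
# [('url1', 310), ('url2', 90), ('url3', 3380), ('url4', 3690), ('url5', None)]

MAX_STAY = 3600  # hypothetical cutoff in seconds, for illustration only
kept = [(url, p) for url, p in periods if p is not None and p <= MAX_STAY]
print(kept)
# [('url1', 310), ('url2', 90), ('url3', 3380)]
```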

Compute the total time spent on each page, and the total time each user spent on each page

select nvl(url,'-1') url,
       nvl(userid,'-1') userid,
       sum(period) total_period from (
select userid,
       time stime,
       lead(time) over(partition by userid order by time) etime,
       -- Java date pattern: MM = month, HH = 24-hour clock (lowercase mm is minutes, hh is the 12-hour clock)
       unix_timestamp(lead(time) over(partition by userid order by time),'yyyy-MM-dd HH:mm:ss')- unix_timestamp(time,'yyyy-MM-dd HH:mm:ss') period,
       url
  from test.user_log
) a group by url, userid with rollup;

Result:

+-------+---------+---------------+--+
|  url  | userid  | total_period  |
+-------+---------+---------------+--+
| -1    | -1      | 14940         |
| url1  | -1      | 620           |
| url1  | marry   | 310           |
| url1  | peter   | 310           |
| url2  | -1      | 180           |
| url2  | marry   | 90            |
| url2  | peter   | 90            |
| url3  | -1      | 6760          |
| url3  | marry   | 3380          |
| url3  | peter   | 3380          |
| url4  | -1      | 7380          |
| url4  | marry   | 3690          |
| url4  | peter   | 3690          |
| url5  | -1      | null          |
| url5  | marry   | null          |
| url5  | peter   | null          |
+-------+---------+---------------+--+
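
The rows marked -1 come from with rollup: besides the (url, userid) groups it adds one subtotal per url and one grand total, and nvl turns the NULL grouping keys into '-1'. A minimal Python sketch of that aggregation follows, assuming the per-row periods computed earlier; the helper names null_safe_add and rollup are illustrative only.

```python
from collections import defaultdict

# Per-user period for each url, taken from the interval result; None stands for null.
per_user = {"url1": 310, "url2": 90, "url3": 3380, "url4": 3690, "url5": None}
rows = [(url, user, p) for user in ("marry", "peter") for url, p in per_user.items()]


def null_safe_add(total, value):
    """Mimic sum(): NULL inputs are skipped, and the sum stays NULL if nothing was added."""
    if value is None:
        return total
    return value if total is None else total + value


def rollup(rows):
    totals = defaultdict(lambda: None)
    for url, user, period in rows:
        # Three grouping levels: (url, userid), url subtotal, grand total.
        for key in ((url, user), (url, "-1"), ("-1", "-1")):
            totals[key] = null_safe_add(totals[key], period)
    return dict(totals)


totals = rollup(rows)
for key in sorted(totals):
    print(key, totals[key])
```

The url5 rows stay null at every level because the only values feeding them are null, while the grand total simply skips them.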