Question
My current data looks like this (note that it is sorted on datetime):
+----------------+---------------------+---------+
| CustomerNumber | Date                | Channel |
+----------------+---------------------+---------+
| 120584446      | 2015-05-22 21:16:05 | A       |
| 120584446      | 2015-05-25 18:04:16 | A       |
| 120584446      | 2015-05-25 18:05:25 | B       |
| 120584446      | 2015-05-28 20:35:09 | A       |
| 120584446      | 2015-05-28 20:36:01 | A       |
| 120584446      | 2015-05-28 20:37:02 | B       |
| 120584446      | 2015-05-29 13:39:00 | B       |
+----------------+---------------------+---------+
I want to create a rank in Hive that is partitioned by customer number and restarts at 1 whenever the channel is A. It should look like this:
+----------------+---------------------+---------+------+
| CustomerNumber | Date                | Channel | Rank |
+----------------+---------------------+---------+------+
| 120584446      | 2015-05-22 21:16:05 | A       | 1    |
| 120584446      | 2015-05-25 18:04:16 | A       | 1    |
| 120584446      | 2015-05-25 18:05:25 | B       | 2    |
| 120584446      | 2015-05-28 20:35:09 | A       | 1    |
| 120584446      | 2015-05-28 20:36:01 | A       | 1    |
| 120584446      | 2015-05-28 20:37:02 | B       | 2    |
| 120584446      | 2015-05-29 13:39:00 | B       | 3    |
+----------------+---------------------+---------+------+
Answer 1:
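One approach is to use a cumulative conditional sum to identify the groups and then use row_number() for the ranking. The running count of 'A' rows goes up by one at every 'A', so each 'A' row starts a new group, and the rows that follow it (up to the next 'A') share that group: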
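select t.*,
       -- number the rows inside each (customer, group) pair, starting at 1
       row_number() over (partition by CustomerNumber, grp
                          order by date
                         ) as rank
from (select t.*,
             -- running count of 'A' rows per customer; each 'A' opens a new group
             sum(case when channel = 'A' then 1 else 0 end) over
                 (partition by CustomerNumber order by date) as grp
      from t
     ) t;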
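To check the logic against the sample data, here is a minimal plain-Python sketch of the same two steps. It is not Hive code, and the function name rank_with_reset is hypothetical, introduced only for illustration. It assumes the timestamps are ISO-formatted strings, so sorting them as text matches chronological order.

```python
from collections import defaultdict

# Sample rows from the question: (CustomerNumber, Date, Channel)
rows = [
    (120584446, "2015-05-22 21:16:05", "A"),
    (120584446, "2015-05-25 18:04:16", "A"),
    (120584446, "2015-05-25 18:05:25", "B"),
    (120584446, "2015-05-28 20:35:09", "A"),
    (120584446, "2015-05-28 20:36:01", "A"),
    (120584446, "2015-05-28 20:37:02", "B"),
    (120584446, "2015-05-29 13:39:00", "B"),
]


def rank_with_reset(rows, reset_value="A"):
    """Hypothetical helper: mimic the cumulative sum plus row_number() steps."""
    running = defaultdict(int)   # per-customer running count of reset rows (grp)
    position = defaultdict(int)  # row counter per (customer, grp) pair
    out = []
    # Same ordering as "partition by CustomerNumber order by date"
    for customer, date, channel in sorted(rows, key=lambda r: (r[0], r[1])):
        if channel == reset_value:
            running[customer] += 1          # an 'A' row opens a new group
        grp = running[customer]
        position[(customer, grp)] += 1      # row_number() within the group
        out.append((customer, date, channel, grp, position[(customer, grp)]))
    return out


ranked = rank_with_reset(rows)
for row in ranked:
    print(row)
```

For the sample data, grp comes out as 1, 2, 2, 3, 4, 4, 4 and the rank as 1, 1, 2, 1, 1, 2, 3, which matches the desired output. Rows that appear before a customer's first 'A' would all land in group 0 and be numbered from 1 there.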
Source: https://stackoverflow.com/questions/34355381/creating-a-rank-that-resets-on-a-specific-value-of-a-column