Optimizing Hive GROUP BY when rows are sorted

问题

I have the following (very simple) Hive query:

select user_id, event_id, min(time) as start, max(time) as end,
       count(*) as total, count(interaction == 1) as clicks
from events_all
group by user_id, event_id;

The table has the following structure:

user_id                 event_id                time            interaction 
Ex833Lli36nxTvGTA1Dv    juCUv6EnkVundBHSBzQevw  1430481530295   0
Ex833Lli36nxTvGTA1Dv    juCUv6EnkVundBHSBzQevw  1430481530295   1
n0w4uQhOuXymj5jLaCMQ    G+Oj6J9Q1nI1tuosq2ZM/g  1430512179696   0
n0w4uQhOuXymj5jLaCMQ    G+Oj6J9Q1nI1tuosq2ZM/g  1430512217124   0
n0w4uQhOuXymj5jLaCMQ    mqf38Xd6CAQtuvuKc5NlWQ  1430512179696   1

I know for a fact that rows are sorted first by user_id and then by event_id.

The question is: is there a way to "hint" the Hive engine to optimize the query given that rows are sorted? The purpose of optimization is to avoid keeping all groups in memory since its only necessary to keep one group at a time.

Right now this query running in a 6-node 16 GB Hadoop cluster with roughly 300 GB of data takes about 30 minutes and uses most of the RAM, choking the system. I know that each group will be small, no more than 100 rows per (user_id, event_id) tuple, so I think an optimized execution will probably have a very small memory footprint and also be faster (since there is no need to loopup group keys).

回答1:

Create a bucketed sorted table. The optimizer will know it sorted from metadata. See example here (official docs): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSortedTables

Count only interaction = 1: count(case when interaction=1 then 1 end) as clicks - case will mark all rows with 1 or null and count only 1s.

来源：https://stackoverflow.com/questions/41247839/optimizing-hive-group-by-when-rows-are-sorted

标签

sql

Hadoop

Hive

query-optimization

hiveql