How to add an integer unique id to query results - efficiently?


面向向阳花 2021-01-25 17:08

Given a query, select * from ... (that might be part of a CTAS statement)

The goal is to add an additional column, ID, where ID is a

4 Answers

没有蜡笔的小新 (OP)

2021-01-25 17:59

hive

set mapred.reduce.tasks=1000;
set hivevar:buckets=10000;

hivevar:buckets should be high relative to the number of reducers (mapred.reduce.tasks), so that the rows are distributed evenly among the reducers.

select  1 + x + (row_number() over (partition by x) - 1) * ${hivevar:buckets}  as id
       ,t.*

from   (select  t.*
               ,abs(hash(rand())) % ${hivevar:buckets} as x      

        from    t
        ) t
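
Why the ids produced by the query above cannot collide can be checked outside Hive. The following is only a small Python simulation of the arithmetic, not of Hive itself: `assign_ids` and `decode` are hypothetical helper names, and `random.randrange` stands in for `abs(hash(rand())) % buckets`. Each id encodes the pair (x, row_number), and that pair can be recovered from the id, so two different rows never get the same value.

```python
import random
from collections import defaultdict


def assign_ids(n_rows, buckets, seed=0):
    """Mimic: 1 + x + (row_number() over (partition by x) - 1) * buckets."""
    rng = random.Random(seed)
    row_number = defaultdict(int)  # per-bucket counter, like row_number() per partition
    ids = []
    for _ in range(n_rows):
        x = rng.randrange(buckets)  # stand-in for abs(hash(rand())) % buckets
        row_number[x] += 1
        ids.append(1 + x + (row_number[x] - 1) * buckets)
    return ids


def decode(id_, buckets):
    """Recover (x, row_number) from an id; this inverse is what makes ids unique."""
    return (id_ - 1) % buckets, (id_ - 1) // buckets + 1


ids = assign_ids(n_rows=100_000, buckets=10_000)
print(len(ids) == len(set(ids)))  # True: every (x, row_number) pair is distinct
```

Note that the ids are unique but not contiguous: the largest id is roughly the bucket count multiplied by the size of the fullest bucket.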

spark-sql

select  1 + x + (row_number() over (partition by x) - 1) * 10000  as id
       ,t.*

from   (select  t.*
               ,abs(hash(rand())) % 10000 as x      

        from    t
        ) t

For both hive and spark-sql

rand() is used to produce a good distribution.
If your query already has a column, or a combination of columns, with a good distribution (it may be unique, but it does not have to be), you can use it instead, e.g.:

select    1 + (abs(hash(col1,col)) % 10000) 
        + (row_number() over (partition by abs(hash(col1,col)) % 10000) - 1) * 10000  as id
       ,t.*

from    t
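
The point that the bucketing columns need a good distribution, but not uniqueness, can be illustrated with a small Python sketch. It is a simulation under stated assumptions: `ids_from_key` is a hypothetical helper, Python's built-in `hash` on integer tuples stands in for the SQL `hash(col1,col)`, and the rows are made-up sample data. Duplicate keys land in the same bucket and are told apart by the row number; a badly distributed key still yields unique ids, only much larger ones.

```python
from collections import defaultdict


def ids_from_key(rows, key, buckets=10_000):
    """Mimic: 1 + (abs(hash(key)) % buckets) + (row_number() - 1) * buckets."""
    row_number = defaultdict(int)  # per-bucket counter, like row_number() per partition
    ids = []
    for row in rows:
        x = abs(hash(key(row))) % buckets  # Python hash, not Hive's or Spark's hash
        row_number[x] += 1
        ids.append(1 + x + (row_number[x] - 1) * buckets)
    return ids


# Made-up rows whose (col1, col) combination repeats many times.
rows = [(i % 50, i % 7) for i in range(1000)]

spread = ids_from_key(rows, key=lambda r: (r[0], r[1]))  # reasonably distributed key
skewed = ids_from_key(rows, key=lambda r: 0)             # worst case: one bucket only

print(len(set(spread)) == len(rows))  # True: duplicate keys still get distinct ids
print(max(spread) < max(skewed))      # True: skew inflates the largest id
```

With the constant key every row falls into a single bucket, so that partition is processed by one reducer and the ids grow in steps of 10000. This is the situation the good-distribution requirement is meant to avoid.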
