Hive: How to do a SELECT query to output a unique primary key using HiveQL?

Question

I have the following dataset, which I want to transform into a table that can be exported to SQL. I am using Hive. The input is as follows:

call_id,stat1,stat2,stat3
1,a,b,c,
2,x,y,z,
3,d,e,f,
1,j,k,l,

The output table needs to have call_id as its primary key so it needs to be unique. The output schema should be

call_id,stat2,stat3,
1,b,c, or (1,k,l)
2,y,z,
3,e,f,

The problem is that when I use the DISTINCT keyword in a Hive query, DISTINCT applies to all the columns combined. I want to apply the DISTINCT operation only to call_id. Something along the lines of

SELECT DISTINCT(call_id), stat2,stat3 from intable;

However, this is not valid in Hive (I am not well versed in SQL either).

The only legal query seems to be

SELECT DISTINCT call_id, stat2,stat3 from intable;

But this returns multiple rows with the same call_id, because the other columns differ and so each row as a whole is distinct.

NOTE: There is no arithmetic relation between a,b,c,x,y,z, etc. So any trick of averaging or summing is not viable.

Any ideas how I can do this?

Answer 1:

One quick idea, not the best one, but it will do the job:

hive>create table temp1(a int,b string);

hive>insert overwrite table temp1

select call_id,max(concat(stat1,'|',stat2,'|',stat3)) from intable group by call_id;

hive>insert overwrite table intable

select a,split(b,'\\|')[0],split(b,'\\|')[1],split(b,'\\|')[2] from temp1;

split() takes a regular expression, so the pipe separator has to be escaped as '\\|'.
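The same pack-and-split idea can also be written as a single query that returns only the requested columns, without the temporary table and without overwriting intable. This is a sketch rather than a tested drop-in: the aliases t and packed are made up for illustration, and it assumes that none of the stat values contains a '|' character.

```sql
-- Sketch: one row per call_id. max() over the packed string picks the
-- values of a single source row for each key.
-- Assumption: '|' never occurs inside stat1, stat2 or stat3.
SELECT t.call_id,
       split(t.packed, '\\|')[1] AS stat2,  -- second field of the packed string
       split(t.packed, '\\|')[2] AS stat3   -- third field of the packed string
FROM (
  SELECT call_id,
         max(concat(stat1, '|', stat2, '|', stat3)) AS packed
  FROM intable
  GROUP BY call_id
) t;
```

For the sample input, max() keeps 'j|k|l' rather than 'a|b|c' for call_id 1, so that key comes out as 1,k,l.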

Answer 2:

"I want to apply the DISTINCT operation only to the call_id"

But then how will Hive know which row to eliminate?

Without knowing the amount of data or the size of the stat fields you have, the following query can do the job:

select distinct i1.call_id, i1.stat2, i1.stat3 from (
  select call_id, MIN(concat(stat1, stat2, stat3)) as smin 
  from intable group by call_id
) i2 join intable i1 on i1.call_id = i2.call_id 
  AND concat(i1.stat1, i1.stat2, i1.stat3) = i2.smin;
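Another way to tell Hive which row to keep is to rank the rows within each call_id and keep only the first one. The sketch below assumes a Hive version that supports windowing functions (0.11 or later). Ordering by stat1, stat2, stat3 is an arbitrary choice made here for illustration, and the aliases ranked and rn are hypothetical names.

```sql
-- Sketch: keep exactly one row per call_id using row_number().
-- Assumption: Hive 0.11+ (windowing functions available).
SELECT call_id, stat2, stat3
FROM (
  SELECT call_id, stat2, stat3,
         -- number the rows of each call_id; the ORDER BY decides which row wins
         row_number() OVER (PARTITION BY call_id
                            ORDER BY stat1, stat2, stat3) AS rn
  FROM intable
) ranked
WHERE rn = 1;
```

With the sample input this keeps 1,b,c for call_id 1, because 'a' sorts before 'j'. Changing the ORDER BY changes which row survives, and no string concatenation is needed.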

Source: https://stackoverflow.com/questions/15023661/hive-how-to-do-a-select-query-to-output-a-unique-primary-key-using-hiveql

Tags

select

Hadoop

distinct

Hive