问题
How to compare two tables with same structures in hive. I believe minus will not work in hive.
SRC table:
id name
1 A
2 B
3 C
TGT table:
id name
1 A
2 C
3 C
can anyone help me with a query.
回答1:
If you are looking for equality between two tables and for differences if any, you can do like following
SELECT MIN(TableName) as TableName, ID, NAME
FROM
(
SELECT 'SRC_TABLE' as TableName, A.ID, A.NAME
FROM A
UNION ALL
SELECT 'TGT_TABLE' as TableName, B.ID, B.NAME
FROM B
) tmp
GROUP BY ID, NAME
HAVING COUNT(*) = 1
ORDER BY ID
Basically what the above query does is as follows,
Combines all the records from both the tables & groups them by all columns. Later prints out only those records which are unique [having count(*) = 1].
If the above query prints zero records then it means both the tables are perfectly equal
- If the above query prints any records at all, then it means those particular records are not present in one or the other tables
For the above example you listed the output is as follows,
[ TBL_NAME | ID | NAME ]----------------------------[ SRC_TABLE | 2 | B ][ TGT_TABLE | 2 | C ]
Credits: I found the above solution from this website. I tried it, and it was fantastic!
回答2:
select id
,SRC_name
,TGT_name
from (select id
,min (case tab when 'S' then name end) as SRC_name
,min (case tab when 'T' then name end) as TGT_name
,count (case tab when 'S' then 1 end) as SRC_count
,count (case tab when 'T' then 1 end) as TGT_count
from ( select 'S' as tab,id,name from SRC
union all select 'T' as tab,id,name from TGT
) t
group by id
) t
having SRC_count = 0
or TGT_count = 0
or SRC_name != TGT_name
;
If name might be NULL or id might be non-unique, additional conditions should be added
回答3:
To do a full comparison of 2 tables, my experience has shown me that using some checksum mechanism is a more effective and reliable solution than Joining the tables (which gives some difficulties when keys are repeated for instance).
You could have a look at this Python program that handles such comparisons of Hive tables (comparing all the rows and all the columns), and would show you in a webpage the differences that might appear: https://github.com/bolcom/hive_compared_bq
回答4:
There is a free tool (dataq.io) which can be used to compare data between two tables in hive. It can visually display the differences.
Disclaimer : I work for dataq.io
回答5:
Minus does not exists in Hive. You can use left join or full join for find different records from both tables.
LEFT JOIN: - allows to find records that not exist in TGT
select a.id, a.name
from TGT a left join TGT b on a.id=b.id and a.name=b.name
where b.id is null
The same may be achieved using NOT EXISTS
select a.id, a.name
from TGT a
where not exists(select 1 from TGT b where a.id=b.id and a.name=b.name)
FULL JOIN: - allows to find differences from both the tables
select a.id as SRC_ID, a.name as SRC_NAME ,
b.id as TGT_ID, b.name as TGT_NAME
from TGT a full join TGT b on a.id=b.id and a.name=b.name
where a.id is null OR b.id is null
来源:https://stackoverflow.com/questions/42489823/compare-data-between-two-tables-with-same-structure-in-hive