Compare data between two tables with the same structure in Hive

丶灬走出姿态, submitted on 2019-12-12 16:42:45

Question


How can I compare two tables with the same structure in Hive? I believe MINUS will not work in Hive.

SRC table:

id name
1   A
2   B
3   C

TGT table:

id name
1   A
2   C
3   C

Can anyone help me with a query?


Answer 1:


If you want to check whether two tables are equal and list the differences, if any, you can do the following:

SELECT MIN(TableName) AS TableName, ID, NAME
FROM (
    -- Tag each row with the table it came from
    SELECT 'SRC_TABLE' AS TableName, A.ID, A.NAME FROM SRC A
    UNION ALL
    SELECT 'TGT_TABLE' AS TableName, B.ID, B.NAME FROM TGT B
) tmp
GROUP BY ID, NAME
HAVING COUNT(*) = 1   -- keep rows that appear in only one table
ORDER BY ID

The query works as follows:

  • It combines all records from both tables and groups them by every column, then returns only the groups that occur exactly once [HAVING COUNT(*) = 1].

  • If the query returns zero rows, the two tables are identical, provided neither table contains duplicate rows.

  • If the query returns any rows, each of those records is present in only one of the two tables.

  • For the example in the question, the output is as follows:

    TableName | ID | NAME
    ----------+----+-----
    SRC_TABLE | 2  | B
    TGT_TABLE | 2  | C
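
To check the query against the sample data without a Hive cluster, here is a minimal sketch that runs the same SQL in an in-memory SQLite database from Python. SQLite is only a stand-in here, an assumption made for testing; the table names SRC and TGT come from the question, and the ordering by name as a tie-breaker is added so the result is deterministic.

```python
import sqlite3

# In-memory SQLite stands in for Hive (an assumption for local testing only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SRC (id INTEGER, name TEXT);
    CREATE TABLE TGT (id INTEGER, name TEXT);
    INSERT INTO SRC VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO TGT VALUES (1, 'A'), (2, 'C'), (3, 'C');
""")

# Tag each row with its source table, then keep only rows seen exactly once.
diff_query = """
    SELECT MIN(TableName) AS TableName, id, name
    FROM (
        SELECT 'SRC_TABLE' AS TableName, id, name FROM SRC
        UNION ALL
        SELECT 'TGT_TABLE' AS TableName, id, name FROM TGT
    ) tmp
    GROUP BY id, name
    HAVING COUNT(*) = 1
    ORDER BY id, name
"""
diff_rows = conn.execute(diff_query).fetchall()
for row in diff_rows:
    print(row)
```

One design caveat: a row that appears twice in one table and not at all in the other also has COUNT(*) = 2 and is therefore not reported, so this approach is most reliable when each table has no duplicate rows.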

Credits: I found the above solution from this website. I tried it, and it was fantastic!




Answer 2:


select      id
           ,SRC_name
           ,TGT_name

from       (select      id
                       ,min   (case tab when 'S' then name end) as SRC_name
                       ,min   (case tab when 'T' then name end) as TGT_name
                       ,count (case tab when 'S' then 1    end) as SRC_count 
                       ,count (case tab when 'T' then 1    end) as TGT_count 

            -- tag each row with its source table: 'S' = SRC, 'T' = TGT
            from        (           select 'S' as tab,id,name from SRC
                        union all   select 'T' as tab,id,name from TGT
                        ) u

            group by    id  
            ) t

-- the outer query has no GROUP BY, so filter with WHERE rather than HAVING
where       SRC_count = 0
        or  TGT_count = 0
        or  SRC_name != TGT_name
;

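If name might be NULL or id might be non-unique, additional conditions should be added. For example, SRC_name != TGT_name evaluates to NULL when either side is NULL, so such rows are silently skipped; a NULL-safe comparison such as NOT (SRC_name <=> TGT_name) catches them.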
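
The NULL case can be illustrated with a small sketch that runs the per-id pivot query in an in-memory SQLite database from Python. SQLite is an assumed stand-in for Hive, the sample data is modified so that id 3 has a NULL name in SRC, and SQLite's `IS NOT` plays the role of a NULL-safe inequality (in Hive the equivalent would be written with the `<=>` operator).

```python
import sqlite3

# In-memory SQLite stands in for Hive (an assumption for local testing only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SRC (id INTEGER, name TEXT);
    CREATE TABLE TGT (id INTEGER, name TEXT);
    INSERT INTO SRC VALUES (1, 'A'), (2, 'B'), (3, NULL);
    INSERT INTO TGT VALUES (1, 'A'), (2, 'C'), (3, 'C');
""")

# One output row per id, with the name from each side next to each other.
pivot = """
    SELECT id, SRC_name, TGT_name
    FROM (
        SELECT id,
               MIN(CASE tab WHEN 'S' THEN name END) AS SRC_name,
               MIN(CASE tab WHEN 'T' THEN name END) AS TGT_name,
               COUNT(CASE tab WHEN 'S' THEN 1 END)  AS SRC_count,
               COUNT(CASE tab WHEN 'T' THEN 1 END)  AS TGT_count
        FROM (
            SELECT 'S' AS tab, id, name FROM SRC
            UNION ALL
            SELECT 'T' AS tab, id, name FROM TGT
        ) u
        GROUP BY id
    ) t
    WHERE SRC_count = 0 OR TGT_count = 0 OR {name_differs}
    ORDER BY id
"""

# Plain inequality: NULL != 'C' evaluates to NULL, so id 3 is not reported.
plain = conn.execute(pivot.format(name_differs="SRC_name != TGT_name")).fetchall()

# NULL-safe inequality (SQLite spelling): id 3 is reported as a difference.
null_safe = conn.execute(pivot.format(name_differs="SRC_name IS NOT TGT_name")).fetchall()

print(plain)
print(null_safe)
```

Non-unique ids need separate handling, because MIN collapses several names per id into one value.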




Answer 3:


For a full comparison of two tables, my experience is that a checksum mechanism is more effective and reliable than joining the tables (joins become awkward when keys are repeated, for instance).

You could have a look at this Python program, which handles such comparisons of Hive tables (comparing all rows and all columns) and shows any differences on a web page: https://github.com/bolcom/hive_compared_bq
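
To make the checksum idea concrete, here is a minimal sketch in plain Python. It is not how hive_compared_bq works internally; it only illustrates the general technique of hashing every row and comparing the multisets of hashes. The function names, the field separator and the NULL sentinel are all hypothetical choices made for this example.

```python
import hashlib
from collections import Counter


def row_checksum(row):
    """Hash one row; fields are joined with a unit separator.

    The "\\N" sentinel for NULL is an assumption; pick one that cannot
    occur as real data in your tables.
    """
    encoded = "\x1f".join("\\N" if value is None else str(value) for value in row)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def compare_by_checksum(src_rows, tgt_rows):
    """Return (rows only in SRC, rows only in TGT), counting duplicates."""
    src = Counter(row_checksum(r) for r in src_rows)
    tgt = Counter(row_checksum(r) for r in tgt_rows)
    only_src = src - tgt  # multiset difference keeps repeated rows
    only_tgt = tgt - src
    return sum(only_src.values()), sum(only_tgt.values())


src_rows = [(1, "A"), (2, "B"), (3, "C")]
tgt_rows = [(1, "A"), (2, "C"), (3, "C")]
result = compare_by_checksum(src_rows, tgt_rows)
print(result)
```

Because the comparison uses multisets rather than a join key, repeated keys and duplicate rows are counted correctly instead of multiplying join results.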




Answer 4:


There is a free tool (dataq.io) that can be used to compare data between two tables in Hive. It can display the differences visually.

Disclaimer: I work for dataq.io




Answer 5:


MINUS does not exist in Hive. You can use a LEFT JOIN or a FULL JOIN to find records that differ between the two tables.

LEFT JOIN: finds records in SRC that do not exist in TGT

select a.id, a.name 
  from SRC a left join TGT b on a.id=b.id and a.name=b.name
where b.id is null

The same may be achieved using NOT EXISTS:

select a.id, a.name 
      from SRC a 
where not exists(select 1 from  TGT b where a.id=b.id and a.name=b.name)
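
As a quick check that the LEFT JOIN and NOT EXISTS forms agree, the sketch below runs both against the sample data from the question, taking SRC as the left-hand table. It uses an in-memory SQLite database from Python as an assumed stand-in for Hive.

```python
import sqlite3

# In-memory SQLite stands in for Hive (an assumption for local testing only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SRC (id INTEGER, name TEXT);
    CREATE TABLE TGT (id INTEGER, name TEXT);
    INSERT INTO SRC VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO TGT VALUES (1, 'A'), (2, 'C'), (3, 'C');
""")

# Anti-join: SRC rows with no matching (id, name) pair in TGT.
left_join = """
    SELECT a.id, a.name
    FROM SRC a LEFT JOIN TGT b ON a.id = b.id AND a.name = b.name
    WHERE b.id IS NULL
    ORDER BY a.id
"""

# Same result expressed with a correlated subquery.
not_exists = """
    SELECT a.id, a.name
    FROM SRC a
    WHERE NOT EXISTS (SELECT 1 FROM TGT b WHERE a.id = b.id AND a.name = b.name)
    ORDER BY a.id
"""

left_join_rows = conn.execute(left_join).fetchall()
not_exists_rows = conn.execute(not_exists).fetchall()
print(left_join_rows)
print(not_exists_rows)
```

Both forms only report rows missing from TGT; swapping the table roles gives the rows missing from SRC.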

FULL JOIN: finds the differences on both sides

select a.id as SRC_ID, a.name as SRC_NAME ,
       b.id as TGT_ID, b.name as TGT_NAME
  from SRC a full join TGT b on a.id=b.id and a.name=b.name
where a.id is null OR b.id is null


Source: https://stackoverflow.com/questions/42489823/compare-data-between-two-tables-with-same-structure-in-hive
