Hive Full Outer Join Returning multiple rows for same Join Key

本小妞迷上赌 submitted on 2021-02-05 12:18:07

Question


I am doing a full outer join of 4 tables on the same column, and I want exactly one row for each distinct value of the join column.

Inputs are:

employee1
+---------------------+-----------------+--+
| employee1.personid  | employee1.name  |
+---------------------+-----------------+--+
| 111                 | aaa             |
| 222                 | bbb             |   
| 333                 | ccc             | 
+---------------------+-----------------+--+
employee2
+---------------------+----------------+--+
| employee2.personid  | employee2.sal  |
+---------------------+----------------+--+
| 111                 | 2              |
| 200                 | 3              |
+---------------------+----------------+--+
employee3
+---------------------+------------------+--+
| employee3.personid  | employee3.place  |
+---------------------+------------------+--+
| 111                 | bbsr             |
| 300                 | atl              |
| 200                 | ny               |
+---------------------+------------------+--+
employee4
+---------------------+---------------+--+
| employee4.personid  | employee4.dt  |
+---------------------+---------------+--+
| 111                 | 2019-02-21    |
| 300                 | 2019-03-18    |
| 400                 | 2019-03-18    |
+---------------------+---------------+--+

Expected result: one record for each personid, so there should be 6 records in total (111, 222, 333, 200, 300, 400), like this:

+-----------+---------+--------+----------+-------------+--+
| personid  | f.name  | u.sal  | v.place  |   v_in.dt   |
+-----------+---------+--------+----------+-------------+--+
| 111       | aaa     | 2      | bbsr     | 2019-02-21  |
| 200       | NULL    | 3      | ny       | NULL        |
| 222       | bbb     | NULL   | NULL     | NULL        |
| 300       | NULL    | NULL   | atl      | 2019-03-18  |
| 333       | ccc     | NULL   | NULL     | NULL        |
| 400       | NULL    | NULL   | NULL     | 2019-03-18  |
+-----------+---------+--------+----------+-------------+--+

The result I am getting is:

+-----------+---------+--------+----------+-------------+--+
| personid  | f.name  | u.sal  | v.place  |   v_in.dt   |
+-----------+---------+--------+----------+-------------+--+
| 111       | aaa     | 2      | bbsr     | 2019-02-21  |
| 200       | NULL    | 3      | NULL     | NULL        |
| 200       | NULL    | NULL   | ny       | NULL        |
| 222       | bbb     | NULL   | NULL     | NULL        |
| 300       | NULL    | NULL   | atl      | NULL        |
| 300       | NULL    | NULL   | NULL     | 2019-03-18  |
| 333       | ccc     | NULL   | NULL     | NULL        |
| 400       | NULL    | NULL   | NULL     | 2019-03-18  |
+-----------+---------+--------+----------+-------------+--+

Query used:

select coalesce(f.personid, u.personid, v.personid, v_in.personid) as personid,f.name,u.sal,v.place,v_in.dt
from employee1 f FULL OUTER JOIN employee2 u on f.personid=u.personid
FULL OUTER JOIN employee3 v on f.personid=v.personid
FULL OUTER JOIN employee4 v_in on f.personid=v_in.personid;

Please suggest how to generate the expected result.


Answer 1:


A full outer join is tricky because you have to take the NULLs produced by the earlier joins into account. But you can do:

-- Match each later table against any of the keys joined so far, not only f.personid
select coalesce(f.personid, u.personid, v.personid, v_in.personid) as personid,f.name,u.sal,v.place,v_in.dt
from employee1 f FULL OUTER JOIN
     employee2 u
     on f.personid = u.personid FULL OUTER JOIN
     employee3 v
     on v.personid in (f.personid, u.personid) FULL OUTER JOIN
     employee4 v_in
     on v_in.personid in (f.personid, u.personid, v.personid);

In databases that support USING for joins (instead of ON) this is simpler. I don't think that Hive supports USING, though.
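
The IN conditions above can also be written with COALESCE, which keeps every ON clause a plain equality. The following is only a sketch under that assumption, using the table and column names from the question. It relies on the fact that after the first join f.personid and u.personid are either equal or one of them is NULL, so the first non-NULL key identifies the row. Whether a given Hive version accepts it has not been verified here.

```sql
-- Sketch: join each later table on the first non-NULL key seen so far.
SELECT COALESCE(f.personid, u.personid, v.personid, v_in.personid) AS personid,
       f.name, u.sal, v.place, v_in.dt
FROM employee1 f
FULL OUTER JOIN employee2 u
  ON f.personid = u.personid
FULL OUTER JOIN employee3 v
  -- f.personid is NULL for keys missing from employee1, so fall back to u.personid
  ON COALESCE(f.personid, u.personid) = v.personid
FULL OUTER JOIN employee4 v_in
  ON COALESCE(f.personid, u.personid, v.personid) = v_in.personid;
```

On the sample data, 200 from employee3 now matches the row contributed by employee2, and 300 from employee4 matches the row contributed by employee3, so each personid should come out once.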




Answer 2:


A FULL JOIN returns all joined rows, plus all unjoined rows from the left table, plus all unjoined rows from the right table. You are joining employee2, employee3 and employee4 to the same employee1 table, which does not contain personid=200 (or 300). For those keys f.personid is NULL, so the rows from the later tables never match on f.personid and each one is returned as a separate unjoined row.
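
To see which keys are affected, one option is to wrap the original query and count rows per key. This is a minimal diagnostic sketch, not part of the original answer; the alias joined and the column name row_count are made up for illustration.

```sql
-- Sketch: list the join keys that the original query returns more than once.
SELECT personid, COUNT(*) AS row_count
FROM (
  SELECT COALESCE(f.personid, u.personid, v.personid, v_in.personid) AS personid
  FROM employee1 f
  FULL OUTER JOIN employee2 u    ON f.personid = u.personid
  FULL OUTER JOIN employee3 v    ON f.personid = v.personid
  FULL OUTER JOIN employee4 v_in ON f.personid = v_in.personid
) joined
GROUP BY personid
HAVING COUNT(*) > 1;
```

With the sample tables from the question this should report 200 and 300, the two keys that appear in more than one table but not in employee1.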

I'd suggest combining all four tables with UNION ALL, providing NULLs for the missing fields, and then aggregating with GROUP BY personid:

select personid, max(name) name, max(sal) sal, max(place) place, max(dt) dt 
from 
(
select  personid, name, NULL sal, NULL place, NULL dt from employee1  e1
UNION ALL
select  personid, NULL name, sal, NULL place, NULL dt from employee2  e2
UNION ALL
select  personid, NULL name, NULL sal, place, NULL dt from employee3  e3
UNION ALL
select  personid, NULL name, NULL sal, NULL place, dt from employee4  e4
)s
group by personid;

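This should perform better than the joins, since it needs a single aggregation on personid instead of three successive joins. Note that it assumes each personid appears at most once per table; otherwise max() keeps only one value per column.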



Source: https://stackoverflow.com/questions/55225131/hive-full-outer-join-returning-multiple-rows-for-same-join-key
