Hive LEFT SEMI JOIN for 'NOT EXISTS'

白昼怎懂夜的黑 submitted on 2019-12-18 20:05:15

Question


I have two tables, each with a single key column. The keys in table a are a subset of the keys in table b. I need to select the keys from table b that are NOT in table a.

Here is a citation from Hive manual: "LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc."

They use this example for illustration:

    SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);

is equivalent to

    SELECT a.key, a.value FROM a LEFT SEMI JOIN b ON (a.key = b.key);

However, what I need is the first example with 'NOT IN'. Unfortunately this syntax is not supported in Hive 0.13. It's for illustration only:

    SELECT a.key, a.value FROM a WHERE a.key NOT IN (SELECT b.key FROM b);

I searched this site for recommendations, and saw this example:

    SELECT a.key FROM a LEFT OUTER JOIN b ON a.key = b.key WHERE b.key IS NULL;

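It does not work as expected. When I combine the rows where a.key is NOT in b with the rows where a.key is IN b, I don't get the original a back. Maybe that is because this query cannot do the trick: note the restriction in the manual quote above that the right-hand-side table should not be referenced in WHERE, and b.key appears there.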

What should I do then? Any other trick? Thanks!

P.S. I cannot share any real data; it's a pretty simple example, where keys in a are all included in b and a is a subset of b.


Answer 1:


If you want results from table b, perhaps you can do the following instead?

    SELECT b.key FROM b LEFT OUTER JOIN a ON b.key = a.key WHERE a.key IS NULL;
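To see what this query returns, here is a minimal sketch that runs the same statement against an in-memory SQLite database from Python. SQLite is only a stand-in for Hive here, and the helper name `keys_only_in_b` and the sample keys are made up for illustration; the join semantics are standard SQL, but Hive itself is not exercised.

```python
import sqlite3


def keys_only_in_b(a_keys, b_keys):
    """Return the keys of b that have no match in a (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive (assumption)
    conn.execute("CREATE TABLE a (key INTEGER)")
    conn.execute("CREATE TABLE b (key INTEGER)")
    conn.executemany("INSERT INTO a VALUES (?)", [(k,) for k in a_keys])
    conn.executemany("INSERT INTO b VALUES (?)", [(k,) for k in b_keys])
    # Same query as in the answer; ORDER BY is added only for stable output.
    rows = conn.execute(
        "SELECT b.key FROM b LEFT OUTER JOIN a ON b.key = a.key "
        "WHERE a.key IS NULL ORDER BY b.key"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


# a is a subset of b, as described in the question.
print(keys_only_in_b([1, 2], [1, 2, 3, 4]))  # [3, 4]
```

Every row of b is kept by the outer join; rows without a partner in a get a NULL a.key, and the WHERE clause keeps exactly those.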



Answer 2:


The answer to your issue should be

    SELECT a.key FROM a LEFT OUTER JOIN b ON a.key = b.key WHERE b.key IS NULL;

This means: bring all the keys from a, irrespective of whether there is a match in b or not. The WHERE clause then keeps only those records that have no match in b.
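The direction of the join matters for this pattern. The sketch below, again using an in-memory SQLite database as a stand-in for Hive, runs the query in both directions on made-up keys where a is a subset of b, as the question describes. The helpers `make_tables` and `unmatched_keys` are hypothetical names introduced for this example.

```python
import sqlite3


def make_tables(a_keys, b_keys):
    """Create in-memory tables a and b, each with a single key column."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive (assumption)
    conn.execute("CREATE TABLE a (key INTEGER)")
    conn.execute("CREATE TABLE b (key INTEGER)")
    conn.executemany("INSERT INTO a VALUES (?)", [(k,) for k in a_keys])
    conn.executemany("INSERT INTO b VALUES (?)", [(k,) for k in b_keys])
    return conn


def unmatched_keys(conn, left, right):
    """Keys of table `left` that have no match in table `right`."""
    # Table names are interpolated into the SQL, so only a and b are accepted.
    if {left, right} != {"a", "b"}:
        raise ValueError("expected the table names 'a' and 'b'")
    sql = (
        f"SELECT {left}.key FROM {left} "
        f"LEFT OUTER JOIN {right} ON {left}.key = {right}.key "
        f"WHERE {right}.key IS NULL ORDER BY {left}.key"
    )
    return [row[0] for row in conn.execute(sql)]


conn = make_tables([1, 2], [1, 2, 3, 4])
print(unmatched_keys(conn, "a", "b"))  # [] because every key of a is in b
print(unmatched_keys(conn, "b", "a"))  # [3, 4]
```

With a as the left table the result is empty whenever a is a subset of b, which may be what the question observed; putting b on the left yields the keys that are missing from a.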




Answer 3:


Or you can try

    SELECT a.key FROM a LEFT ANTI JOIN b ON a.key = b.key;
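An anti join keeps the left-hand rows that have no match on the right. The same result can be written with a correlated NOT EXISTS subquery, which the manual quote in the question lists as supported from Hive 0.13. The sketch below uses that form because SQLite, used here from Python as a stand-in for Hive, has no LEFT ANTI JOIN syntax; the function name `anti_join_not_exists` and the sample keys are hypothetical.

```python
import sqlite3


def anti_join_not_exists(a_keys, b_keys):
    """Return keys of a with no match in b, using NOT EXISTS (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive (assumption)
    conn.execute("CREATE TABLE a (key INTEGER)")
    conn.execute("CREATE TABLE b (key INTEGER)")
    conn.executemany("INSERT INTO a VALUES (?)", [(k,) for k in a_keys])
    conn.executemany("INSERT INTO b VALUES (?)", [(k,) for k in b_keys])
    # Correlated subquery: a row of a survives only if no row of b shares its key.
    rows = conn.execute(
        "SELECT a.key FROM a "
        "WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.key = a.key) "
        "ORDER BY a.key"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(anti_join_not_exists([1, 2, 5], [1, 2, 3]))  # [5]
```

To get the keys of b that are missing from a, as the question asks, the roles of the two tables would be swapped.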



Answer 4:


I tried LEFT SEMI JOIN in place of IN on CDH 5.7.0 with Spark 1.6.

The LEFT SEMI JOIN gave incorrect results, which did not match those of IN with a subquery.



Source: https://stackoverflow.com/questions/25041026/hive-left-semi-join-for-not-exists
