Question
I have two SQL queries producing different results when I would expect them to produce the same result. I am trying to find the number of events that do not have a corresponding location. All locations have an event but events can also link to non-location records.
The following query produces a count of 16244, the correct value.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE NOT EXISTS
(SELECT * FROM locations AS l WHERE l.event_id = e.event_id)
The following query produces a count of 0.
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The following SQL summarizes the data set:
SELECT 'Event Count',
COUNT(DISTINCT event_id)
FROM events
UNION ALL
SELECT 'Locations Count',
COUNT(DISTINCT event_id)
FROM locations
UNION ALL
SELECT 'Event+Location Count',
COUNT(DISTINCT l.event_id)
FROM locations AS l JOIN events AS e ON l.event_id = e.event_id
It returns the following results:
Event Count            139599
Locations Count        123355
Event+Location Count   123355
Can anyone shed any light on why the two initial queries do not produce the same figure?
Answer 1:
You have a NULL in the result of the subquery SELECT l.event_id FROM locations AS l, so NOT IN can never evaluate to true (it is either false or unknown) and the query returns 0 rows:
SELECT COUNT(DISTINCT e.event_id)
FROM events AS e
WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
The reason for this behaviour can be seen from the example below.
'x' NOT IN (NULL,'a','b')
≡ 'x' <> NULL and 'x' <> 'a' and 'x' <> 'b'
≡ Unknown and True and True
≡ Unknown
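The effect can be reproduced on a small data set. The following is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table contents (three events, one matching location, and one location row with a NULL event_id) are invented for illustration and are not the asker's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER);
    CREATE TABLE locations (event_id INTEGER);
    INSERT INTO events VALUES (1), (2), (3);
    -- One matching location plus one row whose event_id is NULL.
    INSERT INTO locations VALUES (1), (NULL);
""")

# NOT EXISTS: the correlated comparison simply finds no row for events 2 and 3.
not_exists_count = conn.execute("""
    SELECT COUNT(DISTINCT e.event_id)
    FROM events AS e
    WHERE NOT EXISTS
        (SELECT * FROM locations AS l WHERE l.event_id = e.event_id)
""").fetchone()[0]

# NOT IN: the NULL in the subquery result keeps the predicate from ever being true.
not_in_count = conn.execute("""
    SELECT COUNT(DISTINCT e.event_id)
    FROM events AS e
    WHERE e.event_id NOT IN (SELECT l.event_id FROM locations AS l)
""").fetchone()[0]

# The bare predicate: unknown is reported as NULL, which Python sees as None.
unknown = conn.execute("SELECT 2 NOT IN (1, NULL)").fetchone()[0]

print(not_exists_count, not_in_count, unknown)
```

With this sample data NOT EXISTS counts the two unmatched events, while NOT IN counts none, mirroring the 16244 versus 0 difference in the question.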
Answer 2:
The NOT IN form works differently for NULLs. The presence of a single NULL in the subquery result means the NOT IN predicate can never evaluate to true, so no rows are returned.
So you have at least one event_id in locations that is NULL.
Also, your query might be better written as a join:
SELECT
COUNT(DISTINCT e.event_id)
FROM
events AS e
LEFT JOIN locations AS l ON e.event_id = l.event_id
WHERE
l.event_id IS NULL
[UPDATE: apparently, the NOT EXISTS version is faster.]
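To check for the NULL and compare the alternatives, something along these lines could be used. This is a sketch only, run against an in-memory SQLite database through Python's sqlite3 module; the sample rows and the `scalar` helper are hypothetical, and filtering NULLs out of the NOT IN subquery is one possible workaround rather than something the answer itself proposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER);
    CREATE TABLE locations (event_id INTEGER);
    INSERT INTO events VALUES (1), (2), (3), (4);
    INSERT INTO locations VALUES (1), (2), (NULL);
""")

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# Diagnostic: how many location rows have no event_id at all?
null_rows = scalar("SELECT COUNT(*) FROM locations WHERE event_id IS NULL")

# Anti-join: keep events for which the LEFT JOIN found no location row.
left_join_count = scalar("""
    SELECT COUNT(DISTINCT e.event_id)
    FROM events AS e
    LEFT JOIN locations AS l ON e.event_id = l.event_id
    WHERE l.event_id IS NULL
""")

# NOT IN with the NULLs filtered out of the subquery.
filtered_not_in_count = scalar("""
    SELECT COUNT(DISTINCT e.event_id)
    FROM events AS e
    WHERE e.event_id NOT IN
        (SELECT l.event_id FROM locations AS l WHERE l.event_id IS NOT NULL)
""")

print(null_rows, left_join_count, filtered_not_in_count)
```

On this sample data the diagnostic finds one NULL row, and both rewritten queries count the same two unmatched events.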
Answer 3:
IN and EXISTS are processed very differently.
Select * from T1 where x in ( select y from T2 )
is typically processed as:
select *
from t1, ( select distinct y from t2 ) t2
where t1.x = t2.y;
The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table -- typically.
As opposed to
select * from t1 where exists ( select null from t2 where y = x )
That is processed more like:
for x in ( select * from t1 )
loop
    if ( exists ( select null from t2 where y = x.x ) )
    then
        OUTPUT THE RECORD
    end if
end loop
The EXISTS form always results in a full scan of T1, whereas the IN form can make use of an index on T1(x).
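The two strategies described above can be mimicked in plain Python. This is only an illustrative sketch: the lists `t1` and `t2` and both function names are hypothetical, a dictionary stands in for the index on T1(x), and a real optimizer chooses its own plan and may rewrite either form.

```python
# Hypothetical stand-ins for the columns t1.x and t2.y.
t1 = [1, 2, 3, 4, 5]
t2 = [2, 2, 4, 6]

def in_style(t1_rows, t2_rows):
    """Evaluate the subquery once, de-duplicate it, then look up t1 by x."""
    distinct_y = set(t2_rows)  # "select distinct y from t2"
    # Stand-in for an existing index on T1(x); a database would not rebuild it.
    index_on_x = {}
    for x in t1_rows:
        index_on_x.setdefault(x, []).append(x)
    out = []
    for y in sorted(distinct_y):  # sorted only to make the output order stable
        out.extend(index_on_x.get(y, []))
    return out

def exists_style(t1_rows, t2_rows):
    """Scan every row of t1 and probe t2 once per row."""
    out = []
    for x in t1_rows:  # full scan of t1
        if any(y == x for y in t2_rows):  # exists ( select null from t2 where y = x )
            out.append(x)
    return out

print(in_style(t1, t2), exists_style(t1, t2))
```

Both functions return the same rows; they differ only in which table drives the loop, which is the point the answer makes about index use.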
Source: https://stackoverflow.com/questions/2973547/why-is-my-sql-not-in-clause-producing-different-results-from-not-exists