IN vs. JOIN with large rowsets

I'm wanting to select rows in a table where the primary key is in another table. I'm not sure if I should use a JOIN or the IN operator in SQL Server 2005. Is there any si

12 answers

感情败类

2020-11-30 21:16
Update:

This article on my blog summarizes both my answer and my comments on other answers, and shows actual execution plans:
- IN vs. JOIN vs. EXISTS
```
SELECT  *
FROM    a
WHERE   a.c IN (SELECT d FROM b)

SELECT  a.*
FROM    a
JOIN    b
ON      a.c = b.d
```
These queries are not equivalent. They can yield different results if your table b is not key-preserved (i.e., the values of b.d are not unique): the JOIN returns a row of a once per matching row of b, while IN returns it at most once.

The equivalent of the first query is the following:
```
SELECT  a.*
FROM    a
JOIN    (
        SELECT  DISTINCT d
        FROM    b
        ) bo
ON      a.c = bo.d
```
If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.
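
As a rough illustration of "marked as such", either of the following declarations tells the optimizer that b.d has no duplicates. The table and column names are the ones from the queries above; the index and constraint names (`ux_b_d`, `uq_b_d`) are made up for this sketch, and you would only need one of the two.

```sql
-- Option 1: a unique index on b.d (index name is hypothetical)
CREATE UNIQUE INDEX ux_b_d ON b (d);

-- Option 2: a unique constraint on b.d (constraint name is hypothetical).
-- SQL Server enforces it with a unique index behind the scenes.
ALTER TABLE b ADD CONSTRAINT uq_b_d UNIQUE (d);
```

Both statements fail if b.d already contains duplicate values, which is itself a quick check of whether the IN and JOIN forms can differ on your data.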

SQL Server can employ one of the following methods to run this query:
- If there is an index on a.c, b.d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and a plain INNER JOIN is used (with b leading).
- If there is an index on b.d and b.d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
- If there is an index on both b.d and a.c and both tables are large, then MERGE SEMI JOIN is used.
- If there is no index on either table, then a hash table is built on b and HASH SEMI JOIN is used.
None of these methods reevaluates the whole subquery each time.
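
To see which of these strategies the optimizer actually picks for your data, you can ask for the estimated plan as text. A minimal sketch, assuming the tables a and b from above and a client such as Management Studio or sqlcmd that understands the GO batch separator:

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch.
SET SHOWPLAN_TEXT ON;
GO

-- While SHOWPLAN_TEXT is ON the query is not executed;
-- SQL Server returns the estimated plan instead.
SELECT  *
FROM    a
WHERE   a.c IN (SELECT d FROM b);
GO

SET SHOWPLAN_TEXT OFF;
GO
```

In the output, look for the join operator and its logical type, for example a Nested Loops, Merge Join or Hash Match operator labelled as an inner join or a semi join. The exact plan depends on your indexes, statistics and row counts.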

See this entry in my blog for more detail on how this works:
- Counting missing rows: SQL Server
There are links for all of the big four RDBMSs.

梦谈多话

2020-11-30 21:16

From MSDN documentation on Subquery Fundamentals:

Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.

In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.

Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.

情深已故

2020-11-30 21:17
Speaking from experience with a table of 49,000,000 rows, I would recommend LEFT OUTER JOIN. Using IN or EXISTS took 5 minutes to complete, whereas the LEFT OUTER JOIN finished in 1 second.
```
SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index
```
Actually in my query I do this across 9 tables.
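
Note that the WHERE b.d IS NOT NULL filter discards every row of a that found no match, so the query above returns the same rows as a plain inner join. A sketch of that equivalent form, using the same a, b, c and d names; because b.d is a primary key here, each row of a matches at most one row of b, so the result also agrees with the IN version:

```sql
-- Same result set as the LEFT OUTER JOIN ... WHERE b.d IS NOT NULL query:
-- rows of a without a match in b are dropped either way.
SELECT a.*
FROM a
INNER JOIN b ON a.c = b.d;
```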

生来不讨喜

2020-11-30 21:24

The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...

In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.

Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)

我在风中等你

2020-11-30 21:24

Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.

Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.
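
One low-effort way to run that comparison is to turn on SQL Server's timing and I/O statistics and run both versions back to back. This is only a sketch: it reuses the a, b, c and d names from the earlier answers as stand-ins for your own tables, and the numbers will depend on caching, so run each query more than once.

```sql
SET STATISTICS TIME ON;  -- report parse/compile and execution times
SET STATISTICS IO ON;    -- report logical and physical reads per table

-- IN version
SELECT * FROM a WHERE a.c IN (SELECT d FROM b);

-- JOIN version (returns the same rows only if b.d is unique)
SELECT a.* FROM a JOIN b ON a.c = b.d;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

The timings and read counts appear in the Messages output rather than in the result grid.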

Good luck!

青春惊慌失措

2020-11-30 21:25
They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get one row for every matching pair of rows in 'a' and 'b'. So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.
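
The {1,2,3} and {1,2,2,3} case can be reproduced with a couple of table variables. The names @a, @b, c and d are just for this sketch, and single-row INSERTs are used so that it also runs on SQL Server 2005:

```sql
DECLARE @a TABLE (c int);
DECLARE @b TABLE (d int);

INSERT INTO @a VALUES (1);
INSERT INTO @a VALUES (2);
INSERT INTO @a VALUES (3);

INSERT INTO @b VALUES (1);
INSERT INTO @b VALUES (2);
INSERT INTO @b VALUES (2);  -- duplicate value
INSERT INTO @b VALUES (3);

-- IN: each row of @a appears at most once -> 1, 2, 3
SELECT a.c FROM @a AS a WHERE a.c IN (SELECT d FROM @b) ORDER BY a.c;

-- JOIN: one row per matching pair -> 1, 2, 2, 3
SELECT a.c FROM @a AS a JOIN @b AS b ON a.c = b.d ORDER BY a.c;
```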

EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:
```
create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
    ,t1id int references t1(t1id)
)


insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)

insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)


select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id
```
The first two plans are identical. The last plan is a nested loop. This difference is expected because, as I mentioned above, the join has different semantics.