Explanation of using the EXISTS operator on a correlated subquery

Question

What are the mechanics behind the following query?

It looks like a powerful method of doing dynamic filtering on a table.

CREATE TABLE tbl (ID INT, amt INT);

INSERT INTO tbl (ID, amt) VALUES
(1,1),
(1,1),
(1,2),
(1,3),
(2,3),
(2,400),
(3,400),
(3,400);

SELECT *
FROM tbl T1
WHERE EXISTS
  (
    SELECT *
    FROM tbl T2
    WHERE 
       T1.ID = T2.ID AND
       T1.amt < T2.amt
  )

Live test of it here on SQL Fiddle

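Answer 1:

You can usually convert correlated subqueries into an equivalent expression using explicit joins. Here is one way; note that, as written, the IS NULL test keeps the rows that found no match, so it returns the complement of the EXISTS query in the question: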

SELECT DISTINCT T1.*
FROM tbl T1
LEFT OUTER JOIN tbl T2
    ON T1.ID = T2.ID AND
       T1.amt < T2.amt
WHERE T2.ID IS NULL

Martin Smith shows another way.
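
The LEFT OUTER JOIN ... IS NULL pattern above is an anti-join: it keeps the rows of T1 that found no partner in T2. As a rough sketch of the subquery form it corresponds to, here it is written with NOT EXISTS against the question's table (the rows in the closing comment were worked out by hand from the sample data):

```sql
-- Anti-join written as a correlated subquery: keep the rows for which
-- NO row with the same ID and a bigger amt exists.
SELECT DISTINCT T1.*
FROM tbl T1
WHERE NOT EXISTS
  (
    SELECT *
    FROM tbl T2
    WHERE T1.ID = T2.ID AND
          T1.amt < T2.amt
  );
-- With the sample data: (1,3), (2,400), (3,400)
```

Without DISTINCT, (3,400) would come back twice, once per stored copy.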

Correlated subqueries are indeed a "powerful way of doing dynamic filtering", but that is (usually) unimportant, because you can do the same filtering using other SQL constructs.

Why use correlated subqueries? There are several positives and several negatives, and one important reason that is both. On the positive side, you do not have to worry about "multiplication" of rows, as happens in the join query above (which is why it needs DISTINCT). Also, when you have other filtering conditions, the correlated subquery is often more efficient. And sometimes, in a DELETE or UPDATE, it seems to be the only way to express a query.
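
To make the "multiplication" point concrete, here is a small sketch on the question's table that compares a plain inner join with the EXISTS form. The row counts in the comments were derived by hand from the sample data:

```sql
-- Join version: one output row per matching (T1, T2) pair.
SELECT T1.*
FROM tbl T1
INNER JOIN tbl T2
    ON T1.ID = T2.ID AND
       T1.amt < T2.amt;
-- 6 rows: each of the two copies of (1,1) pairs with (1,2) and (1,3),
-- (1,2) pairs with (1,3), and (2,3) pairs with (2,400).

-- EXISTS version: a T1 row is returned once, however many partners it has.
SELECT T1.*
FROM tbl T1
WHERE EXISTS
  (
    SELECT *
    FROM tbl T2
    WHERE T1.ID = T2.ID AND
          T1.amt < T2.amt
  );
-- 4 rows: (1,1), (1,1), (1,2), (2,3)
```

The join has to be followed by DISTINCT to get rid of the extra pairs, and DISTINCT also merges the two genuine copies of (1,1), which the EXISTS form keeps.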

The Achilles heel is that many SQL optimizers implement correlated subqueries as nested loop joins (even though they do not have to). So, they can be highly inefficient at times. However, the particular "exists" construct that you have is often quite efficient.

In addition, the nature of the joins between the tables can get lost in nested subqueries, which complicate the conditions in WHERE clauses. It can get hard to understand what is going on in more complicated cases.

My recommendation: if you are going to use them on large tables, learn about SQL execution plans in your database. Correlated subqueries can bring out the best or the worst in SQL performance.
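
As a starting point for reading plans, here is a minimal sketch that assumes SQL Server and a client such as SSMS or sqlcmd, where GO separates batches. SET SHOWPLAN_TEXT asks the server to return the estimated plan as text instead of running the statement:

```sql
-- SET SHOWPLAN_TEXT must be the only statement in its batch,
-- which is why each step is closed with GO.
SET SHOWPLAN_TEXT ON;
GO

-- The query is not executed; its estimated plan is returned instead.
SELECT *
FROM tbl T1
WHERE EXISTS
  (
    SELECT *
    FROM tbl T2
    WHERE T1.ID = T2.ID AND
          T1.amt < T2.amt
  );
GO

SET SHOWPLAN_TEXT OFF;
GO
```

In the output, an EXISTS that the optimizer handles well often appears as a semi join operator rather than as a subquery that is re-run for every outer row.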

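Edit: this is closer to the query in the question, although DISTINCT collapses duplicate rows, so (1,1) comes back once here while the EXISTS query returns it twice: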

SELECT DISTINCT T1.*
FROM tbl T1
INNER JOIN tbl T2
    ON T1.ID = T2.ID AND
       T1.amt < T2.amt

Answer 2:

Let's translate this to English:

"Select the rows from tbl for which tbl has another row with the same ID and a bigger amt."

What this does is select everything except the rows with maximum values of amt for each ID.
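
That reading can be checked with a sketch that states the "below the maximum" condition directly. It is an illustrative rewrite rather than part of the original query, and the result in the comment was worked out by hand from the sample data:

```sql
-- Keep every row whose amt is below the largest amt stored for its ID.
SELECT T1.*
FROM tbl T1
WHERE T1.amt <
  (
    SELECT MAX(T2.amt)
    FROM tbl T2
    WHERE T2.ID = T1.ID
  );
-- Per-ID maxima are 3 (ID 1), 400 (ID 2) and 400 (ID 3), so the result is
-- (1,1), (1,1), (1,2), (2,3): the same rows the EXISTS query returns.
```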

Note, the last line SELECT * FROM tbl is a separate query and probably not related to the question at hand.

Answer 3:

As others have already pointed out, using EXISTS in a correlated subquery is essentially telling the database engine "return all records for which there is a corresponding record which meets the criteria specified in the subquery." But there's more.

An EXISTS expression evaluates to a boolean value. It could also be taken to mean "Where at least one record exists that matches the criteria in the subquery's WHERE clause." In other words, if a single record is found, "I'm done, and I don't need to search any further."

The efficiency gain that CAN result from using EXISTS in a correlated subquery comes from the fact that as soon as EXISTS returns TRUE, the subquery stops scanning records and returns a result. Similarly, a subquery which employs NOT EXISTS can return FALSE as soon as ANY record matches the criteria in the WHERE clause of the subquery.
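
One way to see the difference is to put a counting formulation next to the existence test. Both sketches below return the same rows on the question's table. The first logically asks how many partners each row has, while the second only asks whether there is one. Whether the engine really counts everything in the first case depends on the optimizer, and some optimizers rewrite it into the second form:

```sql
-- Counting form: logically needs the number of matching rows per outer row.
SELECT T1.*
FROM tbl T1
WHERE
  (
    SELECT COUNT(*)
    FROM tbl T2
    WHERE T1.ID = T2.ID AND
          T1.amt < T2.amt
  ) > 0;

-- Existence form: the first matching row is enough to settle the answer.
SELECT T1.*
FROM tbl T1
WHERE EXISTS
  (
    SELECT *
    FROM tbl T2
    WHERE T1.ID = T2.ID AND
          T1.amt < T2.amt
  );
```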

I believe the idea is that the subquery using EXISTS is SUPPOSED to avoid the use of nested loop searches. As @Gordon Linoff states above though, the query optimizer may or may not perform as desired. I believe MS SQL Server usually takes full advantage of EXISTS.

My understanding is that not all queries benefit from EXISTS, but often, they will, particularly in the case of simple structures such as that in your example.

I may have butchered some of this, but conceptually I believe it's on the right track.

The caveat is that if you have a performance-critical query, it would be best to compare the execution of a version using EXISTS against one using simple JOINs, as Mr. Linoff indicates. Depending on your database engine, table structure, time of day, and the alignment of the moon and stars, it is not cut-and-dried which will be faster.
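
A minimal way to run that comparison, assuming SQL Server, is to switch on the I/O and timing statistics for the session and execute both versions back to back. The figures appear in the Messages output and will vary with data volume, indexes and caching, so none are quoted here:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Version 1: correlated EXISTS.
SELECT T1.*
FROM tbl T1
WHERE EXISTS
  (
    SELECT *
    FROM tbl T2
    WHERE T1.ID = T2.ID AND
          T1.amt < T2.amt
  );

-- Version 2: join plus DISTINCT (note that it merges duplicate rows,
-- so on the sample data it returns one row fewer than version 1).
SELECT DISTINCT T1.*
FROM tbl T1
INNER JOIN tbl T2
    ON T1.ID = T2.ID AND
       T1.amt < T2.amt;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

On a table as small as the sample, both versions will look the same; the comparison only becomes meaningful on realistic data volumes.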

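Last note - I agree with lc. about not writing SELECT * in the subquery, though mainly for clarity: EXISTS only tests whether the subquery returns a row, and SQL Server does not need to evaluate the subquery's select list, so SELECT * is unlikely to cost you any of the performance gain. SELECT 1 or the PK field(s) simply states the intent more plainly.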
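
A quick way to check how much the select list of an EXISTS subquery matters on your own engine is to put something in it that would fail if it were evaluated. The sketch below assumes SQL Server, where this is a commonly used demonstration and is expected to run without a divide-by-zero error. Other engines may behave differently:

```sql
-- If the select list were evaluated, 1/0 would raise a divide-by-zero error.
SELECT T1.*
FROM tbl T1
WHERE EXISTS
  (
    SELECT 1/0
    FROM tbl T2
    WHERE T1.ID = T2.ID AND
          T1.amt < T2.amt
  );
```

If the statement succeeds, the engine is only testing for the presence of a row, and the choice between SELECT *, SELECT 1 and a key column is a matter of style.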

Source: https://stackoverflow.com/questions/11591753/explanation-of-using-the-operator-exists-on-a-correlated-subqueries

Tags: sql, sql-server, exists