Query equivalence with DISTINCT

Question

Consider a simple table orders(id: uniqueidentifier, category: int, order_date: int) created using the following script:

IF OBJECT_ID('dbo.orders', 'U') IS NOT NULL  DROP TABLE dbo.orders

SELECT TOP 1000000
      NEWID() id, 
      ABS(CHECKSUM(NEWID())) % 100 category, 
      ABS(CHECKSUM(NEWID())) % 10000 order_date
INTO orders
FROM    sys.sysobjects
CROSS JOIN sys.all_columns

Now, I have two equivalent queries (at least I believe that they are equivalent):

-- Q1
select distinct o1.category,
       (select count(*) from orders o2 where order_date = 1 and o1.category = o2.category) 
from orders o1

-- Q2
select o1.category,
       (select count(*) from orders o2 where order_date = 1 and o1.category = o2.category) 
from (select distinct category from orders) o1

However, when I run these queries they perform very differently. Q2 is twice as fast on my data, and this is clearly because its query plan first finds the unique categories (the hash match in the query plans) before the join.

The difference remains if I add the requested index:

CREATE NONCLUSTERED INDEX ix_order_date ON orders(order_date)
INCLUDE (category)

Moreover, Q2 can also make efficient use of the following index, whereas the plan for Q1 remains the same:

CREATE NONCLUSTERED INDEX ix_orders_kat ON orders(category, order_date)

My questions are:

Are these queries equivalent?
If yes, what is the obstacle for the SQL Server 2016 query optimizer to find the second query plan in the case of Q1 (I believe that the search space must be quite small in this case)?
If no, could you post a counter example?

EDIT

My motivation for the question is that I would like to understand why query optimizers are so poor at rewriting even simple queries and why they rely on SQL syntax so heavily. SQL is a declarative language, so why are SQL query processors so often driven by syntax, even for simple queries like this one?

Answer 1:

The queries are functionally equivalent, meaning that they should return the same data.
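One way to check this on the sample table is to compare the two result sets with EXCEPT in both directions. This is only a sketch: the alias cnt is introduced here for illustration, and an empty result shows that the queries agree on this particular data, not that they are equivalent in general.

```sql
-- Rows returned by Q1 but not by Q2, followed by rows returned by Q2 but not by Q1.
-- An empty result means both queries produced the same set of rows on this data.
(
    SELECT DISTINCT o1.category,
           (SELECT COUNT(*) FROM orders o2
            WHERE o2.order_date = 1 AND o1.category = o2.category) AS cnt
    FROM orders o1
    EXCEPT
    SELECT o1.category,
           (SELECT COUNT(*) FROM orders o2
            WHERE o2.order_date = 1 AND o1.category = o2.category) AS cnt
    FROM (SELECT DISTINCT category FROM orders) o1
)
UNION ALL
(
    SELECT o1.category,
           (SELECT COUNT(*) FROM orders o2
            WHERE o2.order_date = 1 AND o1.category = o2.category) AS cnt
    FROM (SELECT DISTINCT category FROM orders) o1
    EXCEPT
    SELECT DISTINCT o1.category,
           (SELECT COUNT(*) FROM orders o2
            WHERE o2.order_date = 1 AND o1.category = o2.category) AS cnt
    FROM orders o1
);
```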

However, they are interpreted differently by the SQL engine. The first (SELECT DISTINCT) generates all the results and then removes the duplicates.

The second extracts the distinct values first, so the subquery is only called on the appropriate subset.

An index might make either query more efficient, but it won't fundamentally affect whether the distinct processing occurs before or after the subquery.
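To see this on your own data, one option is to run both queries with I/O and timing statistics enabled, once before and once after creating an index, and compare the logical reads and elapsed times reported in the Messages output. This is a rough measurement sketch, not a benchmark; the figures depend on the data, the hardware, and what is already cached.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Q1: DISTINCT is written on the outer query
SELECT DISTINCT o1.category,
       (SELECT COUNT(*) FROM orders o2
        WHERE o2.order_date = 1 AND o1.category = o2.category)
FROM orders o1;

-- Q2: DISTINCT is written in the derived table, before the subquery
SELECT o1.category,
       (SELECT COUNT(*) FROM orders o2
        WHERE o2.order_date = 1 AND o1.category = o2.category)
FROM (SELECT DISTINCT category FROM orders) o1;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Looking at the actual execution plans alongside these figures shows where the duplicate-removal operator sits relative to the join in each case.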

In this case, the results are the same. However, that does not necessarily hold for every subquery.
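The answer does not name a subquery for which the two forms differ. One candidate, offered here as an assumption rather than as the answer's own example, is a non-deterministic subquery. The alias sample_id is hypothetical.

```sql
-- Variant of Q1: if the subquery is evaluated once per row of orders,
-- each row may pick a different id, so DISTINCT can leave
-- many (category, sample_id) pairs for the same category.
SELECT DISTINCT o1.category,
       (SELECT TOP 1 o2.id FROM orders o2
        WHERE o2.category = o1.category
        ORDER BY NEWID()) AS sample_id
FROM orders o1;

-- Variant of Q2: the derived table yields one row per category,
-- so the result has exactly one row per category.
SELECT o1.category,
       (SELECT TOP 1 o2.id FROM orders o2
        WHERE o2.category = o1.category
        ORDER BY NEWID()) AS sample_id
FROM (SELECT DISTINCT category FROM orders) o1;
```

With a deterministic subquery that depends only on o1.category, as in the original Q1 and Q2, every row of a given category produces the same value, which is why the two forms agree there.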

Source: https://stackoverflow.com/questions/48227588/query-equivalence-with-distinct

Tags

sql

performance

query-optimization

sql-server-2016