How to quickly select DISTINCT dates from a Date/Time field, SQL Server

Question

I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.

My problem isn't getting the server to actually do this (I've seen this question already, and we had something similar already in place using DISTINCT). The problem is whether there is any trick to get it done more quickly. With the data we are using, our current query returns ~80 distinct days from ~40,000 rows of data (after filtering on another indexed column). There is an index on the date column, and the query always takes 5+ seconds, which is too slow.

Changing the database structure might be an option, but a less desirable one.

Answer 1:

Every option that involves CAST or TRUNCATE or DATEPART manipulation on the datetime field has the same problem: the query has to scan the entire result set (the 40k rows) in order to find the distinct dates. Performance may vary marginally between the various implementations.

What you really need is an index that can produce the response in a blink. You can have either a persisted computed column with an index on it (requires table structure changes) or an indexed view (requires Enterprise Edition for the query optimizer to consider the index out-of-the-box).

Persisted computed column:

alter table foo add date_only as convert(char(8), [datetimecolumn], 112) persisted;
create index idx_foo_date_only on foo(date_only);

Indexed view:

create view v_foo_with_date_only
with schemabinding as 
select id
    , convert(char(8), [datetimecolumn], 112) as date_only
from dbo.foo;
go
-- create view must be alone in its batch, hence the go above
create unique clustered index idx_v_foo on v_foo_with_date_only(date_only, id);

Update

To completely eliminate the scan, one could use a GROUP BY trick in an indexed view, like this:

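create view v_foo_with_date_only
with schemabinding as 
select
    convert(char(8), [datetimecolumn], 112) as date_only
    , count_big(*) as [dummy]  -- count_big(*) is required in an indexed view that uses group by
from dbo.foo
group by convert(char(8), [datetimecolumn], 112);
go
-- create view must be alone in its batch, hence the go above
create unique clustered index idx_v_foo on v_foo_with_date_only(date_only);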

The query select distinct date_only from foo will use this indexed view instead. It is still technically a scan, but on an already 'distinct' index, so only the needed records are scanned. It's a hack, I reckon, and I would not recommend it for live production code.

AFAIK SQL Server does not have the capability of scanning a true index while skipping repeats, i.e. seek the top value, then seek greater than the top, then successively seek greater than the last value found.
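The seek pattern described above can be emulated by hand. The following is only a sketch, not something from the original answer: it reuses the dbo.foo table and [datetimecolumn] column from the examples in this answer, while the @d variable and the @days table variable are hypothetical names introduced for illustration. Whether each MIN() probe is actually resolved as an index seek depends on an index leading on [datetimecolumn] and on the plan the optimizer picks.

```sql
-- Sketch only: emulate a "skip scan" with one MIN() probe per distinct day.
-- Assumes dbo.foo has an index whose leading column is [datetimecolumn].
DECLARE @d datetime;
DECLARE @days TABLE (date_only datetime NOT NULL PRIMARY KEY);

-- Seek the earliest value in the index.
SELECT @d = MIN([datetimecolumn]) FROM dbo.foo;

WHILE @d IS NOT NULL
BEGIN
    -- Truncate the value found to midnight and record that day.
    SET @d = DATEADD(day, DATEDIFF(day, 0, @d), 0);
    INSERT INTO @days (date_only) VALUES (@d);

    -- Seek the first value on a later day; MIN() yields NULL when none is left,
    -- which ends the loop.
    SELECT @d = MIN([datetimecolumn])
    FROM dbo.foo
    WHERE [datetimecolumn] >= DATEADD(day, 1, @d);
END;

SELECT date_only FROM @days ORDER BY date_only;
```

With ~80 distinct days this issues roughly 80 small probes instead of reading all 40,000 rows. If the query also filters on another column, that predicate would have to be added to both MIN() probes.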

Answer 2:

I've used the following:

CAST(FLOOR(CAST(@date as FLOAT)) as DateTime);

This removes the time from the date by converting it to a float and truncating off the "time" part, which is the decimal of the float.

Looks a little clunky but works well on a large dataset (~100,000 rows) I use repeatedly throughout the day.
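Applied to a column rather than a variable, the same expression might look like the sketch below. The dbo.foo table and [datetimecolumn] column are borrowed from the first answer and are assumptions here. Note that the cast to float relies on the column being of type datetime; SQL Server does not allow an explicit conversion from datetime2 to float.

```sql
-- Sketch only: strip the time part by flooring the float representation.
-- Assumes [datetimecolumn] is of type datetime (datetime2 cannot be cast to float).
SELECT DISTINCT
    CAST(FLOOR(CAST([datetimecolumn] AS float)) AS datetime) AS date_only
FROM dbo.foo;
```

The expression is still evaluated for every row, so this changes how the date is extracted, not how many rows are read.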

Answer 3:

This works for me:

SELECT distinct(CONVERT(varchar(10), {your date column}, 111)) 
FROM {your table name}

Answer 4:

The simplest way is to add a computed column for just the date portion, and select on that. You could do this in a view if you don't want to change the table.
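A minimal sketch of the view variant, assuming the dbo.foo table with id and [datetimecolumn] columns from the first answer; the view name v_foo_dates is hypothetical. A plain (non-indexed) view stores nothing, so the date expression is still computed per row at query time; the view only avoids changing the table.

```sql
-- Sketch only: expose a date-only column through a view, leaving dbo.foo unchanged.
-- GO is the batch separator used by SSMS and sqlcmd; CREATE VIEW must be alone in its batch.
CREATE VIEW dbo.v_foo_dates
AS
SELECT id,
       DATEADD(day, DATEDIFF(day, 0, [datetimecolumn]), 0) AS date_only
FROM dbo.foo;
GO

SELECT DISTINCT date_only
FROM dbo.v_foo_dates;
```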

Answer 5:

I'm not sure why your existing query would take over 5s for 40,000 rows.

I just tried the following query against a table with 100,000 rows and it returned in less than 0.1s.

SELECT DISTINCT DATEADD(day, 0, DATEDIFF(day, 0, your_date_column))
FROM your_table

(Note that this query probably won't be able to take advantage of any indexes on the date column, but it should be reasonably quick, assuming that you're not executing it dozens of times per second.)

Answer 6:

Update:

Solution below tested for efficiency on a 2M table and takes but 40 ms.

Plain DISTINCT on an indexed computed column took 9 seconds.

See this entry in my blog for performance details:

SQL Server: efficient DISTINCT on dates

Unfortunately, SQL Server's optimizer can do neither Oracle's SKIP SCAN nor MySQL's INDEX FOR GROUP-BY.

It's always Stream Aggregate that takes long.

You can build a list of possible dates using a recursive CTE and join it with your table:

WITH    rows AS (
        SELECT  CAST(CAST(CAST(MIN(date) AS FLOAT) AS INTEGER) AS DATETIME) AS mindate, MAX(date) AS maxdate
        FROM    mytable
        UNION ALL
        SELECT  mindate + 1, maxdate
        FROM    rows
        WHERE   mindate < maxdate
        )
SELECT  mindate
FROM    rows
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable
        WHERE   date >= mindate
                AND date < mindate + 1
        )
OPTION  (MAXRECURSION 0)

This will be more efficient than Stream Aggregate, since each candidate day needs only one index probe.

Answer 7:

I used this (note that DATE_FORMAT is MySQL syntax and is not available in SQL Server):

SELECT
DISTINCT DATE_FORMAT(your_date_column,'%Y-%m-%d') AS date
FROM ...

Answer 8:

What is your predicate on that other filtered column? Have you tried whether you get an improvement from an index on that other filtered column, followed by the datetime field?

I'm largely guessing here, but 5 seconds to filter a set of perhaps 100,000 rows down to 40,000 and then do a sort (which is presumably what goes on) doesn't seem like an unreasonable time to me. Why do you say it's too slow? Because it doesn't match expectations?
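As a sketch of the composite index suggested above: the question does not name the filtered column, so filter_col, its int type, and @filter_value are hypothetical, and dbo.foo with [datetimecolumn] is borrowed from the first answer.

```sql
-- Sketch only: index on the filtered column followed by the datetime column.
-- filter_col and @filter_value are hypothetical; the question does not name them.
CREATE INDEX idx_foo_filter_date
    ON dbo.foo (filter_col, [datetimecolumn]);

DECLARE @filter_value int;
SET @filter_value = 42;

-- The index covers this query, so only matching index rows need to be read.
SELECT DISTINCT DATEADD(day, DATEDIFF(day, 0, [datetimecolumn]), 0) AS date_only
FROM dbo.foo
WHERE filter_col = @filter_value;
```

For an equality predicate, the matching rows sit together in the index and are already ordered by [datetimecolumn], which may let the optimizer avoid a separate sort. Whether it does is something to confirm in the actual execution plan.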

Answer 9:

Just convert the date: dateadd(dd,0, datediff(dd,0,[Some_Column]))

Answer 10:

If you want to avoid the step of extracting or reformatting the date, which is presumably the main cause of the delay (by forcing a full table scan), you have no alternative but to store the date-only part of the datetime, which unfortunately will require an alteration to the database structure.

If you're using SQL Server 2005 or later, then a persisted computed column is the way to go:

Unless otherwise specified, computed columns are virtual columns that are
not physically stored in the table. Their values are recalculated every 
time they are referenced in a query. The Database Engine uses the PERSISTED 
keyword in the CREATE TABLE and ALTER TABLE statements to physically store 
computed columns in the table. Their values are updated when any columns 
that are part of their calculation change. By marking a computed column as 
PERSISTED, you can create an index on a computed column that is deterministic
but not precise.

Source: https://stackoverflow.com/questions/1307393/how-to-quickly-select-distinct-dates-from-a-date-time-field-sql-server

Tags

sql-server

datetime

distinct