问题

I have a data set that I need to filter a date that is stored as a string (changing the source column to a DateTime is NOT a option, this data is coming from a 3rd party source that I can not control).

One of the dates is malformed so if I do the following query I get one result

select ClientID, StartDate from boarding_appts where isdate(StartDate) = 0

ClientID   StartDate
---------- --------------------
5160       5/6/210 12:00:00

If I do a cast(StartDate as datetime) I get "Arithmetic overflow error converting expression to data type datetime.", which I expected. and if I filter by IsDate alone everything works fine

select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  from boarding_appts where isdate(StartDate) = 1

ClientID   StartDate               age
---------- ----------------------- ----------
10207      2012-06-09 12:00:00.000 1
2843       2012-06-23 12:00:00.000 1
2843       2012-06-23 12:00:00.000 1
8292       2012-05-11 12:00:00.000 1
7935       2012-04-24 12:00:00.000 1
... (1000's of more rows) ...

Here is my problem:

I want to filter out records so only records a year old or newer show up, however no-matter how I attempt to perform the filter every one of these queries give me an arithmetic overflow error.

select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  
from boarding_appts 
where isdate(StartDate) = 1 
    and datediff(year, cast(StartDate as dateTime), getdate()) < 1 --If you comment out this line it works fine

select * 
from (select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  from boarding_appts where isdate(StartDate) = 1) as Filtered
where age < 1 --If you comment out this line it works fine

select * 
from (select ClientID, cast(StartDate as dateTime) as StartDateCast  from boarding_appts where isdate(StartDate) = 1) as Filtered
where datediff(year, StartDateCast, getdate()) < 1 --If you comment out this line it works fine

;with Filtered as
(select ClientID, cast(StartDate as dateTime) as StartDateCast  from boarding_appts where isdate(StartDate) = 1)
select * from Filtered
where datediff(year, StartDateCast, getdate()) < 1 --If you comment out this line it works fine

;with Filtered as
(select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  from boarding_appts where isdate(StartDate) = 1)
select * from Filtered
where age < 1 --If you comment out this line it works fine

Here is a test set of data on SQL Fiddle for you to try out any solutions on. I am out of ideas on how to fix this. The ONLY solution I could think of that worked was selecting in to a temporary table first then selecting it out

select ClientID, StartDate, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age  
into #t
from boarding_appts 
where isdate(StartDate) = 1 

select * from #t where age < 1 --Works.

回答1:

SQL is a declarative language. The SQL optimizer is free to rearrange parts of the where clause as long as it retains its original meaning. So it can run datediff before isdate even if you specify isdate first. A subquery or CTE provides no sure relief, since that too can be rewritten.

The second suggestion from Aaron Bertrand in the comments:

WHERE   CASE ISDATE(StartDate) 
        WHEN 1 THEN StartDate 
        ELSE '19000101' 
        END >= DATEADD(YEAR, -1, GETDATE());

Makes it unlikely that SQL Server will cast StartDate to a datetime when ISDATE = 0. That seems like the best solution.

I've marked this answer community wiki, if Aaran Bertrand posts an answer, accept that :)

回答2:

SQL Server's DateTime has the domain 1753-01-01 00:00:00.000 ≤ x ≤ 9999-12-31 23:59:59.997. The year 210 CE is outside that domain. Hence the problem.

If you were using SQL Server 2008 or later, you could cast it to a DateTime2 datatype and you'd be golden (its domain is 0001-01-01 00:00:00.0000000 &le x ≤ 9999-12-31 23:59:59.9999999. But with SQL Server 2005, you're pretty much SOL.

This is really a problem of data cleaning. My inclination in cases like this is to load the 3rd party data into a staging table with each field as character strings. Then cleanse the data in place, replacing, for instance, invalid dates with NULL. Once cleansed, then do the necessary conversion work to move it to its final destination.

Another approach is to use pattern matching and do the date filtering without converting anything to datetime. ISO 8601 date/time values are character strings that have the laudable property of being (A) human-readable and (B) collating and comparing properly.

What I've done in the past is some analytical work to identify all the patterns in the datetime field by replacing decimal digits with a 'd' and then running group by to compute the counts of each different pattern found. Once you have that you can create some pattern tables to guide you. Something like these:

create table #datePattern
(
  pattern varchar(64) not null primary key clustered ,
  monPos  int         not null ,
  monLen  int         not null ,
  dayPos  int         not null ,
  dayLen  int         not null ,
  yearPos int         not null ,
  yearLen int         not null ,
)

insert #datePattern values ( '[0-9]/[0-9]/[0-9] %'                          ,1,1,3,1,5,1)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9] %'                     ,1,1,3,1,5,2)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9][0-9] %'                ,1,1,3,1,5,3)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9][0-9][0-9] %'           ,1,1,3,1,5,4)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9] %'                     ,1,1,3,2,6,1)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9] %'                ,1,1,3,2,6,2)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9][0-9] %'           ,1,1,3,2,6,3)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %'      ,1,1,3,2,6,4)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9] %'                     ,1,2,4,1,6,1)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9] %'                ,1,2,4,1,6,2)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9][0-9] %'           ,1,2,4,1,6,3)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9][0-9][0-9] %'      ,1,2,4,1,6,4)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9] %'                ,1,2,4,2,7,1)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9] %'           ,1,2,4,2,7,2)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9] %'      ,1,2,4,2,7,3)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %' ,1,2,4,2,7,4)

create table #timePattern
(
  pattern varchar(64) not null primary key clustered ,
  hhPos int not null ,
  hhLen int not null ,
  mmPos int not null ,
  mmLen int not null ,
  ssPos int not null ,
  ssLen int not null ,
)
insert #timePattern values ( '[0-9]:[0-9]:[0-9]'                ,1,1,3,1,5,1 )
insert #timePattern values ( '[0-9]:[0-9]:[0-9][0-9]'           ,1,1,3,1,5,2 )
insert #timePattern values ( '[0-9]:[0-9][0-9]:[0-9]'           ,1,1,3,2,6,1 )
insert #timePattern values ( '[0-9]:[0-9][0-9]:[0-9][0-9]'      ,1,1,3,2,6,2 )
insert #timePattern values ( '[0-9][0-9]:[0-9]:[0-9]'           ,1,2,4,1,6,1 )
insert #timePattern values ( '[0-9][0-9]:[0-9]:[0-9][0-9]'      ,1,2,4,1,6,2 )
insert #timePattern values ( '[0-9][0-9]:[0-9][0-9]:[0-9]'      ,1,2,4,2,7,1 )
insert #timePattern values ( '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]' ,1,2,4,2,7,2 )

You could combine these two tables into 1 but the number of combinations tends to explode things, though it greatly simplifies the query then.

Once you have that, the query is [fairly] easy, given that SQL is not exactly the world's best language choice for string processing:

---------------------------------------------------------------------
-- first, get your lower bound in ISO 8601 format yyyy-mm-dd hh:mm:ss
-- This will compare/collate properly
---------------------------------------------------------------------
declare @dtLowerBound varchar(255)
set @dtLowerBound = convert(varchar,dateadd(year,-1,current_timestamp),121)

-----------------------------------------------------------------
-- select rows with a start date more recent than the lower bound
-----------------------------------------------------------------
select isoDate =       + right( '0000' + substring( t.startDate , coalesce(dt.yearPos,1) , coalesce(dt.YearLen,0) ) , 4 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.monPos,1)  , coalesce(dt.MonLen,0)  ) , 2 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.dayPos,1)  , coalesce(dt.dayLen,0)  ) , 2 )
                 + case
                   when tm.pattern is not null then
                       ' ' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.hhPos , tm.hhLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.mmPos , tm.mmLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.ssPos , tm.ssLen ) , 2 )
                   else ''
                   end
,*
from someTableWithBadData t
left join #datePattern dt on t.startDate like dt.pattern
left join #timePattern tm on ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) )
                             like tm.pattern
where @lowBound <=        + right( '0000' + substring( t.startDate , coalesce(dt.yearPos,1) , coalesce(dt.YearLen,0) ) , 4 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.monPos,1)  , coalesce(dt.MonLen,0)  ) , 2 )
                 + '-' + right(   '00' + substring( t.startDate , coalesce(dt.dayPos,1)  , coalesce(dt.dayLen,0)  ) , 2 )
                 + case
                   when tm.pattern is not null then
                       ' ' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.hhPos , tm.hhLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.mmPos , tm.mmLen ) , 2 )
                     + ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.ssPos , tm.ssLen ) , 2 )
                   else ''
                   end

Like I said, SQL not the best choice for munging strings.

This should get you ... 90% there. Experience tells me that you'll still find more bad dates: months less than 1 or greater than 12 , days less than 1 or greater than 31, or days out of range for that month (nothing like February 31st to make the computer whine), etc. Old cobol programs in particular, loved to use a field of all 9s to indicate missing data, for instance (though that is an easy case to deal with).

My preferred technique is to write a perl script to scrub the data and bulk load it into SQL Server, using perl's BCP facilities. That's exactly the sort of problem space perl is designed for.

来源：https://stackoverflow.com/questions/16487264/i-still-get-a-arithmetic-overflow-when-i-filter-on-a-cast-datetime-even-if-i-u

标签

sql