Question
I have a data set that I need to filter on a date that is stored as a string (changing the source column to a DateTime is NOT an option; this data comes from a third-party source that I cannot control).
One of the dates is malformed, so if I run the following query I get one result:
select ClientID, StartDate from boarding_appts where isdate(StartDate) = 0
ClientID StartDate
---------- --------------------
5160 5/6/210 12:00:00
If I do a cast(StartDate as datetime) I get "Arithmetic overflow error converting expression to data type datetime.", which I expected, and if I filter by IsDate alone everything works fine:
select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age from boarding_appts where isdate(StartDate) = 1
ClientID StartDate age
---------- ----------------------- ----------
10207 2012-06-09 12:00:00.000 1
2843 2012-06-23 12:00:00.000 1
2843 2012-06-23 12:00:00.000 1
8292 2012-05-11 12:00:00.000 1
7935 2012-04-24 12:00:00.000 1
... (1000's of more rows) ...
Here is my problem:
I want to filter out records so that only records a year old or newer show up; however, no matter how I attempt to perform the filter, every one of these queries gives me an arithmetic overflow error.
select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age
from boarding_appts
where isdate(StartDate) = 1
and datediff(year, cast(StartDate as dateTime), getdate()) < 1 --If you comment out this line it works fine
select *
from (select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age from boarding_appts where isdate(StartDate) = 1) as Filtered
where age < 1 --If you comment out this line it works fine
select *
from (select ClientID, cast(StartDate as dateTime) as StartDateCast from boarding_appts where isdate(StartDate) = 1) as Filtered
where datediff(year, StartDateCast, getdate()) < 1 --If you comment out this line it works fine
;with Filtered as
(select ClientID, cast(StartDate as dateTime) as StartDateCast from boarding_appts where isdate(StartDate) = 1)
select * from Filtered
where datediff(year, StartDateCast, getdate()) < 1 --If you comment out this line it works fine
;with Filtered as
(select ClientID, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age from boarding_appts where isdate(StartDate) = 1)
select * from Filtered
where age < 1 --If you comment out this line it works fine
Here is a test set of data on SQL Fiddle for you to try out any solutions on. I am out of ideas on how to fix this. The ONLY solution I could think of that worked was selecting into a temporary table first and then selecting from it:
select ClientID, StartDate, cast(StartDate as dateTime) as StartDateCast, datediff(year, cast(StartDate as dateTime), getdate()) as age
into #t
from boarding_appts
where isdate(StartDate) = 1
select * from #t where age < 1 --Works.
Answer 1:
SQL is a declarative language. The SQL optimizer is free to rearrange parts of the where clause as long as it retains its original meaning. So it can run datediff before isdate even if you specify isdate first. A subquery or CTE provides no sure relief, since that too can be rewritten.
The second suggestion from Aaron Bertrand in the comments:
WHERE CASE ISDATE(StartDate)
WHEN 1 THEN StartDate
ELSE '19000101'
END >= DATEADD(YEAR, -1, GETDATE());
This makes it unlikely that SQL Server will cast StartDate to a datetime when ISDATE = 0. That seems like the best solution.
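A minimal sketch of how that guard might be applied to the full query from the question, assuming the boarding_appts table and StartDate column shown there. Here the cast sits inside the CASE, so the conversion is only attempted on the branch where ISDATE returns 1; rows that fail the check yield NULL and drop out of the comparison. CASE is not an ironclad guarantee of evaluation order in every situation (expressions involving aggregates are a known exception), so treat this as a sketch rather than a proven fix.

```sql
-- Guard every cast with CASE so it is only attempted on strings ISDATE accepts.
-- Rows where ISDATE = 0 produce NULL, and NULL >= <date> is not true,
-- so they are filtered out without ever being converted.
select ClientID ,
       case when isdate(StartDate) = 1
            then cast(StartDate as datetime)
       end as StartDateCast
from   boarding_appts
where  case when isdate(StartDate) = 1
            then cast(StartDate as datetime)
       end >= dateadd(year, -1, getdate())
```

The comparison against dateadd(year, -1, getdate()) follows the suggestion above rather than the original datediff(year, ...) < 1 test; the two are not identical, since datediff counts calendar-year boundaries crossed.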
I've marked this answer community wiki; if Aaron Bertrand posts an answer, accept that :)
Answer 2:
SQL Server's DateTime has the domain 1753-01-01 00:00:00.000 ≤ x ≤ 9999-12-31 23:59:59.997. The year 210 CE is outside that domain. Hence the problem.
If you were using SQL Server 2008 or later, you could cast it to a DateTime2 datatype and you'd be golden (its domain is 0001-01-01 00:00:00.0000000 ≤ x ≤ 9999-12-31 23:59:59.9999999). But with SQL Server 2005, you're pretty much SOL.
This is really a problem of data cleaning. My inclination in cases like this is to load the 3rd party data into a staging table with each field as character strings. Then cleanse the data in place, replacing, for instance, invalid dates with NULL. Once cleansed, then do the necessary conversion work to move it to its final destination.
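As a rough sketch of that staging approach: the table name staging_boarding_appts and its column sizes are hypothetical, and the bulk-load step is left out. Note that ISDATE depends on the session's language and date format settings, so what counts as "invalid" here is whatever the current session rejects.

```sql
-- hypothetical staging table: every field is loaded as a character string
create table staging_boarding_appts
(
  ClientID  varchar(32) null ,
  StartDate varchar(64) null
)

-- (bulk load the third-party data into staging_boarding_appts here)

-- cleanse in place: anything ISDATE rejects becomes NULL
update staging_boarding_appts
set    StartDate = null
where  StartDate is not null
  and  isdate(StartDate) = 0

-- after cleansing, every remaining non-NULL value converts without error,
-- so the cast no longer depends on the order in which predicates are evaluated
select ClientID ,
       cast(StartDate as datetime) as StartDate
from   staging_boarding_appts
```

Because the bad values are physically gone before any conversion runs, this sidesteps the optimizer-reordering problem in the same way the temporary-table workaround in the question does.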
Another approach is to use pattern matching and do the date filtering without converting anything to datetime. ISO 8601 date/time values are character strings that have the laudable property of being (A) human-readable and (B) collating and comparing properly.
What I've done in the past is some analytical work to identify all the patterns in the datetime field by replacing decimal digits with a 'd' and then running group by to compute the counts of each different pattern found. Once you have that you can create some pattern tables to guide you. Something like these:
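The profiling step described above could be sketched roughly as follows, assuming the boarding_appts table and StartDate column from the question. It uses nested REPLACE calls, which are available on SQL Server 2005, to turn every digit into 'd' and then counts each resulting shape.

```sql
-- Replace each decimal digit with 'd', then count how often each shape occurs.
-- e.g. '5/6/210 12:00:00' becomes 'd/d/ddd dd:dd:dd'
select p.pattern ,
       count(*) as occurrences
from ( select replace(replace(replace(replace(replace(
              replace(replace(replace(replace(replace(
                StartDate
              ,'0','d'),'1','d'),'2','d'),'3','d'),'4','d')
              ,'5','d'),'6','d'),'7','d'),'8','d'),'9','d') as pattern
       from boarding_appts
     ) as p
group by p.pattern
order by occurrences desc
```

Rare shapes at the bottom of the result are usually the ones worth inspecting first.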
create table #datePattern
(
pattern varchar(64) not null primary key clustered ,
monPos int not null ,
monLen int not null ,
dayPos int not null ,
dayLen int not null ,
yearPos int not null ,
yearLen int not null ,
)
insert #datePattern values ( '[0-9]/[0-9]/[0-9] %' ,1,1,3,1,5,1)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9] %' ,1,1,3,1,5,2)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9][0-9] %' ,1,1,3,1,5,3)
insert #datePattern values ( '[0-9]/[0-9]/[0-9][0-9][0-9][0-9] %' ,1,1,3,1,5,4)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9] %' ,1,1,3,2,6,1)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9] %' ,1,1,3,2,6,2)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9][0-9] %' ,1,1,3,2,6,3)
insert #datePattern values ( '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %' ,1,1,3,2,6,4)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9] %' ,1,2,4,1,6,1)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9] %' ,1,2,4,1,6,2)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9][0-9] %' ,1,2,4,1,6,3)
insert #datePattern values ( '[0-9][0-9]/[0-9]/[0-9][0-9][0-9][0-9] %' ,1,2,4,1,6,4)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9] %' ,1,2,4,2,7,1)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9] %' ,1,2,4,2,7,2)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9] %' ,1,2,4,2,7,3)
insert #datePattern values ( '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9] %' ,1,2,4,2,7,4)
create table #timePattern
(
pattern varchar(64) not null primary key clustered ,
hhPos int not null ,
hhLen int not null ,
mmPos int not null ,
mmLen int not null ,
ssPos int not null ,
ssLen int not null ,
)
insert #timePattern values ( '[0-9]:[0-9]:[0-9]' ,1,1,3,1,5,1 )
insert #timePattern values ( '[0-9]:[0-9]:[0-9][0-9]' ,1,1,3,1,5,2 )
insert #timePattern values ( '[0-9]:[0-9][0-9]:[0-9]' ,1,1,3,2,6,1 )
insert #timePattern values ( '[0-9]:[0-9][0-9]:[0-9][0-9]' ,1,1,3,2,6,2 )
insert #timePattern values ( '[0-9][0-9]:[0-9]:[0-9]' ,1,2,4,1,6,1 )
insert #timePattern values ( '[0-9][0-9]:[0-9]:[0-9][0-9]' ,1,2,4,1,6,2 )
insert #timePattern values ( '[0-9][0-9]:[0-9][0-9]:[0-9]' ,1,2,4,2,7,1 )
insert #timePattern values ( '[0-9][0-9]:[0-9][0-9]:[0-9][0-9]' ,1,2,4,2,7,2 )
You could combine these two tables into one, which greatly simplifies the query, but the number of combinations tends to explode.
Once you have that, the query is [fairly] easy, given that SQL is not exactly the world's best language choice for string processing:
---------------------------------------------------------------------
-- first, get your lower bound in ISO 8601 format yyyy-mm-dd hh:mm:ss
-- This will compare/collate properly
---------------------------------------------------------------------
declare @dtLowerBound varchar(255)
set @dtLowerBound = convert(varchar,dateadd(year,-1,current_timestamp),121)
-----------------------------------------------------------------
-- select rows with a start date more recent than the lower bound
-----------------------------------------------------------------
select isoDate = right( '0000' + substring( t.startDate , coalesce(dt.yearPos,1) , coalesce(dt.YearLen,0) ) , 4 )
+ '-' + right( '00' + substring( t.startDate , coalesce(dt.monPos,1) , coalesce(dt.MonLen,0) ) , 2 )
+ '-' + right( '00' + substring( t.startDate , coalesce(dt.dayPos,1) , coalesce(dt.dayLen,0) ) , 2 )
+ case
when tm.pattern is not null then
' ' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.hhPos , tm.hhLen ) , 2 )
+ ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.mmPos , tm.mmLen ) , 2 )
+ ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.ssPos , tm.ssLen ) , 2 )
else ''
end
,*
from someTableWithBadData t
left join #datePattern dt on t.startDate like dt.pattern
left join #timePattern tm on ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) )
like tm.pattern
where @dtLowerBound <= right( '0000' + substring( t.startDate , coalesce(dt.yearPos,1) , coalesce(dt.YearLen,0) ) , 4 )
+ '-' + right( '00' + substring( t.startDate , coalesce(dt.monPos,1) , coalesce(dt.MonLen,0) ) , 2 )
+ '-' + right( '00' + substring( t.startDate , coalesce(dt.dayPos,1) , coalesce(dt.dayLen,0) ) , 2 )
+ case
when tm.pattern is not null then
' ' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.hhPos , tm.hhLen ) , 2 )
+ ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.mmPos , tm.mmLen ) , 2 )
+ ':' + right( '00' + substring(ltrim(rtrim( substring(t.startDate,dt.YearPos+dt.YearLen,1+len(t.startDate)-(dt.YearPos+dt.YearLen) ) ) ), tm.ssPos , tm.ssLen ) , 2 )
else ''
end
Like I said, SQL is not the best choice for munging strings.
This should get you ... 90% there. Experience tells me that you'll still find more bad dates: months less than 1 or greater than 12, days less than 1 or greater than 31, or days out of range for that month (nothing like February 31st to make the computer whine), etc. Old COBOL programs, in particular, loved to use a field of all 9s to indicate missing data (though that is an easy case to deal with).
My preferred technique is to write a Perl script to scrub the data and bulk load it into SQL Server, using Perl's BCP facilities. That's exactly the sort of problem space Perl is designed for.
Source: https://stackoverflow.com/questions/16487264/i-still-get-a-arithmetic-overflow-when-i-filter-on-a-cast-datetime-even-if-i-u