How to prevent duplicate records from being inserted with SqlBulkCopy when there is no primary key

I'd upload the data into a staging table, then deal with duplicates afterwards when copying to the final table.

For example, you can create a (non-unique) index on the staging table to deal with the "key".
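
Here's a minimal sketch of that copy step, assuming hypothetical staging and final tables dbo.Staging and dbo.Final that share the columns col1, col2, and col3:

-- index the staging table on the "key" columns to speed up the dedup copy
create index IX_Staging_Key on dbo.Staging ( col1, col2, col3 );

-- copy only distinct staging rows that aren't already in the final table
-- (note: nullable key columns would need explicit "is null" handling)
insert dbo.Final ( col1, col2, col3 )
select distinct s.col1, s.col2, s.col3
from dbo.Staging s
where not exists
(
    select 1
    from dbo.Final f
    where f.col1 = s.col1
      and f.col2 = s.col2
      and f.col3 = s.col3
);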

Given that you're using SQL 2008, you have two options to solve the problem easily without needing to change your application much (if at all).

The first possible solution is to create a second table like the first one, but with a surrogate identity key and a uniqueness constraint added using the ignore_dup_key option, which will do all the heavy lifting of eliminating the duplicates for you.

Here's an example you can run in SSMS to see what's happening:

if object_id( 'tempdb..#test1' ) is not null drop table #test1;
if object_id( 'tempdb..#test2' ) is not null drop table #test2;
go


-- example heap table with duplicate record

create table #test1
(
     col1 int
    ,col2 varchar(50)
    ,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
     ( 250, 'Joe''s IT Consulting and Bait Shop', null )
    ,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
    ,( 250, 'Joe''s IT Consulting and Bait Shop', null )    -- dup record
    ,( 666, 'The Honest Politician', 'LIE' )
    ,( 100, 'My Invisible Friend', 'WHO' )
;
go


-- secondary table for removing duplicates

create table #test2
(
     sk int not null identity primary key
    ,col1 int
    ,col2 varchar(50)
    ,col3 char(3)

    -- add a uniqueness constraint to filter dups
    ,constraint UQ_test2 unique ( col1, col2, col3 ) with ( ignore_dup_key = on )
);
go


-- insert all records from original table
-- this should generate a warning if duplicate records were ignored

insert #test2( col1, col2, col3 )
select col1, col2, col3
from #test1;
go

Alternatively, you can remove the duplicates in place without a second table, but the performance may be too slow for your needs. Here's the code for that example, also runnable in SSMS:

if object_id( 'tempdb..#test1' ) is not null drop table #test1;
go


-- example heap table with duplicate record

create table #test1
(
     col1 int
    ,col2 varchar(50)
    ,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
     ( 250, 'Joe''s IT Consulting and Bait Shop', null )
    ,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
    ,( 250, 'Joe''s IT Consulting and Bait Shop', null )    -- dup record
    ,( 666, 'The Honest Politician', 'LIE' )
    ,( 100, 'My Invisible Friend', 'WHO' )
;
go


-- add temporary PK and index

alter table #test1 add sk int not null identity constraint PK_test1 primary key clustered;
create index IX_test1 on #test1( col1, col2, col3 );
go


-- note: rebuilding the indexes may or may not provide a performance benefit

alter index PK_test1 on #test1 rebuild;
alter index IX_test1 on #test1 rebuild;
go


-- remove duplicates

with ranks as
(
    select
         sk
        ,ordinal = row_number() over 
         ( 
            -- put all the columns composing uniqueness into the partition
            partition by col1, col2, col3
            order by sk
         )
    from #test1
)
delete 
from ranks
where ordinal > 1;
go


-- remove added columns

drop index IX_test1 on #test1;
alter table #test1 drop constraint PK_test1;
alter table #test1 drop column sk;
go

I would bulk copy into a temporary table and then push the data from that into the actual destination table. In this way, you can use SQL to check for and handle duplicates.
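
Since you're on SQL 2008, the push step could also use a merge statement; a sketch, again assuming the same kind of hypothetical dbo.Staging and dbo.Final tables sharing col1, col2, and col3:

-- insert only staging rows with no matching row in the destination
-- (nullable key columns have the same caveat as a not-exists check)
merge dbo.Final as f
using ( select distinct col1, col2, col3 from dbo.Staging ) as s
    on  f.col1 = s.col1
    and f.col2 = s.col2
    and f.col3 = s.col3
when not matched by target then
    insert ( col1, col2, col3 )
    values ( s.col1, s.col2, s.col3 );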

Why not, instead of a Primary Key, simply create an Index and set

Ignore Duplicate Keys: YES

This will prevent any duplicate key from firing an error; the duplicate row simply won't be inserted (as it exists already).

I use this method to insert around 120,000 rows per day and it works flawlessly.
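
In T-SQL the equivalent of that SSMS setting is a unique index created with the ignore_dup_key option; a sketch, assuming a hypothetical destination table dbo.Orders keyed on col1, col2, and col3:

-- duplicate inserts raise a warning instead of an error and are discarded
create unique index IX_Orders_NoDups
    on dbo.Orders ( col1, col2, col3 )
    with ( ignore_dup_key = on );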

What is the data volume? You have 2 options that I can see:

1: filter it at source, by implementing your own IDataReader and using some hash over the data, and simply skipping any duplicates so that they never get passed into the TDS (see the sketch after this list).

2: filter it in the DB; at the simplest level, I guess you could have multiple stages of import - the raw, unsanitised data - and then copy the DISTINCT data into your actual tables, perhaps using an intermediate table if you want to. You might want to use CHECKSUM for some of this, but it depends.
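
Here's a rough sketch of option 1's idea without the boilerplate of a full IDataReader implementation: a hypothetical FilterDuplicates helper builds a hash key over the uniqueness columns of each incoming row and skips rows already seen, so duplicates never reach SqlBulkCopy:

using System;
using System.Collections.Generic;
using System.Data;

static DataTable FilterDuplicates(DataTable source, params string[] keyColumns)
{
    var seen = new HashSet<string>();
    var result = source.Clone();   // same schema, no rows

    foreach (DataRow row in source.Rows)
    {
        // composite key over the uniqueness columns; the '\0' separator
        // avoids accidental collisions between concatenated values
        var key = string.Join("\0",
            Array.ConvertAll(keyColumns, c => Convert.ToString(row[c])));

        if (seen.Add(key))         // Add returns false for a repeat key
            result.ImportRow(row);
    }
    return result;
}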

John

I think this is a lot cleaner.

var dtColumns = new string[] { "Col1", "Col2", "Col3" };

// DataView.ToTable(true, ...) returns a copy containing only the distinct rows
var dtDistinct = dt.DefaultView.ToTable(true, dtColumns);

using (var cn = new SqlConnection(connectionString))
using (var copy = new SqlBulkCopy(cn))
{
    cn.Open();
    copy.ColumnMappings.Add(0, 0);
    copy.ColumnMappings.Add(1, 1);
    copy.ColumnMappings.Add(2, 2);
    copy.DestinationTableName = "TableNameToMapTo";
    copy.WriteToServer(dtDistinct);
}

This way you only need one database table and can keep the business logic in code.

And fix that table. No table should ever be without a unique index, preferably as a PK. Even if you add a surrogate key because there is no natural key, you need to be able to specifically identify a particular record. Otherwise, how will you get rid of the duplicates you already have?
