Delete duplicates from large dataset (>100Mio rows)

悲哀的现实 2021-01-01 02:21

I know that this topic has come up many times before here, but none of the suggested solutions worked for my dataset, because my laptop stopped calculating due to memory issues or …

2 Answers
  •  心在旅途
    2021-01-01 02:56

    In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.

    Here is the idea, using SQL Server syntax:

    -- copy the de-duplicated rows into a temporary table
    select distinct t.*
    into #temptable
    from t;
    
    -- empty the original table
    truncate table t;
    
    -- reload the de-duplicated rows
    insert into t
        select tt.*
        from #temptable tt;
    

    Of course, this depends to a large extent on how fast the first step is, and you need enough space to store two copies of the same table.

    Note that the syntax for creating the temporary table differs among databases. Some use the syntax of create table ... as rather than select into.
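
    For example, in PostgreSQL the first step could be written with create table ... as instead of select into (a minimal sketch, assuming the same one-table setup; temptable is just an illustrative name):

    create temporary table temptable as
        select distinct *
        from t;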

    EDIT:

    Your identity insert error is troublesome. I think you need to remove the identity column from the list of columns for the distinct. Or do:

    select min(<identity column>), <remaining columns>
    from t
    group by <remaining columns>;
    

    If you have an identity column, then there are no duplicates (by definition).

    In the end, you will need to decide which id you want for each row. If you can generate new ids, then just leave the identity column out of the column list for the insert:

    insert into t(<columns other than the identity>)
        select <columns other than the identity>
        from #temptable;
    

    If you need the old identity value (and the minimum will do), enable explicit identity inserts (set identity_insert t on in SQL Server) and do:

    insert into t(<identity column>, <other columns>)
        select <identity column>, <other columns>
        from #temptable;
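
    Putting the pieces together, here is a minimal end-to-end sketch in SQL Server syntax. The table orders, the identity column id, and the payload columns customer_id and amount are hypothetical names used only for illustration:

    -- keep the smallest identity value per duplicate group
    -- (orders, id, customer_id, amount are hypothetical names)
    select min(o.id) as id, o.customer_id, o.amount
    into #dedup
    from orders o
    group by o.customer_id, o.amount;
    
    truncate table orders;
    
    -- allow explicit values in the identity column while reloading
    set identity_insert orders on;
    
    insert into orders(id, customer_id, amount)
        select d.id, d.customer_id, d.amount
        from #dedup d;
    
    set identity_insert orders off;

    After the reload, you may also want to reseed the identity (for example with dbcc checkident) so that newly generated ids do not collide with the reinserted ones.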
    
