Removing duplicate email address based on the lowest id in mysql

问题

I have a table called emaildata consisting of 4 columns emailaddress, domainname, data and id.

The emailaddress column should contain only unique entries, but there are many duplicates. The domainname and data column are not unique, and as such will contain duplicates which is fine. The id column is set to autoincrement so will contain only unique values.

My question is how do I get rid of all rows that feature duplicate email addresses, keeping the one with the lowest id?

There should be around 370,000 rows, but currently I've got 906,000.

I had an SQL statement before which I used for a similar table and I've tried to adapt it to this without success.

delete T1
from emaildata T1, emaildata T2
where T1.emailaddress = T2.emailaddress
and T1.id > T2.id

The above was based upon the following which was applied to another table and worked fine.

delete T1
from email_list_subscribers T1, email_list_subscribers T2
where T1.emailaddress = T2.emailaddress
and T1.subscriberid > T2.subscriberid

I've tried running this against my table on the remote server in phpmyadmin and after pressing the GO button, the loading bar comes up in the middle, then disappears as if it is processing - but it never does.

I've tried repeating this against the same table running on my home server (XAMPP) via phpmyadmin, and again with HeidiSQL - the same problem with phpmyadmin and Heidi appears to crash.

I've tried other soloutions that i've seen on here but I seem to be getting the same "timeout" / crash problem. I never had issues with the original statement running on the remote server, granted this was against a database a third of the size.

Any info would be appreciated.

回答1:

Your query appears to be correct. Your issue seems to be a performance issue, not a logic issue. You'll need to make sure that both your emailaddress and id fields are properly indexed in the database - otherwise with close to a million rows, I would expect your query to hang.

(I would guess that id is probably already indexed, but not emailaddress. Especially with doing a join between tables, if either one of these fields is not indexed, you're going to be looking at a LOT of full table scans.)

Edit:

Seeing your comment that this is the case, you can follow the documentation at http://dev.mysql.com/doc/refman/5.0/en/create-index.html for creating indices. So something like:

CREATE INDEX email_index ON emaildata(emailaddress) USING BTREE;

回答2:

Never tryed to see if array_unique (php function) ever modifies the key but here is how u can do...

select id and email and store them in an array like id => email

after use array_unique to get a new array with the first id for each repetiteve group...thats how function works... and after comapre the 2 arrays and delete the remaining ids from ur table...

this way u get the first id from repetitive groups and unique values

来源：https://stackoverflow.com/questions/8680477/removing-duplicate-email-address-based-on-the-lowest-id-in-mysql

标签

mysql

sql

phpMyAdmin

duplicate-removal