问题
I'm trying to remove duplicate rows from a table with millions of rows. The field I want to check for duplicates on is too long (it's storing URLs) to put a UNIQUE index on. Is there any way to remove duplicates quickly?
The recommended method for removing duplicates:
DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.id>t2.id AND t1.name=t2.name;
Never seems to finish its job, though I suppose it might just require a lot of time to do.
One idea I've heard here is to create an MD5 hash column for indexing and comparison. Is this the recommended route? If so, should I be truncating this column for space/speed considerations?
回答1:
The hash would give you a column you could put and index on so then t1.Name = t2.Name would be a far less expensive t1.Hash = t2.Hash. Adding the hash to 1,000,000 records would take a while though.
Another option if this is a one off would be to page the clear out e.g. something like
Where T1 >= 0 and T1 < 10000
Where T1 >= 10001 and T1 < 20000
etc.
回答2:
I may not have thought this through, but it's worth a try:
- Create a column,
md5url, and fill it with the md5 of the url (UPDATE table1 SET md5url = MD5(url) - Make a (non-unique) index on that column
md5url- md5 should be short enough to do so change your statement to:
DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.md5url = t2.md5url AND t1.name=t2.name AND t1.id>t2.id;
That way, the JOIN condition works primarily on an index. If that indexed column, md5url, fits, then we actually check the URL - because, depending on how many urls you have, it becomes possible at some point, that two URLs may have the same MD5. The third condition is clear - ensuring that only one of the two identical columns is deleted.
I would love to hear whether this works - it makes perfect sense in my head right now ;-)
来源:https://stackoverflow.com/questions/10746220/removing-duplicates-from-mysql-on-a-column-thats-too-big-to-index