Removing duplicates from MySQL on a column that's too big to INDEX

大城市里の小女人 submitted on 2019-12-25 03:44:19

Question


I'm trying to remove duplicate rows from a table with millions of rows. The field I want to check for duplicates on is too long (it's storing URLs) to put a UNIQUE index on. Is there any way to remove duplicates quickly?

The recommended method for removing duplicates:

DELETE t1 FROM table1 AS t1 JOIN table1 AS t2 ON t1.id>t2.id AND t1.name=t2.name;

This never seems to finish, though I suppose it might just need a lot of time.

One idea I've heard here is to create an MD5 hash column for indexing and comparison. Is this the recommended route? If so, should I be truncating this column for space/speed considerations?


Answer 1:


The hash would give you a column you could put an index on, so the expensive comparison t1.Name = t2.Name becomes a far cheaper t1.Hash = t2.Hash. Adding the hash to 1,000,000 records would take a while, though.
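A minimal sketch of that idea in MySQL, assuming the table1(id, name) layout from the question. The column name name_hash and the index name idx_table1_name_hash are hypothetical, chosen only for illustration.

```sql
-- Hypothetical column to hold the hash; MD5() returns a 32-character hex string.
ALTER TABLE table1 ADD COLUMN name_hash CHAR(32);

-- Fill the hash for every row (this is the slow, one-time step).
UPDATE table1 SET name_hash = MD5(name);

-- A plain (non-unique) index, since duplicates still exist at this point.
CREATE INDEX idx_table1_name_hash ON table1 (name_hash);

-- Same DELETE as in the question, but joining on the short indexed hash.
-- For each group of equal hashes, only the row with the lowest id survives.
DELETE t1
FROM table1 AS t1
JOIN table1 AS t2
  ON t1.name_hash = t2.name_hash
 AND t1.id > t2.id;
```

Because this version compares hashes only, two different values that happened to share an MD5 would be treated as duplicates.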

Another option, if this is a one-off, would be to run the clean-up in batches by id range, e.g. adding something like the following to the DELETE on successive runs:

WHERE t1.id >= 0 AND t1.id < 10000
WHERE t1.id >= 10000 AND t1.id < 20000

etc.
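As a rough sketch, one batch of such a paged clean-up could look like this in MySQL, assuming the table1(id, name) layout from the question and a batch size of 10,000 ids (the batch size is an arbitrary choice here):

```sql
-- Batch 1: only rows with id in [0, 10000) are candidates for deletion.
-- t2 is not restricted, so a row is still removed if its lower-id
-- duplicate lives in a range that was handled in another batch.
DELETE t1
FROM table1 AS t1
JOIN table1 AS t2
  ON t1.name = t2.name
 AND t1.id > t2.id
WHERE t1.id >= 0 AND t1.id < 10000;

-- Batch 2: same statement, next id range.
DELETE t1
FROM table1 AS t1
JOIN table1 AS t2
  ON t1.name = t2.name
 AND t1.id > t2.id
WHERE t1.id >= 10000 AND t1.id < 20000;
```

Each statement touches fewer rows, so it finishes sooner and progress can be checked between runs. The comparison on name is still not indexed, so every batch still pays for it.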




Answer 2:


I may not have thought this through, but it's worth a try:

  • Create a column, md5url, and fill it with the MD5 of the URL: UPDATE table1 SET md5url = MD5(url)
  • Create a (non-unique) index on md5url; an MD5 hash is short enough to index
  • Change your statement to:

    DELETE t1 
    FROM table1 AS t1 
         JOIN table1 AS t2 
              ON t1.md5url = t2.md5url 
                 AND t1.name=t2.name 
                 AND t1.id>t2.id;
    

That way, the JOIN condition works primarily on an index. Only if the indexed md5url values match do we compare the full value itself (the URL), because with enough URLs it becomes possible that two different URLs have the same MD5. The third condition ensures that only one of two identical rows is deleted: the one with the higher id.
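Before running the DELETE, the same join can be used as a dry run to see how many rows it would remove. This is only a sketch, and it assumes the md5url column and its index already exist as described in the steps above; the alias rows_to_delete is made up for illustration.

```sql
-- Count the distinct rows that have at least one lower-id duplicate.
-- COUNT(DISTINCT ...) is needed because a row with several lower-id
-- duplicates matches the join more than once.
SELECT COUNT(DISTINCT t1.id) AS rows_to_delete
FROM table1 AS t1
JOIN table1 AS t2
  ON t1.md5url = t2.md5url
 AND t1.name = t2.name
 AND t1.id > t2.id;
```

If this query itself returns quickly, that is a reasonable hint that the indexed join is doing its job.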

I would love to hear whether this works - it makes perfect sense in my head right now ;-)



Source: https://stackoverflow.com/questions/10746220/removing-duplicates-from-mysql-on-a-column-thats-too-big-to-index
