What is the fastest procedure to remove duplicates from a big table in MySQL

Question

I have a table in MySQL (50 million rows), and new data keeps being inserted periodically.

This table has the following structure:

CREATE TABLE `values` (
    id double NOT NULL AUTO_INCREMENT,
    channel_id int(11) NOT NULL,
    val text NOT NULL,
    date_time datetime NOT NULL,
    PRIMARY KEY (id),
    KEY channel_date_index (channel_id,date_time)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

No two rows may share the same channel_id and date_time, but if such an insert occurs, it is important to keep the newest value.

Is there a way to check for duplicates in real time before the insert, or should I keep inserting all data and check for duplicates periodically in a separate job?

Real-time speed is important here, because 100 inserts occur per second.

Answer 1:

To prevent future duplicates:

Change KEY channel_date_index (channel_id,date_time) to UNIQUE (channel_id,date_time).
Change the INSERT to INSERT ... ON DUPLICATE KEY UPDATE ... so that the stored val is overwritten when that (channel_id, date_time) pair already exists.
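
The two changes above could look roughly like the following sketch. Because channel_id and date_time together form the unique key, the column that gets overwritten on a collision is val. The index name channel_date_unique and the sample row are made up for illustration, and the ALTER only succeeds once the table no longer contains duplicate pairs.

```sql
-- Swap the non-unique index for a unique one.
-- channel_date_unique is a hypothetical name; this fails while duplicates still exist.
ALTER TABLE `values`
    DROP INDEX channel_date_index,
    ADD UNIQUE KEY channel_date_unique (channel_id, date_time);

-- Insert a sample row (illustrative values only).
-- If the (channel_id, date_time) pair is already present, overwrite val with the new one.
INSERT INTO `values` (channel_id, val, date_time)
VALUES (7, '21.5', '2015-03-30 12:00:00')
ON DUPLICATE KEY UPDATE val = VALUES(val);
```

With this in place no separate lookup is needed before each insert: the unique index performs the duplicate check as part of the write. VALUES(val) refers to the value the INSERT tried to store. Newer MySQL 8.0 releases deprecate that function in favor of a row alias, so treat the exact spelling as version-dependent.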

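To fix the existing table, you could do ALTER IGNORE TABLE ... ADD UNIQUE(...). However, that keeps whichever duplicate row it encounters first, which is not necessarily the newest value. Note also that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only works on older versions.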
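
If the newest value must survive, one possible alternative is to delete the older duplicates explicitly before adding the unique index. The sketch below assumes that a higher id means a more recent insert, which the question implies but does not state. The aliases older and newer are illustrative.

```sql
-- Optional check: list the (channel_id, date_time) pairs that occur more than once.
SELECT channel_id, date_time, COUNT(*) AS copies
FROM `values`
GROUP BY channel_id, date_time
HAVING COUNT(*) > 1;

-- Delete every row for which another row with the same pair and a higher id exists,
-- leaving only the most recently inserted row per (channel_id, date_time).
DELETE older
FROM `values` AS older
JOIN `values` AS newer
  ON  newer.channel_id = older.channel_id
  AND newer.date_time  = older.date_time
  AND newer.id         > older.id;
```

On a 50-million-row table a single DELETE like this can become a very large transaction. Since multi-table DELETE does not accept LIMIT, it is probably safer to run it in slices, for example by adding a WHERE condition on a range of older.channel_id values and repeating it per range.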

For minimum downtime (not maximum speed), use pt-online-schema-change.

Source: https://stackoverflow.com/questions/29350826/what-is-the-fastest-procedure-to-remove-duplicates-from-a-big-table-in-mysql

Tags

mysql

insert

bigdata
