Deleting millions of rows in MySQL

做~自己de王妃 submitted on 2019-11-28 03:22:38
DELETE FROM `table`
WHERE (whatever criteria)
ORDER BY `id`
LIMIT 1000

Wash, rinse, repeat until zero rows affected. Maybe in a script that sleeps for a second or three between iterations.

I'd also recommend adding some constraints to your table to make sure that this doesn't happen to you again. A million rows, at 1000 per shot, will take 1000 repetitions of a script to complete. If the script runs once every 3.6 seconds you'll be done in an hour. No worries. Your clients are unlikely to notice.
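The "repeat until zero rows affected" loop can be sketched roughly as below. This is only an illustration, not a tested production script: the `created_at` filter in the SQL string is a made-up stand-in for "whatever criteria", and `run_batch` is a hypothetical callable that you would wire to your own MySQL driver.

```python
import time
from typing import Callable

# Hypothetical statement: the WHERE clause and the batch size are placeholders.
DELETE_BATCH_SQL = (
    "DELETE FROM `table` WHERE created_at < '2019-01-01' "
    "ORDER BY `id` LIMIT 1000"
)


def delete_in_batches(
    run_batch: Callable[[], int],
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_batch until it reports zero affected rows; return the total."""
    total = 0
    while True:
        affected = run_batch()  # number of rows removed by one DELETE ... LIMIT
        if affected == 0:
            return total
        total += affected
        sleep(pause_seconds)  # leave room for other queries between batches
```

With a DB-API style driver, `run_batch` would presumably execute `DELETE_BATCH_SQL` on a cursor, commit, and return `cursor.rowcount`. Passing `sleep` in as a parameter is just a convenience so the loop can be exercised without actually waiting.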

The following deletes 1,000,000 records, one at a time.

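 # 1000 passes; each pass fetches the 1000 highest ids and deletes them one by one.
 for i in $(seq 1 1000); do
     # -N suppresses the column-name header, so every output line is an id.
     mysql -N -e "SELECT id FROM table_name WHERE (condition) ORDER BY id DESC LIMIT 1000" \
       | awk '{ print "DELETE FROM table_name WHERE id = " $1 ";" }' \
       | mysql
 done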

You could also group them together and run DELETE FROM table_name WHERE id IN (id1, id2, ..., idN) without much difficulty, I'm sure.

I had to delete 1M+ rows from a MySQL table of 25M+ rows. I tried different approaches, such as the batch deletes described above.
The fastest way I found was to copy the required records to a new table:

  1. Create a temporary table that holds just the ids.

CREATE TABLE id_temp_table ( temp_id int);

  2. Insert the ids that should be removed:

insert into id_temp_table (temp_id) select.....

  3. Create the new table table_new.

  4. Insert all records from table into table_new, leaving out the unnecessary rows listed in id_temp_table:

insert into table_new .... where table_id NOT IN (select distinct(temp_id) from id_temp_table);

  5. Rename the tables.

The whole process took ~1hr. In my use case, a simple batch delete of 100 records took 10 minutes.
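The copy-and-rename steps above can be sketched end to end as follows. Treat it as an illustration under assumptions: it uses Python's built-in sqlite3 as a stand-in so it can run anywhere, the source table name `big_table` and the two-column schema are hypothetical, and on MySQL the swap would be done with RENAME TABLE rather than ALTER TABLE ... RENAME TO.

```python
import sqlite3
from typing import Iterable


def rebuild_without(conn: sqlite3.Connection, doomed_ids: Iterable[int]) -> None:
    """Copy the rows to keep into table_new, then swap it in for big_table."""
    cur = conn.cursor()
    # Side table that holds only the ids to remove.
    cur.execute("CREATE TABLE id_temp_table (temp_id INTEGER)")
    cur.executemany(
        "INSERT INTO id_temp_table (temp_id) VALUES (?)",
        [(i,) for i in doomed_ids],
    )
    # Empty copy of the table (hypothetical schema, for illustration only).
    cur.execute("CREATE TABLE table_new (table_id INTEGER PRIMARY KEY, payload TEXT)")
    # Copy everything except the rows listed in id_temp_table.
    cur.execute(
        "INSERT INTO table_new SELECT * FROM big_table "
        "WHERE table_id NOT IN (SELECT temp_id FROM id_temp_table)"
    )
    # Swap the tables; the old data stays around as table_old until dropped.
    cur.execute("ALTER TABLE big_table RENAME TO table_old")
    cur.execute("ALTER TABLE table_new RENAME TO big_table")
    cur.execute("DROP TABLE id_temp_table")
    conn.commit()
```

Keeping `table_old` around until the result has been checked is a deliberate choice in this sketch; dropping it is the last, separate step.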

I'd use mk-archiver from the excellent Maatkit utilities package (a bunch of Perl scripts for MySQL management). Maatkit is from Baron Schwartz, a co-author of the O'Reilly "High Performance MySQL" book.

The goal is a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much. You can insert the data into another table, which need not be on the same server. You can also write it to a file in a format suitable for LOAD DATA INFILE. Or you can do neither, in which case it's just an incremental DELETE.

It's already built for archiving your unwanted rows in small batches and as a bonus, it can save the deleted rows to a file in case you screw up the query that selects the rows to remove.

No installation required, just grab http://www.maatkit.org/get/mk-archiver and run perldoc on it (or read the web site) for documentation.

I faced a similar problem. We had a really big table, about 500 GB in size, with no partitioning and only one index, on the primary_key column. Our master was a hulk of a machine, with 128 cores and 512 GB of RAM, and we had multiple slaves too. We tried a few techniques to tackle the large-scale deletion of rows. I will list them all here, from worst to best:

  1. Fetching and deleting one row at a time. This is the absolute worst you could do, so we did not even try it.
  2. Fetching the first 'X' rows with a LIMIT query on the primary_key column, checking in the application which row ids to delete, and firing a single DELETE query with a list of primary_key ids. So, 2 queries per 'X' rows. This approach worked, but as a batch job it deleted about 5 million rows in 10 minutes or so, which left the slaves of our MySQL DB lagging by 105 seconds. That is a 105-second lag from a 10-minute activity, so we had to stop.
  3. The same technique, but with a 50 ms pause between successive fetch-and-delete batches of size 'X' each. This solved the lag problem, but we were now deleting 1.2-1.3 million rows per 10 minutes, compared to 5 million with technique #2.
  4. Partitioning the database table and then dropping entire partitions when they are no longer needed. This is the best solution we have, but it requires a pre-partitioned table. We followed step 3 because we had a very old, non-partitioned table indexed only on the primary_key column. Creating partitions would have taken too much time, and we were in crisis mode. Some resources on partitioning that I found helpful: Official MySQL Reference, Oracle DB daily partitioning.

So, IMO, if you can afford the luxury of partitioning your table, go for option #4; otherwise, you are stuck with option #3.
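For what it's worth, technique #3 (two queries per batch, with a short pause in between) might look something like the sketch below. It is not the code we ran: it uses Python's built-in sqlite3 as a stand-in for MySQL, the table name `big_table` and the `should_delete` callback are hypothetical, and it assumes positive integer primary keys so that it can resume each fetch after the last id it saw.

```python
import sqlite3
import time
from typing import Callable


def purge_in_keyed_batches(
    conn: sqlite3.Connection,
    should_delete: Callable[[int], bool],
    batch_size: int = 500,
    pause_seconds: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Walk the primary key in batches and delete the ids the application flags."""
    last_id, deleted = 0, 0
    while True:
        # Query 1: the next batch of ids, resuming after the last id already seen.
        rows = conn.execute(
            "SELECT primary_key FROM big_table WHERE primary_key > ? "
            "ORDER BY primary_key LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return deleted
        ids = [row[0] for row in rows]
        last_id = ids[-1]
        doomed = [i for i in ids if should_delete(i)]  # decided in the application
        if doomed:
            # Query 2: a single DELETE carrying the whole id list.
            placeholders = ",".join("?" * len(doomed))
            conn.execute(
                f"DELETE FROM big_table WHERE primary_key IN ({placeholders})",
                doomed,
            )
            conn.commit()
            deleted += len(doomed)
        sleep(pause_seconds)  # the 50 ms pause that kept replica lag down
```

Resuming from `last_id` rather than re-reading the first 'X' rows matters here, because rows that are kept would otherwise be fetched again on every pass.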

Do it in batches of, let's say, 2000 rows at a time. Commit in between. A million rows isn't that much, and this will be fast unless you have many indexes on the table.

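According to the MySQL documentation, TRUNCATE TABLE is a fast alternative to DELETE FROM. Note that it removes every row in the table, so it only helps if you want to empty the table completely. Try this: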

TRUNCATE TABLE table_name

I tried this on 50M rows and it was done within two mins.

Note: TRUNCATE operations are not transaction-safe; an error occurs when attempting one in the course of an active transaction or active table lock.

For us, the DELETE WHERE %s ORDER BY %s LIMIT %d answer was not an option, because the WHERE criterion was slow (a non-indexed column) and would hit the master.

SELECT from a read-replica a list of the primary keys that you wish to delete. Export them in this format, one per line:

00669163-4514-4B50-B6E9-50BA232CA5EB
00679DE5-7659-4CD4-A919-6426A2831F35

Use the following bash script to grab this input and chunk it into DELETE statements [requires bash ≥ 4 because of mapfile built-in]:

sql-chunker.sh (remember to chmod +x me, and change the shebang to point to your bash 4 executable):

#!/usr/local/Cellar/bash/4.4.12/bin/bash

# Expected input format:
: <<!
00669163-4514-4B50-B6E9-50BA232CA5EB
00669DE5-7659-4CD4-A919-6426A2831F35
!

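# Abort early if either argument is missing, instead of carrying on without it.
if [ -z "$1" ]
  then
    echo "No chunk size supplied. Invoke: ./sql-chunker.sh 1000 ids.txt" >&2
    exit 1
fi

if [ -z "$2" ]
  then
    echo "No file supplied. Invoke: ./sql-chunker.sh 1000 ids.txt" >&2
    exit 1
fi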

function join_by {
    local d=$1
    shift
    echo -n "$1"
    shift
    printf "%s" "${@/#/$d}"
}

# Read up to $1 ids at a time and emit one DELETE statement per chunk.
while mapfile -t -n "$1" ary && ((${#ary[@]})); do
    printf "DELETE FROM my_cool_table WHERE id IN ('%s');\n" "$(join_by "','" "${ary[@]}")"
done < "$2"

Invoke like so:

./sql-chunker.sh 1000 ids.txt > batch_1000.sql

This will give you a file with output formatted like so (I've used a batch size of 2):

DELETE FROM my_cool_table WHERE id IN ('006CC671-655A-432E-9164-D3C64191EDCE','006CD163-794A-4C3E-8206-D05D1A5EE01E');
DELETE FROM my_cool_table WHERE id IN ('006CD837-F1AD-4CCA-82A4-74356580CEBC','006CDA35-F132-4F2C-8054-0F1D6709388A');

Then execute the statements like so:

mysql --login-path=master billing < batch_1000.sql

For those unfamiliar with login-path, it's just a shortcut for logging in without typing the password on the command line.
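If bash 4 isn't available, the same chunking can be sketched in Python. This is an assumed equivalent rather than the script above: the function names are made up for illustration, and the UUID check is an extra precaution I've added because the ids are pasted directly into the SQL text.

```python
import re
from typing import Iterable, Iterator, List

# Accepts ids shaped like 00669163-4514-4B50-B6E9-50BA232CA5EB.
UUID_RE = re.compile(r"[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}")


def render_delete(table: str, chunk: List[str]) -> str:
    """Build one DELETE statement for a chunk of already-validated ids."""
    quoted = ",".join(f"'{value}'" for value in chunk)
    return f"DELETE FROM {table} WHERE id IN ({quoted});"


def delete_statements(
    ids: Iterable[str], chunk_size: int, table: str = "my_cool_table"
) -> Iterator[str]:
    """Yield one DELETE ... IN (...) statement per chunk_size ids."""
    chunk: List[str] = []
    for raw in ids:
        value = raw.strip()
        if not value:
            continue  # skip blank lines in the export
        if not UUID_RE.fullmatch(value):
            raise ValueError(f"not a UUID: {value!r}")
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield render_delete(table, chunk)
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield render_delete(table, chunk)
```

Since a file object iterates line by line, something like `delete_statements(open("ids.txt"), 1000)` should produce the same kind of output as the shell script, ready to be written to batch_1000.sql.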

I think the slowness is due to MySQL's "clustered index", where the actual records are stored within the primary key index, in the order of the primary key. This means access to a record via the primary key is extremely fast: it requires only one disk fetch, because the record sits on disk right where the matching primary key is found in the index.

In other databases without clustered indexes the index itself does not hold the record but just an "offset" or "location" indicating where the record is located in the table file and then a second fetch must be made in that file to retrieve the actual data.

You can imagine when deleting a record in a clustered index that all records above that record in the table must be moved downwards to avoid massive holes being created in the index (well that is what I recall from a few years ago at least - later versions may have changed this).

Knowing the above, what we found really sped up deletes in MySQL was performing them in reverse order. This produces the least amount of record movement, because you are deleting records from the end first, meaning that subsequent deletes have fewer objects to relocate.

I have not scripted anything to do this, and doing it properly would absolutely require a script, but another option is to create a new, duplicate table and select all the rows you want to keep into it. Use a trigger to keep it up to date while this process completes. When it is in sync (minus the rows you want to drop), rename both tables in a single RENAME TABLE statement, which MySQL performs atomically, so that the new one takes the place of the old. Drop the old table, and voila!

This (obviously) requires a lot of extra disk space, and may tax your I/O resources, but otherwise, can be much faster.

Depending on the nature of the data, or in an emergency, you could rename the old table, create a new, empty table in its place, and select the "keep" rows into the new table at your leisure...
