I recently found and fixed a bug in a site I was working on that resulted in millions of duplicate rows of data in a table that will be quite large even without them. One low-impact way to remove them is to delete in batches:
DELETE FROM `table`
WHERE (whatever criteria)
ORDER BY `id`
LIMIT 1000
Wash, rinse, repeat until zero rows affected. Maybe in a script that sleeps for a second or three between iterations.
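A minimal sketch of such a script, assuming the mysql command-line client; the table, criteria, and database name below are placeholders:
while true; do
    # ROW_COUNT() reports how many rows the preceding DELETE removed
    affected=$(mysql -N -e "DELETE FROM \`table\` WHERE (whatever criteria) ORDER BY \`id\` LIMIT 1000;
                            SELECT ROW_COUNT();" my_database)
    [ "$affected" -eq 0 ] && break
    sleep 2   # pause between batches so other queries and replication keep up
done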
For us, the DELETE ... WHERE %s ORDER BY %s LIMIT %d answer was not an option, because the WHERE criteria were slow (a non-indexed column) and would hit the master.
SELECT from a read-replica a list of primary keys that you wish to delete. Export with this kind of format:
00669163-4514-4B50-B6E9-50BA232CA5EB
00679DE5-7659-4CD4-A919-6426A2831F35
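One way to produce such a file (a sketch; the login path, table, and criteria are placeholders) is to run the SELECT in batch mode with column headers suppressed and redirect the output:
mysql --login-path=replica -N -e \
    "SELECT id FROM my_cool_table WHERE (whatever criteria)" > ids.txt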
Use the following bash script, sql-chunker.sh, to grab this input and chunk it into DELETE statements (it requires bash ≥ 4 because of the mapfile built-in). Remember to chmod +x it, and change the shebang to point to your bash 4 executable:
#!/usr/local/Cellar/bash/4.4.12/bin/bash

# Expected input format:
: <<!
00669163-4514-4B50-B6E9-50BA232CA5EB
00669DE5-7659-4CD4-A919-6426A2831F35
!

if [ -z "$1" ]
then
    echo "No chunk size supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
    exit 1
fi

if [ -z "$2" ]
then
    echo "No file supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
    exit 1
fi

# Join the remaining arguments with the given delimiter.
function join_by {
    local d=$1
    shift
    echo -n "$1"
    shift
    printf "%s" "${@/#/$d}"
}

# Read $1 ids at a time from file $2 and emit one DELETE statement per chunk.
while mapfile -t -n "$1" ary && ((${#ary[@]})); do
    printf "DELETE FROM my_cool_table WHERE id IN ('%s');\n" "$(join_by "','" "${ary[@]}")"
done < "$2"
Invoke like so:
./sql-chunker.sh 1000 ids.txt > batch_1000.sql
This will give you a file with output formatted like so (I've used a batch size of 2):
DELETE FROM my_cool_table WHERE id IN ('006CC671-655A-432E-9164-D3C64191EDCE','006CD163-794A-4C3E-8206-D05D1A5EE01E');
DELETE FROM my_cool_table WHERE id IN ('006CD837-F1AD-4CCA-82A4-74356580CEBC','006CDA35-F132-4F2C-8054-0F1D6709388A');
Then execute the statements like so:
mysql --login-path=master billing < batch_1000.sql
For those unfamiliar with --login-path, it's just a shortcut that lets you log in without typing the password on the command line.
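If you haven't set one up yet, a login path is stored with mysql_config_editor, roughly like this (the host and user here are placeholders; it will prompt for the password):
mysql_config_editor set --login-path=master --host=master.example.com --user=admin --password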
I'd also recommend adding some constraints to your table to make sure that this doesn't happen to you again. A million rows, at 1,000 per shot, will take 1,000 repetitions of the script to complete. If the script runs once every 3.6 seconds, you'll be done in an hour. No worries. Your clients are unlikely to notice.
Here's the recommended practice:
rows_affected = 0
do {
rows_affected = do_query(
"DELETE FROM messages WHERE created < DATE_SUB(NOW(),INTERVAL 3 MONTH)
LIMIT 10000"
)
} while rows_affected > 0
Deleting 10,000 rows at a time is typically a large enough task to make each query efficient, and a short enough task to minimize the impact on the server (transactional storage engines might benefit from smaller transactions). It might also be a good idea to add some sleep time between the DELETE statements to spread the load over time and reduce the amount of time locks are held.
Reference: High Performance MySQL
The following deletes 1,000,000 records, one at a time:
for i in $(seq 1 1000); do
    # NR>1 skips the header row of mysql's tab-separated batch output
    mysql -e "SELECT id FROM table_name WHERE (condition) ORDER BY id DESC LIMIT 1000" \
        | awk 'NR>1 {print "DELETE FROM table_name WHERE id = " $1 ";"}' \
        | mysql
done
You could also group them together and do DELETE FROM table_name WHERE id IN (id1, id2, ..., idN) without much difficulty, I'm sure.
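As a sketch of that (same placeholder table and condition as above, and assuming numeric ids that don't need quoting), each batch of selected ids can be joined into a single IN list:
mysql -N -e "SELECT id FROM table_name WHERE (condition) ORDER BY id DESC LIMIT 1000" \
    | paste -sd, - \
    | awk '{print "DELETE FROM table_name WHERE id IN (" $0 ");"}' \
    | mysql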
I'd use mk-archiver from the excellent Maatkit utilities package (a bunch of Perl scripts for MySQL management). Maatkit is from Baron Schwartz, the author of the O'Reilly book High Performance MySQL.
The goal is a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much. You can insert the data into another table, which need not be on the same server. You can also write it to a file in a format suitable for LOAD DATA INFILE. Or you can do neither, in which case it's just an incremental DELETE.
It's already built for archiving your unwanted rows in small batches and as a bonus, it can save the deleted rows to a file in case you screw up the query that selects the rows to remove.
No installation required, just grab http://www.maatkit.org/get/mk-archiver and run perldoc on it (or read the web site) for documentation.
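As a very rough example of what a purge run might look like (the connection details, table, and WHERE clause below are placeholders; check perldoc mk-archiver for the exact options):
mk-archiver --source h=localhost,D=billing,t=my_cool_table \
    --where "created < NOW() - INTERVAL 3 MONTH" \
    --limit 1000 --commit-each --purge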