Deleting millions of rows in MySQL

春和景丽 2020-12-02 07:05

I recently found and fixed a bug in a site I was working on that resulted in millions of duplicate rows of data in a table that will be quite large even without them (still in the millions).

14 answers
  • 2020-12-02 07:35

    I faced a similar problem. We had a really big table, about 500 GB in size, with no partitioning and only one index, on the primary_key column. Our master was a hulk of a machine, 128 cores and 512 GB of RAM, and we had multiple slaves too. We tried a few techniques to tackle the large-scale deletion of rows; I will list them here from worst to best, as we found them:

    1. Fetching and Deleting one row at a time. This is the absolute worst that you could do. So, we did not even try this.
    2. Fetching the first 'X' rows from the database using a LIMIT query on the primary_key column, checking the row ids to delete in the application, and then firing a single delete query with the list of primary_key ids. So, 2 queries per 'X' rows. This approach was fine, but running it as a batch job deleted about 5 million rows in 10 minutes or so, which left the slaves of our MySQL DB lagging by 105 seconds, a 105-second lag for a 10-minute activity. So, we had to stop.
    3. In this technique, we introduced a 50 ms pause between subsequent batch fetches and deletions of size 'X' each. This solved the lag problem, but we were now deleting 1.2-1.3 million rows per 10 minutes, compared to 5 million with technique #2.
    4. Partitioning the database table and then dropping entire partitions when they are no longer needed. This is the best solution we found, but it requires a pre-partitioned table. We went with option #3 because we had a very old, non-partitioned table with indexing only on the primary_key column; creating partitions would have taken too much time and we were in crisis mode. Here are some links on partitioning that I found helpful: the official MySQL reference, and Oracle DB daily partitioning.

    So, IMO, if you can afford the luxury of partitioning your table, go for option #4; otherwise, you are stuck with option #3, a minimal sketch of which follows below.
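
    For reference, here is a hedged sketch of the batched-delete idea from techniques #2 and #3, written as a MySQL stored procedure (the original answer drove it from an application-side batch job). The table name big_table, its primary key id, the is_duplicate condition, and the batch size are all placeholders, not the answerer's actual schema:

    DELIMITER $$
    CREATE PROCEDURE purge_in_batches()
    BEGIN
      DECLARE affected INT DEFAULT 1;
      WHILE affected > 0 DO
        -- delete the next batch of rows, selected by primary key
        -- (the derived table works around MySQL's "LIMIT in IN subquery" restriction)
        DELETE t FROM big_table t
          JOIN (SELECT id FROM big_table WHERE is_duplicate = 1 LIMIT 10000) b
            ON t.id = b.id;
        SET affected = ROW_COUNT();
        -- ~50 ms pause between batches so the slaves can keep up (technique #3)
        DO SLEEP(0.05);
      END WHILE;
    END$$
    DELIMITER ;

    CALL purge_in_batches();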

  • 2020-12-02 07:36

    I had a heavily loaded database from which older entries needed to be deleted all the time. Some of the delete queries started to hang, so I had to kill them, and if too many deletes piled up, the whole database became unresponsive, so I also had to restrict the number of parallel runs. So I created a cron job that runs every minute and starts this script:

    #!/bin/bash
    
    #######################
    #
    i_size=1000
    max_delete_queries=10
    sleep_interval=15
    min_operations=8
    max_query_time=1000
    
    USER="user"
    PASS="super_secret_password"
    
    log_max_size=1000000
    log_file="/var/tmp/clean_up.log"
    #
    #######################
    
    touch $log_file
    log_file_size=`stat -c%s "$log_file"`
    if (( $log_file_size > $log_max_size ))
    then
        rm -f "$log_file"
    fi 
    
    ## -- count the DELETE queries currently running against big.table
    delete_queries=`mysql -u $USER -p$PASS -e  "SELECT * FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %';"| grep Query|wc -l`
    
    ## -- here the hanging DELETE queries will be stopped
    mysql -u $USER -p$PASS -e "SELECT ID FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %' AND TIME > $max_query_time;" |grep -v ID| while read -r id ; do
        echo "delete query stopped on `date`" >>  $log_file
        mysql -u $USER -p$PASS -e "KILL $id;"
    done
    
    if (( $delete_queries > $max_delete_queries ))
    then
      sleep $sleep_interval
    
      delete_queries=`mysql -u $USER -p$PASS -e  "SELECT * FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %';"| grep Query|wc -l`
    
      if (( $delete_queries > $max_delete_queries ))
      then
    
          sleep $sleep_interval
    
          delete_queries=`mysql -u $USER -p$PASS -e  "SELECT * FROM information_schema.processlist WHERE Command = 'Query' AND INFO LIKE 'DELETE FROM big.table WHERE result_timestamp %';"| grep Query|wc -l`
    
          # -- if there are too many delete queries after the second wait
          #  the table will be cleaned up by the next cron job
          if (( $delete_queries > $max_delete_queries ))
            then
                echo "clean-up skipped on `date`" >> $log_file
                exit 1
            fi
      fi
    
    fi
    
    running_operations=`mysql -u $USER -p$PASS -e "SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND != 'Sleep';"| wc -l`
    
    if (( $running_operations < $min_operations ))
    then
        # -- if the database is not too busy this bigger batch can be processed
        batch_size=$(($i_size * 5))
    else 
        batch_size=$i_size
    fi
    
    echo "starting clean-up on `date`" >>  $log_file
    
    mysql -u $USER -p$PASS -e 'DELETE FROM big.table WHERE result_timestamp < UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 31 DAY))*1000 LIMIT '"$batch_size"';'
    
    if [ $? -eq 0 ]; then
        # -- if the sql command exited normally the exit code will be 0
        echo "delete finished successfully on `date`" >>  $log_file
    else
        echo "delete failed on `date`" >>  $log_file
    fi
    

    With this I've achieved around 2 million deletes per day, which was OK for my use case.

  • 2020-12-02 07:37

    Do it in batches of, let's say, 2000 rows at a time, and commit in between. A million rows isn't that much, and this will be fast unless you have many indexes on the table. For example, see the sketch below.
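
    A minimal sketch of one such batch, assuming autocommit is enabled so each statement commits on its own; the table name and WHERE condition are placeholders:

    -- run repeatedly until it reports 0 affected rows
    DELETE FROM my_table
     WHERE is_duplicate = 1
     LIMIT 2000;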

  • 2020-12-02 07:38

    I had a use case of deleting 1M+ rows from a 25M+ row table in MySQL. I tried different approaches, such as the batch deletes described above.
    I found that the fastest way was to copy the required records to a new table:

    1. Create Temporary Table that holds just ids.

    CREATE TABLE id_temp_table ( temp_id int);

    2. Insert the ids that should be removed:

    insert into id_temp_table (temp_id) select.....

    3. Create a new table, table_new.

    4. Insert all records from the original table into table_new, skipping the unwanted rows that are listed in id_temp_table:

    insert into table_new .... where table_id NOT IN (select distinct(temp_id) from id_temp_table);

    5. Rename the tables.

    The whole process took ~1 hr. In my use case, a simple delete of a batch of 100 records took 10 minutes.
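
    A hedged, end-to-end sketch of the steps above; orig_table, its primary key table_id, and the WHERE clause that selects the rows to drop are placeholders for whatever the real schema and criteria are:

    -- 1. Temporary table holding only the ids to remove
    CREATE TABLE id_temp_table (temp_id INT);

    -- 2. Collect the ids to remove (selection criteria are a placeholder)
    INSERT INTO id_temp_table (temp_id)
      SELECT table_id FROM orig_table WHERE is_duplicate = 1;

    -- 3. New table with the same structure and indexes
    CREATE TABLE table_new LIKE orig_table;

    -- 4. Copy everything except the unwanted rows
    INSERT INTO table_new
      SELECT * FROM orig_table
      WHERE table_id NOT IN (SELECT DISTINCT temp_id FROM id_temp_table);

    -- 5. Swap the tables; RENAME TABLE performs both renames atomically
    RENAME TABLE orig_table TO orig_table_old, table_new TO orig_table;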

  • 2020-12-02 07:39

    I faced a similar issue while deleting multiple records from a transaction table after moving them to an archival table.

    I used a temporary table to identify the records to be deleted.

    The temporary table, 'archive_temp', stored the ids and was created in memory without any indexes.

    Hence, while deleting records from the original transaction table with, e.g., DELETE FROM tat WHERE id IN (SELECT id FROM archive_temp);, the query kept returning the error "Lost connection to server".

    I created an index on that temporary table after creating it: ALTER TABLE archive_temp ADD INDEX (id);

    After that, my delete query executed in a matter of seconds, irrespective of the number of records to be deleted from the transaction table.

    So it is worth checking your indexes. Hope this helps.
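
    Put together, the fix looks roughly like this; how archive_temp gets populated is application-specific, so that step is only indicated, and tat and its id column are the names used above:

    -- in-memory table of ids to purge, as described above
    CREATE TEMPORARY TABLE archive_temp (id INT) ENGINE=MEMORY;

    -- ... populate archive_temp with the ids of the archived records ...

    -- the index is the part that stopped the "Lost connection" timeouts
    ALTER TABLE archive_temp ADD INDEX (id);

    DELETE FROM tat WHERE id IN (SELECT id FROM archive_temp);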

  • 2020-12-02 07:39

    I have not scripted anything to do this, and doing it properly would absolutely require a script, but another option is to create a new, duplicate table and select all the rows you want to keep into it. Use a trigger to keep it up to date while this process completes. When it is in sync (minus the rows you want to drop), rename both tables atomically (a single RENAME TABLE statement can do both) so that the new one takes the place of the old. Drop the old table, and voilà!

    This (obviously) requires a lot of extra disk space and may tax your I/O resources, but otherwise it can be much faster.

    Depending on the nature of the data, or in an emergency, you could rename the old table, create a new, empty table in its place, and select the "keep" rows into the new table at your leisure...
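
    A rough sketch of the moving parts, under heavy assumptions: my_table, its columns, and the keep_row condition are placeholders, and a real migration would also need UPDATE and DELETE triggers plus care around rows that change mid-copy (this is essentially what tools like pt-online-schema-change automate):

    -- shadow table with the same structure
    CREATE TABLE my_table_new LIKE my_table;

    -- keep the shadow copy up to date while the bulk copy runs
    -- (only the INSERT trigger is shown; UPDATE/DELETE triggers are also needed)
    CREATE TRIGGER my_table_ins AFTER INSERT ON my_table
    FOR EACH ROW
      REPLACE INTO my_table_new (id, col_a, col_b)
      VALUES (NEW.id, NEW.col_a, NEW.col_b);

    -- copy only the rows worth keeping; IGNORE skips rows the trigger already copied
    INSERT IGNORE INTO my_table_new
      SELECT * FROM my_table WHERE keep_row = 1;

    -- atomic swap; the old table sticks around until you are sure
    RENAME TABLE my_table TO my_table_old, my_table_new TO my_table;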
