Is there a better way to process a 300,000-line text file and insert its data into MySQL?

Submitted by 折月煮酒 on 2019-12-04 21:18:12

From your code, it appears that your "unique identifier" (for the purposes of this insertion, at least) is the composite (READING_DATE, READING_TIME, READING_ADDRESS).

If you define such a UNIQUE key in your database, then LOAD DATA with the IGNORE keyword should do exactly what you require:

ALTER TABLE tbl_reading
  ADD UNIQUE KEY (READING_DATE, READING_TIME, READING_ADDRESS)
;

LOAD DATA INFILE '/path/to/csv'
    IGNORE
    INTO TABLE tbl_reading
    FIELDS
        TERMINATED BY ','
        OPTIONALLY ENCLOSED BY '"'
        ESCAPED BY ''
    LINES
        TERMINATED BY '\r\n'
    (@rec_0, @rec_1, @rec_2, @rec_3, @rec_4, @rec_5, @rec_6, @rec_7, @rec_8)
    SET
        READING_DATE = DATE_FORMAT(STR_TO_DATE(TRIM(@rec_0), '???'), '%Y/%m/%d'),
        READING_TIME = DATE_FORMAT(STR_TO_DATE(TRIM(@rec_1), '???'), '%H:%i:%s'),
        READING_ADDRESS    = TRIM(@rec_2),
        CO2_SET_VALUE      = TRIM(@rec_3),
        CO2_PROCESS_VALUE  = TRIM(@rec_4),
        TEMP_SET_VALUE     = TRIM(@rec_5),
        TEMP_PROCESS_VALUE = TRIM(@rec_6),
        RH_SET_VALUE       = TRIM(@rec_7),
        RH_PROCESS_VALUE   = TRIM(@rec_8)
;

(Where '???' are replaced with strings that represent the date and time formats in your CSV).
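For illustration only (the real formats depend on your CSV, which isn't shown here): if the file held dates like 2013/01/15 and times like 08:30:00, those two SET lines would read:

        -- hypothetical formats; replace with whatever your CSV actually uses
        READING_DATE = DATE_FORMAT(STR_TO_DATE(TRIM(@rec_0), '%Y/%m/%d'), '%Y/%m/%d'),
        READING_TIME = DATE_FORMAT(STR_TO_DATE(TRIM(@rec_1), '%H:%i:%s'), '%H:%i:%s'),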

Note that you should really be storing READING_DATE and READING_TIME together in a single DATETIME or TIMESTAMP column:

ALTER TABLE tbl_reading
  ADD COLUMN READING_DATETIME DATETIME AFTER READING_TIME,
  ADD UNIQUE KEY (READING_DATETIME, READING_ADDRESS)
;

UPDATE tbl_reading SET READING_DATETIME = STR_TO_DATE(
  CONCAT(READING_DATE, ' ', READING_TIME),
  '%Y/%m/%d %H:%i:%s'
);

ALTER TABLE tbl_reading
  DROP COLUMN READING_DATE,
  DROP COLUMN READING_TIME
;

In that case, the SET clause of the LOAD DATA statement would instead include:

READING_DATETIME = STR_TO_DATE(CONCAT(TRIM(@rec_0), ' ', TRIM(@rec_1)), '???')

Reading a 1 MB file line by line takes less than a second. Even concatenating all the lines and then splitting them again takes negligible time.

With a simple test, inserting 100,000 rows took about 90 seconds.

But doing a SELECT query before each insert increases the time needed by more than an order of magnitude.

The lesson here: if you need to insert large amounts of data, do it in big chunks (see LOAD DATA INFILE). If you can't do that for whatever reason, do inserts and inserts alone.
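If LOAD DATA INFILE isn't an option, batching many rows into a single multi-row INSERT captures most of the benefit. A minimal sketch, using the column names from the question and invented values:

INSERT INTO tbl_reading
    (READING_DATE, READING_TIME, READING_ADDRESS,
     CO2_SET_VALUE, CO2_PROCESS_VALUE,
     TEMP_SET_VALUE, TEMP_PROCESS_VALUE,
     RH_SET_VALUE, RH_PROCESS_VALUE)
VALUES
    -- each statement carries a chunk of rows, so there is one round trip per chunk
    ('2013/01/15', '08:30:00', '01', '5.0', '4.9', '37.0', '36.8', '95.0', '94.2'),
    ('2013/01/15', '08:31:00', '01', '5.0', '5.0', '37.0', '37.1', '95.0', '94.8'),
    ('2013/01/15', '08:32:00', '01', '5.0', '5.1', '37.0', '36.9', '95.0', '95.1')
;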

Update:

As @eggyal already suggested, add a unique key to your table definition. In my small, one-column test, I added a unique key and changed the INSERT to INSERT IGNORE. Wall-clock time increased by 15%-30% (~100-110 sec), which is far better than the jump to 38 min (25 times slower!) with a separate SELECT + INSERT.

So, in conclusion (stealing from eggyal), add

ALTER TABLE tbl_reading
  ADD UNIQUE KEY (READING_DATE, READING_TIME, READING_ADDRESS)

to your table, remove the SELECT in InsertData(), and change the INSERT to INSERT IGNORE.
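Roughly, each insert then looks like this (a sketch only, with invented values; the IGNORE keyword makes rows that collide with the new unique key be silently skipped instead of raising an error):

-- no prior SELECT needed: the unique key rejects duplicate readings server-side
INSERT IGNORE INTO tbl_reading
    (READING_DATE, READING_TIME, READING_ADDRESS,
     CO2_SET_VALUE, CO2_PROCESS_VALUE,
     TEMP_SET_VALUE, TEMP_PROCESS_VALUE,
     RH_SET_VALUE, RH_PROCESS_VALUE)
VALUES
    ('2013/01/15', '08:30:00', '01', '5.0', '4.9', '37.0', '36.8', '95.0', '94.2')
;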

You need to make some preparations before starting your inserts, because the InnoDB engine's default settings make inserts slow.

Either set this option before inserting:

innodb_flush_log_at_trx_commit=0

or wrap all your inserts in a single transaction.
Either way it will be blazing fast, no matter what syntax or driver you choose.
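A minimal sketch of both options (innodb_flush_log_at_trx_commit is a global server variable, so changing it needs the appropriate privileges; the transaction approach works from any client):

-- Option 1: relax log flushing during the bulk load
-- (0 trades a little crash safety for speed; consider restoring it to 1 afterwards)
SET GLOBAL innodb_flush_log_at_trx_commit = 0;

-- Option 2: wrap the whole load in one transaction so the log is flushed once at COMMIT
START TRANSACTION;
-- ... run all of the INSERT statements here ...
COMMIT;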
