Skip rows in MySQL LOAD DATA INFILE statement when row has value 'x'

Question

Background: I have a fixed-width flat file with about 94 million rows of data. The file is from the HCUP Nationwide Inpatient Sample (NIS http://www.hcup-us.ahrq.gov/nisoverview.jsp), which provides information about hospitalizations over the past 12 years, each row a separate hospitalization. For my analyses, I will be querying diagnostic codes (ICD9-CM) to identify patients with various diagnoses.

The fixed-width file contains information on up to 15 diagnostic codes, which are provided as columns dx1 through dx15.

create table `core` (`key` char (14),
`dx1` char (5),
`dx10` char (5),
`dx11` char (5),
`dx12` char (5),
`dx13` char (5),
`dx14` char (5),
`dx15` char (5),
`dx2` char (5),
`dx3` char (5),
`dx4` char (5),
`dx5` char (5),
`dx6` char (5),
`dx7` char (5),
`dx8` char (5),
`dx9` char (5),
plus various other columns of patient demographics...);

I loaded all of the data into a MySQL table named core and can index the 15 columns. However, it seems advantageous to normalize the dx* columns into a separate dx table, such as:

create table `dx` (
`key` char (14),
`icd9` char (5)
);

where key is a foreign key to the main core table. To load the data quickly into dx, I use:

LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 74, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 79, 5);
LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) SET `key` = substr(@var1, 1, 14), `icd9` = substr(@var1, 84, 5);
etc for all 15 columns...
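
The repeated statements differ only in the start offset of the diagnosis field, so they could be generated rather than typed by hand. The following is a minimal sketch, not part of the original question. It assumes that dx1 through dx15 are contiguous 5-character fields beginning at position 74; only the offsets 74, 79 and 84 are actually shown above, so check the file layout before relying on it. The function name build_load_statements is hypothetical.

```python
# Sketch: generate one LOAD DATA statement per dx column.
# ASSUMPTION: dx1..dx15 are contiguous 5-character fields starting at
# position 74 (only offsets 74, 79 and 84 appear in the question).

FIRST_DX_START = 74  # 1-based start of dx1, taken from the question
DX_WIDTH = 5         # width of each diagnosis field
DX_COUNT = 15        # dx1 through dx15

TEMPLATE = (
    "LOAD DATA LOCAL INFILE 'data.ASC' INTO TABLE `dx` (@var1) "
    "SET `key` = substr(@var1, 1, 14), "
    "`icd9` = substr(@var1, {start}, {width});"
)


def build_load_statements():
    """Return the 15 LOAD DATA statements, one per diagnosis column."""
    statements = []
    for i in range(DX_COUNT):
        start = FIRST_DX_START + i * DX_WIDTH
        statements.append(TEMPLATE.format(start=start, width=DX_WIDTH))
    return statements


if __name__ == "__main__":
    for stmt in build_load_statements():
        print(stmt)
```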

The catch is that each row in the fixed-width file has a median of only 3 diagnosis codes, so most of the dx* columns are blank ('     ' [five blank characters]). So, while the dx table has 1.41 billion (94 million * 15) rows after loading the data, about 1.13 billion (94 million * 12) of them are blank diagnostic codes.
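
As a quick check of the row counts, the arithmetic can be reproduced from the figures quoted in the question. Note that 94 million * 12 works out to roughly 1.13 billion. Treating the median of 3 codes as if it were the mean is an approximation made here for illustration only.

```python
# Row-count arithmetic using the figures quoted in the question.
HOSPITALIZATIONS = 94_000_000  # rows in the fixed-width file
DX_COLUMNS = 15                # dx1 through dx15
MEDIAN_CODES = 3               # median non-blank codes per row (used as an approximation)

total_rows = HOSPITALIZATIONS * DX_COLUMNS
blank_rows = HOSPITALIZATIONS * (DX_COLUMNS - MEDIAN_CODES)

print(f"rows loaded into dx: {total_rows:,}")
print(f"approximately blank: {blank_rows:,}")
print(f"blank share:         {blank_rows / total_rows:.0%}")
```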

I've been simply removing them afterwards and optimizing, prior to indexing:

DELETE FROM `dx` WHERE `icd9` = "     ";
OPTIMIZE TABLE `dx`;
CREATE INDEX `icd9` ON `dx` (`icd9`);

However, this takes a lot of time.

Question: Is it possible to modify the LOAD DATA INFILE statement to skip the row if ICD9 = '     ' [five blank characters], and would this be significantly faster than my current DELETE and OPTIMIZE method? If so, I would like to pass this information on to future researchers working with these data.

Answer 1:

Is it possible to modify the LOAD DATA INFILE statement to skip the row if

No. There is an IGNORE number LINES option, but it skips a fixed number of lines at the start of the file; it cannot evaluate a logical comparison on each row.

would this be significantly faster than my current DELETE and OPTIMIZE method

Likely. But, as it's not an option, it doesn't matter.
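
Since the filtering cannot happen inside LOAD DATA itself, one possible workaround is to filter the file before it reaches MySQL. The sketch below is not from the original answer, and whether it beats DELETE plus OPTIMIZE on this data set is untested. It assumes the key occupies characters 1-14 and that dx1 through dx15 are contiguous 5-character fields starting at character 74. The output file name dx_nonblank.tsv and both function names are hypothetical.

```python
# Sketch: pre-filter the fixed-width file so blank codes never reach MySQL.
# ASSUMPTIONS (not confirmed by the source): the key is characters 1-14 and
# dx1..dx15 are contiguous 5-character fields starting at character 74.

KEY_WIDTH = 14
FIRST_DX_START = 74  # 1-based position of dx1
DX_WIDTH = 5
DX_COUNT = 15


def iter_dx_rows(lines):
    """Yield (key, icd9) pairs, skipping diagnosis fields that are all blank."""
    for line in lines:
        line = line.rstrip("\r\n")
        key = line[:KEY_WIDTH]
        for i in range(DX_COUNT):
            start = FIRST_DX_START - 1 + i * DX_WIDTH  # convert to 0-based
            code = line[start:start + DX_WIDTH]
            if code.strip():  # empty or whitespace-only fields are dropped
                yield key, code


def write_dx_file(src_path, dst_path):
    """Write tab-separated key/icd9 rows and return how many were written."""
    count = 0
    # latin-1 maps every byte to a character, so no input byte can fail to decode.
    with open(src_path, "r", encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1", newline="\n") as dst:
        for key, code in iter_dx_rows(src):
            dst.write(f"{key}\t{code}\n")
            count += 1
    return count


if __name__ == "__main__":
    written = write_dx_file("data.ASC", "dx_nonblank.tsv")
    print(f"wrote {written} non-blank diagnosis rows")
```

The resulting file uses tab-separated fields and newline-terminated lines, which are the LOAD DATA defaults, so it could then be loaded in a single pass with a plain column list (`key`, `icd9`) instead of fifteen passes over the original file.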

Answer 2:

If you can define a unique key on your diagnostic codes, say key dc(c1,c2,c3), and load with the load data infile file_name ignore into table option, every row that duplicates an existing unique key value will be skipped. You are then left with only one row whose codes are '','',''; all the other blank rows are ignored. This will obviously consume more resources than a plain LOAD DATA INFILE, but it should be faster than deleting afterwards. I also think it could be better if all your diagnostic codes were ints: blanks would be stored as 0, and on a duplicate entry attempt MySQL should recognize an integer more quickly than a string.

I also suggest you don't use LOCAL with LOAD DATA INFILE unless the file is on the client machine.

Source: https://stackoverflow.com/questions/7880818/skip-rows-in-mysql-load-data-infile-statement-when-row-has-value-x

Tags

mysql

database

data-warehouse