fread() error and strange behaviour when reading csv

白昼怎懂夜的黑 提交于 2020-02-03 09:51:29

问题


I used fread() from data.table library to try read a 540MB csv file. It returned an error message saying:

' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.

I have no idea what caused the error and want to track down if it's a bug or just some data formating issue that I can tweak fread() to process.

I managed to read the csv using read.csv(), and decided to track down the row that triggered the error above (line 617174, not line 4 as the error message above). I then re-output the row and one row each immediately preceding and following the offending row, written out using write.csv() as testout.csv

I was able to read back testout.csv using read.csv() creating a data frame with 3 observations, as expected. Using fread() on testout.csv, however, resulted in a data table with only 1 observation, which is the last row.

The four lines in testout.csv are below (I start a new line for each entry below for readability).

"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"

20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129

20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail. .",617130

20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131

When I ran fread("testout.csv", sep=",", verbose=TRUE), the output was

Input contains no \n. Taking this to be a filename to open
File opened, filesize is  1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)

Any idea what may have caused the unexpected results, and the error in the first place? And any way around it? Just to be clear, my aim is to be able to use fread() to read the main file, even though read.csv() works so far.


回答1:


UPDATE: Now fixed in v1.9.3 on GitHub :

  • fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting.
    See: fread and a quoted multi-line column value

Windows users are reporting success with the latest version from GitHub.



来源:https://stackoverflow.com/questions/24340370/fread-error-and-strange-behaviour-when-reading-csv

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!