read.csv warning 'EOF within quoted string' prevents complete reading of file

Anonymous (unverified), posted 2019-12-03 00:59:01

Question:

I have a CSV file (24.1 MB) that I cannot fully read into my R session. When I open the file in a spreadsheet program I can see 112,544 rows. When I read it into R with read.csv I only get 56,952 rows and this warning:

cit 

I can read the whole file into R with readLines:

rl 

But I can't get this back into R as a table (via read.csv):

write.table(rl, "rl.txt", quote = FALSE, row.names = FALSE) rl_in 

How can I solve or work around this EOF message (which seems to be more of an error than a warning) to get the entire file into my R session?

I have similar problems with other methods of reading CSV files:

require(sqldf) cit_sql 

Here's my sessionInfo()

    R version 3.0.1 (2013-05-16)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] tools     tcltk     stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
     [1] ff_2.2-11             bit_1.1-10            data.table_1.8.8      sqldf_0.4-6.4
     [5] RSQLite.extfuns_0.0.1 RSQLite_0.11.4        chron_2.3-43          gsubfn_0.6-5
     [9] proto_0.3-10          DBI_0.2-7

Answer 1:

You need to disable quoting.

cit 

I think it is because of lines like this one (note the quoted "Thorn" and "Minus"):

    > readLines("citations.CSV")[82]
    [1] "10.2307/3642839,10.2307/3642839\t,\"Thorn\" and \"Minus\" in Hieroglyphic Luvian Orthography\t,H. Craig Melchert\t,Anatolian Studies\t,38\t,\t,1988-01-01T00:00:00Z\t,pp. 29-42\t,British Institute at Ankara\t,fla\t,\t,"
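To see the effect of disabling quoting on a small scale, here is a minimal sketch in base R. The sample file and its contents are made up for illustration (they are not the asker's data): a header plus ten rows, where row 7 contains a single unbalanced double quote.

```r
# Hypothetical sample file: header + 10 rows; row 7 has one unbalanced quote.
path <- tempfile(fileext = ".csv")
rows <- sprintf("%d,Title %d", 1:10, 1:10)
rows[7] <- "7,The \"Thorn sign"
writeLines(c("id,title", rows), path)

# Default quoting (quote = "\""): the unbalanced quote opens a quoted string
# that runs to the end of the file, so the rows after it are not read as rows.
# The "EOF within quoted string" warning is suppressed here on purpose.
with_quotes <- suppressWarnings(read.csv(path))

# Quoting disabled: every physical line becomes one row, and the quote
# character stays in the data as a literal character.
no_quotes <- read.csv(path, quote = "")

nrow(with_quotes)
nrow(no_quotes)
```

Note that with quote = "" a comma inside a genuinely quoted field would be treated as a separator, so this only suits files where quotes are not used to protect embedded commas.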


Answer 2:

I'm a new-ish R user and thought I'd post this in case it helps anyone else. I was trying to read in data from a comma-separated text file that included a few Spanish characters, and it took me forever to figure it out. I knew I needed to use UTF-8 encoding, set the header argument to TRUE, and set the sep argument to ",", but I still got hung up. After reading this post I tried setting the fill argument to TRUE, but then got the same "EOF within quoted string" warning, which I was able to fix in the same manner as above. My successful read.table call looks like this:

target

The result has Spanish language characters and same dims I had originally, so I'm calling it a success! Thanks all!
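The combination of arguments described in this answer might look roughly like the sketch below. This is not the answerer's actual call; the file, column names, and values are invented, and using the encoding argument (rather than fileEncoding) is an assumption.

```r
# Hypothetical UTF-8 sample file with Spanish characters and a stray quote.
path <- tempfile(fileext = ".txt")
writeLines(c("nombre,ciudad",
             "Jos\u00e9,M\u00e1laga",
             "Bego\u00f1a,\"La Coru\u00f1a"),  # stray, unbalanced quote
           path, useBytes = TRUE)              # write the UTF-8 bytes as-is

# header, sep, fill and quote as described in the answer; encoding = "UTF-8"
# marks the strings that are read as UTF-8 (assumption: the answer may have
# used fileEncoding instead).
dat <- read.table(path, header = TRUE, sep = ",", quote = "", fill = TRUE,
                  encoding = "UTF-8", stringsAsFactors = FALSE)

dim(dat)
```

Comparing dim(dat) with the number of lines in the file is a quick check that nothing was swallowed by a quoted string.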



Answer 3:

As pointed out above and in the R help, disabling quoting altogether by simply adding:

    quote = "" 

to the read.csv() call worked for me.

The error, "EOF within quoted string", occurred with:

    > iproscan.53A.neg     = read.csv("interproscan.53A.neg.n.csv",
    +                        colClasses=c(pb.id      = "character",
    +                                     genLoc     = "character",
    +                                     icode      = "character",
    +                                     length     = "character",
    +                                     proteinDB  = "character",
    +                                     protein.id = "character",
    +                                     prot.desc  = "character",
    +                                     start      = "character",
    +                                     end        = "character",
    +                                     evalue     = "character",
    +                                     tchar      = "character",
    +                                     date       = "character",
    +                                     ipro.id    = "character",
    +                                     prot.name  = "character",
    +                                     go.cat     = "character",
    +                                     reactome.id= "character"),
    +                                     as.is=T,header=F)
    Warning message:
    In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
      EOF within quoted string
    > dim(iproscan.53A.neg)
    [1] 69383    16

The data as read in was missing 6,619 lines. With quoting disabled, however, the same call:

    > iproscan.53A.neg     = read.csv("interproscan.53A.neg.n.csv",
    +                        colClasses=c(pb.id      = "character",
    +                                     genLoc     = "character",
    +                                     icode      = "character",
    +                                     length     = "character",
    +                                     proteinDB  = "character",
    +                                     protein.id = "character",
    +                                     prot.desc  = "character",
    +                                     start      = "character",
    +                                     end        = "character",
    +                                     evalue     = "character",
    +                                     tchar      = "character",
    +                                     date       = "character",
    +                                     ipro.id    = "character",
    +                                     prot.name  = "character",
    +                                     go.cat     = "character",
    +                                     reactome.id= "character"),
    +                                     as.is=T,header=F,quote="")
    > dim(iproscan.53A.neg)
    [1] 76002    16

This worked without error, and all lines were successfully read in.



Answer 4:

I also ran into this problem, and was able to work around a similar EOF error using:

read.table("....csv", sep=",", ...)

Notice that the separator parameter is defined within the more general read.table().



Answer 5:

Actually, using read.csv() to read a file with free-text content is not a good idea. Disabling quoting with quote="" is only a temporary solution: it only works for stray (unpaired) quotation marks. Other things can cause the warning as well, for example some special characters.

For those special-character cases, the permanent solution is to inspect your file, find out what the special characters are, and use a regular expression to eliminate them.

Have you thought of installing the {data.table} package and using fread() to read the file? It is much faster and would not bother you with this EOF warning. Note that the object it returns is a data.table rather than a plain data.frame. data.table has many good features, but you can convert the result with as.data.frame() if needed.
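For the "check your file" step, one simple heuristic is to look for lines that contain an odd number of double quotes. The sketch below uses only base R; the sample file is invented for illustration, and the odd-count rule is an assumption about what "suspect" means, not a complete CSV validator.

```r
# Hypothetical sample file: line 3 contains one unbalanced double quote.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,title",
  "1,\"Thorn\" and \"Minus\"",
  "2,An unbalanced \" quote",
  "3,Plain title"
), path)

# Count the double quotes on every physical line.
lines <- readLines(path)
n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))

# Lines with an odd count are candidates for the quote that never closes.
suspect <- which(n_quotes %% 2 == 1)
suspect
lines[suspect]
```

A field that is legitimately quoted across several lines would also show up here, so the flagged lines are a starting point for manual inspection rather than a verdict.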



Answer 6:

I had a similar problem: the EOF warning, and only part of the data loading with read.csv(). I tried quote="", but it only removed the EOF warning.

But looking at the first row that was not loading, I found that there was a special character, an arrow → (hexadecimal value 0x1A), in one of the cells. After deleting the arrow, the data loaded normally.
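The byte 0x1A is the SUB (Ctrl+Z) control character, which Windows text-mode reading traditionally treats as an end-of-file marker, and that would explain a read stopping at that row. A minimal sketch for locating and stripping such bytes in base R follows; the sample file is invented, and removing the byte outright (rather than replacing it) is just one possible choice.

```r
# Hypothetical sample file with a SUB byte (0x1A) inside the third data row.
path <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,title\n1,a\n2,b\n3,c"),
           as.raw(0x1A),
           charToRaw("d\n4,e\n")), path)

# Read the file as raw bytes so that nothing is interpreted as end-of-file.
bytes <- readBin(path, what = "raw", n = file.size(path))

# Byte offsets of every 0x1A, and the line number of the first one
# (1 + the number of newline bytes that precede it).
sub_pos <- which(bytes == as.raw(0x1A))
line_no <- 1L + sum(bytes[seq_len(sub_pos[1] - 1L)] == as.raw(0x0A))

# Write a cleaned copy without the SUB bytes and read that instead.
clean_path <- tempfile(fileext = ".csv")
writeBin(bytes[bytes != as.raw(0x1A)], clean_path)
dat <- read.csv(clean_path, stringsAsFactors = FALSE)

line_no
nrow(dat)
```

Working on a cleaned copy keeps the original file untouched, which makes it easy to compare row counts before and after.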


