Question
I am following up on my earlier question, sqldf returns zero observations, with a reproducible example.
I found that the problem probably comes from the comma in one of the cells ("1,500+"), and I think I have to use a filter as suggested in sqldf, csv, and fields containing commas, but I am not sure how to define that filter. Below is the code:
library(sqldf)
df <- data.frame("a" = c("8600000US01770", "8600000US01937"),
                 "b" = c("1,500+", "-"),
                 "c" = c("***", "**"),
                 "d" = c("(x)", "(x)"),
                 "e" = c("(x)", "(x)"),
                 "f" = c(992, "-"))
write.csv(df, 'df_to_read.csv')
# 'df_to_read.csv' looks like this:
"","a","b","c","d","e","f"
1,8600000US01770,1,500+,***,(x),(x),992
2,8600000US01937,-,**,(x),(x),-
Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing', file.format = list(eol="\n"))
When I run this code, I get the following error:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: df_to_read.csv line 2 expected 7 columns of data but found 8
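As a side note (not part of the original question), the 7-vs-8 mismatch in the error can be reproduced by counting commas per line with quoting disabled, which is effectively how the importer ends up splitting this file:
count.fields("df_to_read.csv", sep = ",", quote = "")
# line 2 (the first data row) should report 8 fields instead of 7,
# caused by the comma inside "1,500+"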
Answer 1:
The problem comes from reading the column created by df$b. The first value in that column contains a comma, so sqldf() treats it as a field separator. One way to deal with this is to remove the comma or replace it with some other symbol (like a space). You can also use the read.csv2.sql function:
library(sqldf)
df <- data.frame("a" = c("8600000US01770", "8600000US01937"),
                 "b" = c("1,500+", "-"),
                 "c" = c("***", "**"),
                 "d" = c("(x)", "(x)"),
                 "e" = c("(x)", "(x)"),
                 "f" = c("992", "-"))
write.csv(df, 'df_to_read.csv', row.names = FALSE)
Housing_filtered <- read.csv2.sql("df_to_read.csv", sql = "select * from file", header=TRUE)
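An alternative sketch, not part of this answer: because write.csv quotes the "1,500+" value, base read.csv parses the file correctly, and sqldf() can then query the resulting data frame directly:
# read with base R, which honors the quotes around "1,500+", so the embedded
# comma is never treated as a separator; sqldf() then queries the data frame
Housing_df <- read.csv("df_to_read.csv", stringsAsFactors = FALSE)
Housing_filtered <- sqldf("SELECT * FROM Housing_df")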
Answer 2:
The best way is to clean your file once, so that you don't need to worry about the same issue again later in your analysis. This should get you going:
Housing <- readLines("df_to_read.csv") # read the file
n <- 6 # number of separators expected = number of columns expected - 1
library(stringr)
ln_idx <- ifelse(str_count(Housing, pattern = ",") == n, 0, 1)
which(ln_idx == 1) # line indices with issue, includes the header row
#[1] 2
Check the specific issues and write the corrected lines back to your file at the same indices. For example, line 2:
Housing[2]
#[1] "1,8600000US01770,1,500+,***,(x),(x),992" # hmm.. extra comma
Housing[2] = "1,8600000US01770,1500+,***,(x),(x),992" # removed the extra comma
writeLines(Housing, "df_to_read.csv")
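If many lines were affected, the manual edit could be automated. A rough sketch, not from the original answer, assuming the only embedded commas are thousands separators like the one in "1,500+":
# for the flagged lines, drop any comma that sits between a digit and a group
# of exactly three digits (a thousands separator), leaving real field
# separators untouched
bad <- which(ln_idx == 1)
Housing[bad] <- gsub("(?<=[0-9]),(?=[0-9]{3}(?![0-9]))", "", Housing[bad], perl = TRUE)
writeLines(Housing, "df_to_read.csv")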
Now business is as usual, and you are good to go:
Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing')
# Housing_filtered
# a b c d e f
# 1 8600000US01770 1500+ *** (x) (x) 992
# 2 8600000US01937 - ** (x) (x) -
Source: https://stackoverflow.com/questions/50893208/dealing-with-commas-in-a-csv-file-in-sqldf