R: respect quotes around numbers (treat as character) with read.csv()?

Question

I have a .csv file with account codes in the form of 00xxxxx and I need them to stay that way for use with other programs which use the account codes in this format. I was just working on an R script to reconcile account charges on Friday and swore that as.is = T was working for me. Now, it doesn't seem to be. Here's some example data:

test <- data.frame(col1 = c("apple", "banana", "carrot"),
                   col2 = c(100, 200, 300),
                   col3 = c("00234", "00345", "00456"))

My write.table strategy:

write.table(test, file = "C:/path/test.csv", quote = T,
            sep=",", row.names = F)

Remove the old data.frame and re-read:

rm(test)
test <- read.csv("C:/path/test.csv")
test

    col1 col2 col3
1  apple  100  234
2 banana  200  345
3 carrot  300  456

In case it's not clear, it should look like the original data.frame we created:

test
    col1 col2  col3
1  apple  100 00234
2 banana  200 00345
3 carrot  300 00456

I also tried the following, after perusing the available read.table options, with the results all the same as above:

test <- read.csv("C:/path/test.csv", quote = '"')
test <- read.csv("C:/path/test.csv", as.is = T)
test <- read.csv("C:/path/test.csv", as.is = T, quote = '"')

stringsAsFactors didn't seem to be relevant in this case (and it sounds like as.is will do the same thing).

When I open the file in Emacs, col3 is, indeed, surrounded by quotes, so I'd expect it to be treated like text instead of converted to a number.

Most of the other questions are simply about not treating things like factors, or getting numbers not to be recognized as characters, usually the result of an overlooked character string in that column.

I see I can pursue the colClasses argument from questions like this one, but I'd prefer not to; my "colClasses" are built into the data :) Quoted = character, not quoted = numeric.

Answer 1:

I expect there's a better method, but one option would be to use quote=""

test <- read.csv("C:/path/test.csv", as.is = TRUE, quote = "")

This would make the quotes part of the values, giving you:

test
#col1 col2  col3
#1  apple  100 "00234"
#2 banana  200 "00345"
#3 carrot  300 "00456"

You could then either keep them in that format, or use something like gsub to remove them:

test$col3 <- gsub('"', '', test$col3)

test
#col1 col2  col3
#1  apple  100 00234
#2 banana  200 00345
#3 carrot  300 00456

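You can use an apply-type function to run the gsub over the whole data frame at once. Note that sapply returns a character matrix here, so every column (including col2) comes back as text rather than as a number: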

test <- as.data.frame(sapply(test,gsub,pattern='"',replacement=""))

sapply code taken from: R - how to replace parts of variable strings within data frame

Obviously, this method will only be useful to you if you don't need the quotes elsewhere for other reasons.
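
A minimal sketch of the same idea that keeps col2 numeric, assuming the file was written by the question's write.table call (so the header and col1 are quoted as well as col3). The temporary path and the names raw and is_chr are illustrative, not from the original answer:

```r
# Recreate the example file at a temporary (illustrative) path
path <- tempfile(fileext = ".csv")
test <- data.frame(col1 = c("apple", "banana", "carrot"),
                   col2 = c(100, 200, 300),
                   col3 = c("00234", "00345", "00456"))
write.table(test, file = path, quote = TRUE, sep = ",", row.names = FALSE)

# quote = "" keeps the literal quote characters in the header and the values;
# check.names = FALSE stops read.csv from mangling the quoted column names
raw <- read.csv(path, as.is = TRUE, quote = "", check.names = FALSE)

# Strip the quotes from the column names and from character columns only,
# leaving numeric columns such as col2 untouched
names(raw) <- gsub('"', "", names(raw), fixed = TRUE)
is_chr <- vapply(raw, is.character, logical(1))
raw[is_chr] <- lapply(raw[is_chr], gsub,
                      pattern = '"', replacement = "", fixed = TRUE)
str(raw)
```

This relies on no quoted field containing a comma, since disabling quote handling makes every comma a separator.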

Answer 2:

I pinged a couple of friends who are R users, and both suggested using colClasses. I was relieved to find that I didn't need to specify every class, since my data has ~25 columns. SO confirmed this (once I knew what I was looking for) in another question.

test <- read.csv("C:/path/test.csv", colClasses = c(col3 = "character"))
test

    col1 col2  col3
1  apple  100 00234
2 banana  200 00345
3 carrot  300 00456

As it currently stands, the question is a duplicate of the other with respect to the solution. The difference is that I was looking for ways other than colClasses (since as.is sounds like such a likely candidate), while that question was about how to use colClasses.

I'll reiterate that I don't actually like this solution, even though it's pretty simple. Quotes denote text fields in a .csv, and they don't seem to be respected in this case. The LibreOffice .csv import has a checkbox for "Treat quoted fields as text," which I'd think is analogous to as.is = T in R. Obviously not! #end_rant
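
For what it's worth, the "quoted = character, not quoted = numeric" rule can be approximated in base R by inspecting the file first and building colClasses from what is found. The sketch below is only that: read_csv_respecting_quotes is a hypothetical helper, not a base R function, and it assumes no quoted field contains a comma or a line break.

```r
# Hypothetical helper, not part of base R: infer colClasses from quoting.
read_csv_respecting_quotes <- function(path) {
  # First pass: disable quote handling so the literal " characters survive
  raw <- read.csv(path, quote = "", colClasses = "character",
                  check.names = FALSE)
  # Treat a column as text when every non-missing, non-empty field
  # is wrapped in double quotes
  is_quoted <- vapply(raw, function(x) {
    x <- x[!is.na(x) & nzchar(x)]
    all(grepl('^".*"$', x))
  }, logical(1))
  # "character" for quoted columns; NA lets read.csv guess the type as usual
  classes <- ifelse(is_quoted, "character", NA_character_)
  # Second pass: normal quote handling, with the inferred classes
  read.csv(path, colClasses = unname(classes))
}

# Usage with the example data from the question (temporary path is illustrative)
path <- tempfile(fileext = ".csv")
test <- data.frame(col1 = c("apple", "banana", "carrot"),
                   col2 = c(100, 200, 300),
                   col3 = c("00234", "00345", "00456"))
write.table(test, file = path, quote = TRUE, sep = ",", row.names = FALSE)
test2 <- read_csv_respecting_quotes(path)
str(test2)
```

The file is read twice, which is a deliberate trade-off: the first pass only looks at quoting, and the second pass leaves all real parsing to read.csv.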

Answer 3:

I have this issue too. Of course you can manually specify colClasses, but why is this necessary when data is quoted? I agree with the OP's 'rant' in the answer posted to his own question:

Quotes denote text fields in a .csv, and they don't seem to be respected in this case.

Anyway, I elected to use data.table's fread() which doesn't have this issue. Still annoying behaviour for read.csv though.

# here's a data frame with chr and int columns
my_df <- data.frame(chars=letters[1:5],
                    nums=1:5,
                    txt_nums=sprintf('%02d', 1:5),
                    stringsAsFactors=F)

# all looks as it should
lapply(my_df, class)

# $chars
# [1] "character"
# 
# $nums
# [1] "integer"
# 
# $txt_nums
# [1] "character"

But now, write to csv, read it back in, and the third column is coerced to int!

# quote=T redundant since that's the default, but just to be sure
write.csv(my_df, 'my_df.csv', row.names=F, quote=T) 
my_df2 <- read.csv('my_df.csv')
lapply(my_df2, class)

# even with as.is=TRUE, same issue
my_df2 <- read.csv('my_df.csv', as.is=T)
lapply(my_df2, class)

# data.table's fread doesn't have this issue, at least
library(data.table)
my_dt <- fread('my_df.csv')
lapply(my_dt, class)
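
As for why as.is makes no difference: according to the read.table documentation, unless colClasses is given, every column is first read as character (with quotes interpreted and dropped) and then passed through type.convert, and as.is only controls whether character columns become factors. A small sketch of that conversion step, using made-up vectors in place of a column that has already lost its quotes:

```r
# What read.csv effectively sees for a quoted column once the quotes are dropped
txt_nums <- c("00234", "00345", "00456")

# as.is = TRUE only prevents character -> factor; numeric-looking text
# is still converted
converted <- type.convert(txt_nums, as.is = TRUE)
class(converted)   # "integer"

# Genuinely non-numeric text stays character with as.is = TRUE
kept <- type.convert(c("apple", "banana"), as.is = TRUE)
class(kept)        # "character"
```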

Source: https://stackoverflow.com/questions/22923756/r-respect-quotes-around-numbers-treat-as-character-with-read-csv

Tags

csv

formatting