Preventing column-class inference in fread()

Is there a way for fread to mimic the behaviour of read.table, whereby the class of each column is set only by the rows that are actually read in?

I have numeric data with a few comments underneath the main data. When I use fread to read in the data, the columns are converted to character. However, by setting nrows in read.table I can stop this behaviour. Is this possible in fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks.

An example

d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, "test.csv", row.names = FALSE)

in_d <- read.csv("test.csv", nrows = 100, header = TRUE)
in_dt <- data.table::fread("test.csv", nrows = 100)

Which produces

> str(in_d)
'data.frame':   100 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10 ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame':  100 obs. of  2 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>

As a workaround, I thought I would be able to use read.table to read in one line, get the classes, and pass them as colClasses, but I am misunderstanding something.

cl <- read.csv("test.csv", nrows = 1, header = TRUE)
cols <- unname(sapply(cl, class))
in_dt <- data.table::fread("test.csv", nrows = 100, colClasses = cols)
str(in_dt)

Using Windows 8.1, R version 3.1.2 (2014-10-31), platform x86_64-w64-mingw32/x64 (64-bit).

Option 1: Using a system command

fread() allows the use of a system command in its first argument. We can use it to remove the quotes in the first column of the file.

indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr>

The system command cat test.csv | tr -d '\"' explained:

cat test.csv reads the file to standard output
| is a pipe, using the output of the previous command as input for the next command
tr -d '\"' deletes (-d) all occurrences of double quotes ('\"') from the current input
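If cat and tr are not available (for example on a plain Windows setup), roughly the same effect can be sketched in R itself. This is a minimal sketch, not part of the original answer: it assumes a data.table version whose fread() accepts the text= argument (1.11.0 or later, as far as I know), and the temporary file path is only there to make the example self-contained.

```r
library(data.table)

# Recreate the question's file in a temporary location (illustrative path).
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)

# Read only the header plus the first 100 data rows, so the trailing
# comment rows never reach fread()'s type inference.
keep <- readLines(path, n = 101)

# R-side equivalent of tr -d '"': drop every double quote.
keep <- gsub('"', "", keep, fixed = TRUE)

# Assumption: fread() supports text= in the installed data.table version.
indt <- fread(text = keep)
str(indt)
```

Because the comment rows are cut off before parsing, this does not depend on how fread() samples rows for type detection.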

Option 2: Coercion after reading

Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().

library(data.table)
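indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x, as.is = TRUE)]  # as.is = TRUE avoids factor conversion and the R >= 4.0 warning about an unspecified as.is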
str(indt2)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr>
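When more than one column is affected, the same idea can be applied to every character column at once. The following is a sketch under assumptions: the table dt and its columns are made up for illustration, and as.is = TRUE is used so that genuinely non-numeric columns stay character instead of becoming factors.

```r
library(data.table)

# Hypothetical table: x holds numbers stored as text, z is real text.
dt <- data.table(x = as.character(1:5), y = 1:5, z = c("a", "b", "c", "d", "e"))

# Find the character columns, then convert each one by reference.
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (chr_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = chr_cols]

str(dt)
```

Columns that contain only numbers come back as integer or numeric, while columns such as z are left as character.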

Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,

x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1]  1  4 NA  6
# Warning message:
# NAs introduced by coercion 
type.convert(x, as.is = TRUE)
# [1]  1  4 NA  6

But of course you can use as.numeric() as well.
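The difference becomes more visible when the file uses a custom missing-value marker. In the sketch below, the marker "n/a" and the helper count_warnings() are my own illustrative choices, not something from the question; na.strings is a documented argument of type.convert().

```r
# Hypothetical helper: evaluate an expression and count the warnings it raises.
count_warnings <- function(expr) {
  n <- 0L
  withCallingHandlers(expr, warning = function(w) {
    n <<- n + 1L
    invokeRestart("muffleWarning")
  })
  n
}

x <- c("1", "4", "n/a", "6")

# as.numeric() does not know "n/a", so it warns while producing the NA.
n_as_numeric <- count_warnings(as.numeric(x))

# type.convert() can be told which strings mean "missing".
converted <- type.convert(x, na.strings = c("NA", "n/a"), as.is = TRUE)
n_type_convert <- count_warnings(
  type.convert(x, na.strings = c("NA", "n/a"), as.is = TRUE)
)
```

Here as.numeric() raises a coercion warning and type.convert() raises none, while both mark the third element as missing.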

Note: This answer assumes data.table dev v1.9.5

smci

Ok, the customer is abusing the CSV format by intentionally writing trailing string rows into an integer column, without starting those rows with a comment.char (#).

Then you somehow expect to override fread()'s type inference and read the column as integer by using nrow to limit it to just the integer rows. read.csv(..., nrow) will accept this; however, fread() always uses all rows for type inference (not just the ones selected by nrow, skip, header), even if they start with comment.char (that's a bug).

Sounds like an abuse of CSV. Your comment rows should be prefixed with #.
Yes, fread() needs a fix or enhancement to ignore comment rows during type inference.
For now, you can work around this with fread() by post-processing the data.table after reading it in.
It's arguable whether fread() should be changed to support the behavior you want: using nrows to limit what gets exposed to type-inference. It might fix your (pretty unique) case and break some others.
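To illustrate the first point: if the footer rows were prefixed with #, base R's reader could skip them without any nrow trick. This is a minimal sketch with a made-up file, not the asker's data. Note that read.csv() disables comments by default (comment.char = ""), so the argument has to be set explicitly.

```r
# Hypothetical file whose footer comments start with '#'.
path <- tempfile(fileext = ".csv")
writeLines(
  c("x,y", paste(1:100, 1:100, sep = ","), "# fff", "# another note"),
  path
)

# Lines consisting only of a comment are treated as blank and skipped.
in_d <- read.csv(path, comment.char = "#")
str(in_d)
```

Both columns are read as integer because the comment rows never take part in type conversion.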

I don't see why you (EDIT: the customer) can't write your comments to a separate .txt/README/data-dictionary file to accompany the .csv. The practice of using a separate data-dictionary file is pretty well-established. I've never seen someone do this to a CSV file. At least move the comments to the header, not a footer.

Source: https://stackoverflow.com/questions/29499145/preventing-column-class-inference-in-fread

Tags

data.table

read.table