When reading the text below, fread() fails to detect the missing values in columns 8 and 9. This is only with the default option integer64="integer64". Setting integer64="double" or "character" correctly detects NAs. Note that the file has three types of possible NAs in V8 and V9-- ,,; , ,; and NA. Appending na.strings=c("NA","N/A",""," "), sep="," as options has no effect.
Using read.csv() works the same way as fread(integer="double").
Text to be read (also available as a file integer64_and_NA.csv):
2012,276,,0,"S1","001",1,,724135215,1590915056, 2012,276,2,8,"S1","001",1, ,,154598,0 2012,276,2,12,"S1","001",1,NA,5118863,21819477, 2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0 Here's the output from fread():
DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",") Input contains no \n. Taking this to be a filename to open Detected eol as \r\n (CRLF) in that order, the Windows standard. Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok Found 11 columns First row with 11 fields occurs on line 1 (either column names or first row of data) Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names. Count of eol after first data row: 5 Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows Type codes: 11114412221 (first 5 rows) Type codes: 11114412221 (after applying colClasses and integer64) Type codes: 11114412221 (after applying drop or select (if supplied) Allocating 11 column slots (11 - 0 NULL) 0.000s ( 0%) Memory map (rerun may be quicker) 0.000s ( 0%) sep and header detection 0.000s ( 0%) Count rows (wc -l) 0.000s ( 0%) Column type detection (first, middle and last 5 rows) 0.000s ( 0%) Allocation of 4x11 result (xMB) in RAM 0.000s ( 0%) Reading data 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 0.000s ( 0%) Coercing data already read in type bumps (if any) 0.000s ( 0%) Changing na.strings to NA 0.001s Total The resulting data.table is:
DT V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 1: 2012 276 NA 0 S1 001 1 9218868437227407266 724135215 1590915056 NA 2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266 154598 0 3: 2012 276 2 12 S1 001 1 9218868437227407266 5118863 21819477 NA 4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0 NA values are properly detected in columns which are not integer64. For V8 and V9, which fread() marks as integer64, instead of NAs we have "9218868437227407266". Interestingly enough, str() returns the respective values of V8 and V9 as NA:
str(DT) Classes ‘data.table’ and 'data.frame': 4 obs. of 11 variables: $ V1 : int 2012 2012 2012 2012 $ V2 : int 276 276 276 276 $ V3 : int NA 2 2 2 $ V4 : int 0 8 12 0 $ V5 : chr "S1" "S1" "S1" "S1" $ V6 : chr "001" "001" "001" "011" $ V7 : int 1 1 1 8 $ V8 :Class 'integer64' num [1:4] NA NA NA 1.55e-314 $ V9 :Class 'integer64' num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314 $ V10:Class 'integer64' num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314 $ V11: int NA 0 NA 0 - attr(*, ".internal.selfref")=<externalptr> ... but nothing else sees them as NA:
is.na(DT$V8) [1] FALSE FALSE FALSE FALSE max(DT$V8) integer64 [1] 9218868437227407266 > max(DT$V8, na.rm=TRUE) integer64 [1] 9218868437227407266 > class(DT$V8) [1] "integer64" > typeof(DT$V8) [1] "double" It does not seem to be a print/screen issue only, data.table sees them as huge integers:
DT[, V12:=as.numeric(V8)] Warning message: In as.double.integer64(V8) : integer precision lost while converting to double > DT V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 1: 2012 276 NA 0 S1 001 1 9218868437227407266 724135215 1590915056 NA 9.218868e+18 2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266 154598 0 9.218868e+18 3: 2012 276 2 12 S1 001 1 9218868437227407266 5118863 21819477 NA 9.218868e+18 4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0 3.127134e+09 Am I missing something about integer64, or is this a bug? As said above, I can get around using integer64="double", possibly losing some precision, as mentioned in the help file. But the unexpected behavior is with the default integer64...
This was done on a Windows 8.1 64-bit machine running Revolution R 3.0.2, and also on a virtual machine running kubuntu 13.10, CRAN-R 3.0.2. Tested with the latest stable data.table from CRAN (1.8.10 as of 7 Feb 2014) and 1.8.11 (rev. 1110, 2014-02-04 02:43:19, manually installed from the zip as the r-forge build is broken) on Windows, and only the stable 1.8.10 on linux. bit64 is installed and loaded on both machines.
> sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] grid stats graphics grDevices utils datasets methods base other attached packages: [1] bit64_0.9-3 bit_1.1-11 gdata_2.13.2 xts_0.9-7 zoo_1.7-10 nlme_3.1-113 hexbin_1.26.3 lattice_0.20-24 ggplot2_0.9.3.1 [10] plyr_1.8 reshape2_1.2.2 data.table_1.8.11 Revobase_7.0.0 RevoMods_7.0.0 RevoScaleR_7.0.0 loaded via a namespace (and not attached): [1] codetools_0.2-8 colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4 foreach_1.4.1 gtable_0.1.2 gtools_3.2.1 iterators_1.0.6 [9] labeling_0.2 MASS_7.3-29 munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5 reshape_0.8.4 scales_0.2.3 stringr_0.6.2 [17] tools_3.0.2