Faster way to read fixed-width files

Submitted by 冷眼眸甩不掉的悲伤 on 2019-11-26 14:28:52
MichaelChirico

Now that there are a fair number of options on offer for reading such files (between this and the other major question about reading fixed-width files efficiently), I think some benchmarking is appropriate.

I'll use the following on-the-large-side (400 MB) file for comparison. It's just a bunch of random characters with randomly defined fields and widths:

set.seed(21394)
wwidth = 400L
rrows = 1000000

#creating the contents at random and writing them straight to disk
write.table(replicate(rrows, paste0(sample(letters, wwidth, replace = TRUE),
                                    collapse = "")), file = "testfwf.txt",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

#defining the fields & writing a dictionary
n_fields = 40L
endpoints = unique(c(1L, sort(sample(wwidth, n_fields - 1L)), wwidth + 1L))
cols = list(beg = endpoints[-length(endpoints)],
            end = endpoints[-1L] - 1L)

#one row per field, with zero-based start offsets
dict = data.frame(column = paste0("V", seq_len(length(endpoints) - 1L)),
                  start = endpoints[-length(endpoints)] - 1,
                  length = diff(endpoints))

write.csv(dict, file = "testdic.csv", quote = FALSE, row.names = FALSE)
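
As a quick sanity check on the layout above, the following minimal sketch rebuilds the same kind of boundaries on a toy scale and confirms that the begin/end positions and the widths describe the same fields. The toy sizes and the names toy_width, toy_fields and toy_endpoints are made up for illustration; only base R is used.

```r
set.seed(1)
toy_width  <- 20L   # hypothetical line width
toy_fields <- 4L    # hypothetical (maximum) number of fields

# Field start positions, plus a sentinel one past the end of the line.
# unique() can drop a duplicate if sample() happens to draw position 1.
toy_endpoints <- unique(c(1L, sort(sample(toy_width, toy_fields - 1L)),
                          toy_width + 1L))

beg    <- toy_endpoints[-length(toy_endpoints)]  # first character of each field
end    <- toy_endpoints[-1L] - 1L                # last character of each field
widths <- diff(toy_endpoints)                    # width of each field

# Positions and widths must agree, and the widths must cover the whole line
stopifnot(all(end - beg + 1L == widths), sum(widths) == toy_width)

# Split one toy line into its fields and glue it back together
toy_line <- paste0(sample(letters, toy_width, replace = TRUE), collapse = "")
fields   <- substring(toy_line, beg, end)
stopifnot(identical(paste0(fields, collapse = ""), toy_line))
```

The same two views of the layout are what the benchmark relies on: the beg/end pairs feed stri_sub, while diff(endpoints) supplies the widths for the other readers.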

I'll compare five methods mentioned across these two threads (I'll add others if their authors would like): the base version (read.fwf), piping the result of in2csv to fread (@AnandaMahto's suggestion), Hadley's new readr (read_fwf), the LaF/ffbase approach (@jwijffls' suggestion), and an improved (streamlined) version of the approach suggested by the question author (@MarkDanese), combining fread with stri_sub from stringi.

Here is the benchmarking code:

library(data.table)
library(stringi)
library(readr)
library(LaF); library(ffbase)
library(microbenchmark)

microbenchmark(times = 5L,
               utils = read.fwf("testfwf.txt", diff(endpoints), header = FALSE),
               in2csv = 
                 fread(paste("in2csv -f fixed -s",
                             "~/Desktop/testdic.csv",
                             "~/Desktop/testfwf.txt")),
               readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
               LaF = {
                 my.data.laf = 
                   laf_open_fwf('testfwf.txt', column_widths=diff(endpoints),
                                column_types = rep("character", 
                                                   length(endpoints) - 1L))
                 my.data = laf_to_ffdf(my.data.laf, nrows = rrows)
                 as.data.frame(my.data)},
               fread = fread(
                 "testfwf.txt", header = FALSE, sep = "\n"
                 )[ , lapply(seq_len(length(cols$beg)),
                             function(ii) 
                               stri_sub(V1, cols$beg[ii], cols$end[ii]))])

And the output:

# Unit: seconds
#    expr       min        lq      mean    median        uq       max neval cld
#   utils 423.76786 465.39212 499.00109 501.87568 543.12382 560.84598     5   c
#  in2csv  67.74065  68.56549  69.60069  70.11774  70.18746  71.39210     5 a  
#   readr  10.57945  11.32205  15.70224  14.89057  19.54617  22.17298     5 a  
#     LaF 207.56267 236.39389 239.45985 237.96155 238.28316 277.09798     5  b 
#   fread  14.42617  15.44693  26.09877  15.76016  20.45481  64.40581     5 a  

So it seems readr and fread + stri_sub are pretty competitive as the fastest; built-in read.fwf is the clear loser.

Note that the real advantage of readr here is that you can pre-specify column types; with fread you'll have to type convert afterwards.
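
To illustrate what that after-the-fact conversion involves, here is a minimal base R sketch. It uses readLines, substring and type.convert as stand-ins for fread and stri_sub, and the three-field layout and sample lines are invented for the example. Each line is read as a single string, cut at fixed positions, and only then converted to proper column types.

```r
# Hypothetical layout: 2-char id, 3-digit count, 4-char price
beg <- c(1L, 3L, 6L)
end <- c(2L, 5L, 9L)

tmp <- tempfile(fileext = ".txt")
writeLines(c("ab0011.50", "cd0422.25", "ef1000.75"), tmp)

# Step 1: read every line as one string (the role fread plays in the benchmark)
lines <- readLines(tmp)

# Step 2: cut each line at the fixed positions; every column is character here
raw <- lapply(seq_along(beg), function(ii) substring(lines, beg[ii], end[ii]))
names(raw) <- c("id", "count", "price")
df <- as.data.frame(raw, stringsAsFactors = FALSE)

# Step 3: convert the types afterwards, column by column
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df)
```

With a reader that accepts column types up front, step 3 disappears, which is the advantage noted above.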

EDIT: Adding some alternatives

At @AnandaMahto's suggestion I am including some more options, including one that appears to be a new winner! To save time I excluded the slowest options above in the new comparison. Here's the new code:

library(iotools)

microbenchmark(times = 5L,
               readr = read_fwf("testfwf.txt", fwf_widths(diff(endpoints))),
               fread = fread(
                 "testfwf.txt", header = FALSE, sep = "\n"
                 )[ , lapply(seq_len(length(cols$beg)),
                             function(ii) 
                               stri_sub(V1, cols$beg[ii], cols$end[ii]))],
               iotools = input.file("testfwf.txt", formatter = dstrfw, 
                                    col_types = rep("character",
                                                    length(endpoints) - 1L), 
                                    widths = diff(endpoints)),
               awk = fread(paste(
                 "awk -v FIELDWIDTHS='", 
                 paste(diff(endpoints), collapse = " "), 
                 "' -v OFS=', ' '{$1=$1 \"\"; print}' < ~/Desktop/testfwf.txt", 
                 collapse = " "), header = FALSE))

And the new output:

# Unit: seconds
#     expr       min        lq      mean    median        uq       max neval cld
#    readr  7.892527  8.016857 10.293371  9.527409  9.807145 16.222916     5  a 
#    fread  9.652377  9.696135  9.796438  9.712686  9.807830 10.113160     5  a 
#  iotools  5.900362  7.591847  7.438049  7.799729  7.845727  8.052579     5  a 
#      awk 14.440489 14.457329 14.637879 14.472836 14.666587 15.152156     5   b

So it appears iotools is both very fast and very consistent.

You can use the LaF package, which was written to handle large fixed-width files (including files too large to fit into memory). To use it, you first need to open the file using laf_open_fwf. You can then index the resulting object as you would a normal data frame to read the data you need. In the example below I read the entire file, but you can also read specific columns and/or lines:

library(LaF)
laf <- laf_open_fwf("foo.dat", column_widths = cols, 
  column_types=rep("character", length(cols)),
  column_names = seervars)
seer9 <- laf[,]

Your example using 5000 lines (instead of your 500,000) took 28 seconds using read.fwf and 1.6 seconds using LaF.

Addition: Your example using 50,000 lines (instead of your 500,000) took 258 seconds using read.fwf and 7 seconds using LaF on my machine.

I'm not sure what OS you are using, but this worked pretty straightforwardly for me in Linux:

Step 1: Create a command for awk to convert the file to a csv

You can have it stored to an actual csv file if you plan to use the data in other software too.

myCommand <- paste(
  "awk -v FIELDWIDTHS='", 
  paste(cols, collapse = " "), 
  "' -v OFS=',' '{$1=$1 \"\"; print}' < ~/rawdata.txt", 
  collapse = " ")

Step 2: Use fread directly on that command that you just created

seer9 <- fread(myCommand)
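
To see what actually gets handed to the shell, here is a small sketch that builds the same kind of command with sprintf for a made-up three-field layout (the widths and the file name rawdata.txt are placeholders). Note that FIELDWIDTHS is a GNU awk feature, so this assumes the awk on your system is gawk.

```r
widths <- c(2L, 3L, 4L)      # hypothetical field widths
infile <- "rawdata.txt"      # hypothetical input file

# sprintf avoids the stray spaces that paste() inserts around the widths
cmd <- sprintf(
  "awk -v FIELDWIDTHS='%s' -v OFS=',' '{$1=$1 \"\"; print}' < %s",
  paste(widths, collapse = " "),
  shQuote(infile, type = "sh")
)

# Print the command so it can be inspected or tried in a terminal first
cat(cmd, "\n")
```

Printing the command before running it makes quoting mistakes easy to spot. Recent versions of data.table also accept the command through the explicit cmd argument, as in fread(cmd = cmd).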

I haven't timed this because I'm obviously using a slower system than you and Jan :-)

I wrote a parser for this kind of thing yesterday, but it was for a very specific kind of input to the header file, so I will show you how to format your column widths to be able to use it.

Converting your flat file to csv

First download the tool in question.

You can download the binary from the bin directory if you are on OS X Mavericks (which is where I compiled it), or compile it yourself by going to src and running clang++ csv_iterator.cpp parse.cpp main.cpp -o flatfileparser.

The flat file parser needs two files: the flat file itself and a CSV header file in which every fifth element specifies the variable width (again, this is due to my extremely specific application), which you can generate using:

cols = c(8,10,1,2,1,1,1,3,4,3,2,2,4,4,1,4,1,4,1,1,1,1,3,2,2,1,2,2,13,2,4,1,1,1,1,3,3,3,2,3,3,3,3,3,3,3,2,2,2,2,1,1,1,1,1,6,6,6,2,1,1,2,1,1,1,1,1,2,2,1,1,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,5,4,10,3,3,2,2,2,3,1,1,1,1,2,2,1,1,2,1,9,5,5,1,1,1,2,2,1,1,1,1,1,1,1,1,2,3,3,3,3,3,3,1,4,1,4,1,1,3,3,3,3,2,2,2,2)
writeLines(sapply(c(-1, cols), function(x) paste0(',,,,', x)), '~/tmp/header.csv')

Then copy the resulting ~/tmp/header.csv to the same directory as your flatfileparser. Move the flat file to the same directory as well, and you can run the parser on it:

./flatfileparser header.csv yourflatfile

which will produce yourflatfile.csv. Add the header you have above manually, using redirection (>> from Bash).
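
If you would rather add the header from R than from the shell, one possible sketch is to write the header line to a new file and then append the parser's output to it. The column names and file contents below are stand-ins, not the real data.

```r
col_names <- c("id", "count", "price")   # stand-in for your variable names

# Stand-in for the headerless CSV produced by the parser
body_csv <- tempfile(fileext = ".csv")
writeLines(c("ab,1,1.50", "cd,42,2.25"), body_csv)

# Write the header first, then append the data after it
with_header <- tempfile(fileext = ".csv")
writeLines(paste(col_names, collapse = ","), with_header)
ok <- file.append(with_header, body_csv)

# The first line is now the header, followed by the original rows
readLines(with_header)
```

file.append returns TRUE on success, so checking ok is a cheap way to catch a wrong path before reading the file back in.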

Reading in your CSV file quickly

Use Hadley's experimental fastread package by passing the filename to fastread::read_csv, which yields a data.frame. I don't believe it supports fwf files yet, although that is on the way.
