Read fixed width text file

99封情书 提交于 2019-11-26 00:28:26

问题


I\'m trying to load this ugly-formatted data-set into my R session: http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for

Weekly SST data starts week centered on 3Jan1990

Nino1+2      Nino3        Nino34        Nino4
Week          SST SSTA     SST SSTA     SST SSTA     SST SSTA 
03JAN1990     23.4-0.4     25.1-0.3     26.6 0.0     28.6 0.3 
10JAN1990     23.4-0.8     25.2-0.3     26.6 0.1     28.6 0.3 
17JAN1990     24.2-0.3     25.3-0.3     26.5-0.1     28.6 0.3

So far, i can read the lines with

  x = readLines(path)

But the file mixes \'white space\' with \'-\' as separators, and i\'m not a regex expert. I Appreciate any help on turning this into a nice and clean R data-frame. thanks!


回答1:


This is a fixed width file. Use read.fwf() to read it:

x <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  skip=4,
  widths=c(12, 7, 4, 9, 4, 9, 4, 9, 4))

head(x)

            V1   V2   V3   V4   V5   V6   V7   V8  V9
1  03JAN1990   23.4 -0.4 25.1 -0.3 26.6  0.0 28.6 0.3
2  10JAN1990   23.4 -0.8 25.2 -0.3 26.6  0.1 28.6 0.3
3  17JAN1990   24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6 0.3
4  24JAN1990   24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4 0.2
5  31JAN1990   25.1 -0.2 25.8 -0.2 26.7  0.1 28.4 0.2
6  07FEB1990   25.8  0.2 26.1 -0.1 26.8  0.1 28.4 0.3

Update

The package readr (released April, 2015) provides a simple and fast alternative.

library(readr)

x <- read_fwf(
  file="http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for",   
  skip=4,
  fwf_widths(c(12, 7, 4, 9, 4, 9, 4, 9, 4)))

Speed comparison: readr::read_fwf() was ~2x faster than utils::read.fwf ().




回答2:


Another way to determine widths...

df <- read.fwf(
  file=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"),
  widths=c(-1, 9, -5, 4, 4, -5, 4, 4, -5, 4, 4, -5, 4, 4),
  skip=4
)

The -1 in the widths argument says there is a one-character column that should be ignored,the -5 in the widths argument says there is a five-character column that should be ignored, likewise...

ref : https://www.inkling.com/read/r-cookbook-paul-teetor-1st/chapter-4/recipe-4-6




回答3:


First off, that question is directly from a the Coursera "Get Data and Clean It" course by Leeks. While there is another part of the question, the tough part is reading the file.

That said, the course is mostly intended for learning.

I hate R's fixed width procedure. It is slow and for large number of variables, it very quickly becomes a pain to negate certain columns, etc.

I think its easier to use readLines() and then from that use substr() to make your variables

x <- readLines(con=url("http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for"))

# Skip 4 lines
x <- x[-(1:4)]

mydata <- data.frame(var1 = substr(x, 1, 10),
                     var2 = substr(x, 16, 19),
                     var3 = substr(x, 20, 23),
                     var4 = substr(x, 29, 32)  # and so on and so on
                     )



回答4:


You can now use the read_fwf() function in Hadley Wickham's readr package.

  • Annoucement: http://blog.rstudio.org/2015/04/09/readr-0-1-0/
  • Development page: https://github.com/hadley/readr
  • CRAN page: http://cran.r-project.org/web/packages/readr/index.html

A huge performance improvement is to be expected, compared to base read.fwf().




回答5:


I document here the list of alternatives for reading fixed-width files in R, as well as providing some benchmarks for which is fastest.

My preferred approach is to combine fread with stringi; it's competitive as the fastest approach, and has the added benefit (IMO) of storing your data as a data.table:

library(data.table)
library(stringi)

col_ends <- 
  list(beg = c(1, 10, 15, 19, 23, 28, 32, 36,
               41, 45, 49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35,
               40, 44, 48, 53, 57, 61))

data = fread(
  "http://www.cpc.ncep.noaa.gov/data/indices/wksst8110.for", 
  header = FALSE, skip = 4L, sep = NULL
  )[, lapply(1:(length(col_ends$beg)),
             function(ii) 
               stri_sub(V1, col_ends$beg[ii], col_ends$end[ii]))
    ][ , paste0("V", c(2, 5, 8, 11)) := NULL]
#              V1   V3   V4   V6   V7   V9  V10  V12  V13
#    1: 03JAN1990 23.4 -0.4 25.1 -0.3 26.6  0.0 28.6  0.3
#    2: 10JAN1990 23.4 -0.8 25.2 -0.3 26.6  0.1 28.6  0.3
#    3: 17JAN1990 24.2 -0.3 25.3 -0.3 26.5 -0.1 28.6  0.3
#    4: 24JAN1990 24.4 -0.5 25.5 -0.4 26.5 -0.1 28.4  0.2
#    5: 31JAN1990 25.1 -0.2 25.8 -0.2 26.7  0.1 28.4  0.2
#   ---                                                  
# 1365: 24FEB2016 27.1  0.9 28.4  1.8 29.0  2.1 29.5  1.4
# 1366: 02MAR2016 27.3  1.0 28.6  1.8 28.9  1.9 29.5  1.4
# 1367: 09MAR2016 27.7  1.2 28.6  1.6 28.9  1.8 29.6  1.5
# 1368: 16MAR2016 27.5  1.0 28.8  1.7 28.9  1.7 29.6  1.4
# 1369: 23MAR2016 27.2  0.9 28.6  1.4 28.8  1.5 29.5  1.2

Note that fread automatically strips leading and trailing whitespace -- sometimes, this is undesirable, in which case set strip.white = FALSE.


We could also have started with a vector of column widths ww by doing:

ww <- c(9, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4, 4)
nd <- cumsum(ww)

col_ends <-
  list(beg = c(1, nd[-length(nd)]+1L),
       end = nd)

And we could have picked which columns to exclude more robustly by using negative indices like:

col_ends <- 
  list(beg = c(1, -10, 15, 19, -23, 28, 32, -36,
               41, 45, -49, 54, 58),
       end = c(9, 14, 18, 22, 27, 31, 35,
               40, 44, 48, 53, 57, 61))

Then replace col_ends$beg[ii] with abs(col_ends$beg[ii]) and in the next line:

paste0("V", which(col_ends$beg < 0))

Lastly, if you want the column names to be read programatically as well, you could clean up with readLines:

cols <-
  gsub("\\s", "", 
       sapply(1:(length(col_ends$beg)),
              function(ii) 
                stri_sub(readLines(URL, n = 4L)[4L], 
                         col_ends$beg[ii]+1L,
                         col_ends$end[ii]+1L)))

cols <- cols[cols != ""]

(note that combining this step with fread would require creating a copy of the table in order to remove the header row, and would thus be inefficient for large data sets)




回答6:


I don't know a thing about R, but I can provide you with a regex that will match such lines:

\s[0-9]{2}[A-Z]{3}[0-9]{4}(\s{5}[0-9]+\.[0-9]+[ -][0-9]+\.[0-9]+){4}


来源:https://stackoverflow.com/questions/14383710/read-fixed-width-text-file

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!