Reading a space delimited text file where first column also has spaces

☆樱花仙子☆ submitted on 2019-12-10 10:15:05

Question


I'm trying to read a text file into R that looks like this:

Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87

Etc.

My problem is that the file is space delimited and some entries in the first variable contain a space, so reading it into R fails with an error because some rows appear to have more columns than others. Does anyone know a way to read this in?
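To make the mismatch concrete, here is a minimal sketch in base R that counts the space-separated fields on each sample line. The variable names `txt` and `n_fields` are illustrative only, and the lines are held in a character vector rather than read from a file.

```r
# Sample lines from the question, kept in memory instead of a file.
txt <- c("Ant farm 45 67 89",
         "Cookie 5 43 21",
         "Mouse hole 5 87 32",
         "Ferret 3 56 87")

# Split each line on runs of spaces and count the pieces.
n_fields <- lengths(strsplit(txt, " +"))
print(n_fields)  # 5 4 5 4

# The rows disagree on the number of fields, which is what trips up
# a plain read.table() call that treats every space as a separator.
inconsistent <- length(unique(n_fields)) > 1
```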


Answer 1:


Ben's approach works great, but here are some alternatives using strapplyc, gsubfn or strapply from the gsubfn package.

First read in the data and set col.names, the separator and the pattern to use:

r <- readLines(textConnection(
 "Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87"))

library(gsubfn)

col.names <- c("group", "x1", "x2", "x3")
sep <- ","  # if comma can appear in fields use something else
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$"

1) strapplyc

tmp <- sapply(strapplyc(r, pat), paste, collapse = sep)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)

2) gsubfn. Alternatively, use the same code but replace the last two statements with:

tmp <- gsubfn(pat, ... ~ paste(..., sep = sep), r)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)

3) strapply. This one and the variation that follows do not require that sep be defined.

library(data.table)
tmp <- strapply(r, pat,
  ~ data.table(
      group = group, 
      x1 = as.numeric(x1), 
      x2 = as.numeric(x2), 
      x3 = as.numeric(x3)
    ))
rbindlist(tmp)

3a) This one involves some extra manipulation so we might favor one of the other solutions instead but for completeness here it is. The combine=list prevents the individual outputs from being munged and the simplify=c removes the extra layer that combine=list added. Finally we rbind everything together.

tmp <- strapply(r, pat,
  ~ data.frame(
      group = group, 
      x1 = as.numeric(x1), 
      x2 = as.numeric(x2), 
      x3 = as.numeric(x3),
      stringsAsFactors = FALSE
    ), combine = list, simplify = c)
do.call(rbind, tmp)

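4) read.pattern. When this answer was written, read.pattern existed only in the development version of the gsubfn package and had to be sourced from the development repository. It has since been included in released versions of gsubfn, so with a current install library(gsubfn) should be enough. It is particularly direct for this type of problem:

# Historical: how the function was loaded from the dev repo at the time
# library(devtools) # source_url
# source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R")

read.pattern(text = r, pattern = pat, col.names = col.names, as.is = TRUE)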

Note: These approaches take everything before the last 3 numbers as the first field. They therefore still work if the first field has 3 or more words, or if one of the "words" is a set of digits (e.g. "17 inch ant farm"). Ben's approach could be modified to handle these cases as well.
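The point about "17 inch ant farm" can be checked without gsubfn. The following is only a sketch in base R: `parse_line` is a hypothetical helper (not part of any package), and the second sample line is made up for illustration. It uses the same idea as pat above, with [0-9] in place of \\d.

```r
# Hypothetical helper: split one line into a free-text label followed by
# exactly three trailing integers, using base R regular expressions.
parse_line <- function(line) {
  pat <- "^(.*) +([0-9]+) +([0-9]+) +([0-9]+) *$"
  m <- regmatches(line, regexec(pat, line))[[1]]
  if (length(m) == 0) stop("line does not end in three numbers: ", line)
  # m[1] is the whole match; m[2..5] are the four captured groups.
  data.frame(group = m[2],
             x1 = as.numeric(m[3]),
             x2 = as.numeric(m[4]),
             x3 = as.numeric(m[5]),
             stringsAsFactors = FALSE)
}

r <- c("Ant farm 45 67 89",
       "17 inch ant farm 5 43 21",   # made-up line with digits in the label
       "Ferret 3 56 87")

dat <- do.call(rbind, lapply(r, parse_line))
dat
```

Because the three number groups are anchored to the end of the line, the leading "17" stays in the label instead of being read as a value.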




Answer 2:


Read the data set into a character vector (I'm using textConnection() to avoid creating a test file; you can just readLines("your_file.txt")):

r <- readLines(textConnection(
"Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87"))

Put (single) quotation marks around space-separated words:

r2 <- gsub("([[:alpha:]]+) +([[:alpha:]]+)","'\\1 \\2'",r)

(As @CarlWitthoft suggests, below, if you don't mind replacing the spaces with a different separator such as _, you could use gsub(" +([[:alpha:]]+)","_\\1",r) instead.)

Now read the results:

dat <- read.table(textConnection(r2))
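The underscore variant mentioned above can be sketched end to end as follows. This assumes the first column contains only alphabetic words and no underscores of its own; the names `r3` and `dat` are illustrative, and the last step (turning the underscores back into spaces) is an optional addition, not part of the original suggestion.

```r
r <- c("Ant farm 45 67 89",
       "Cookie 5 43 21",
       "Mouse hole 5 87 32",
       "Ferret 3 56 87")

# Join each alphabetic word to the word before it with "_",
# so "Ant farm 45 67 89" becomes "Ant_farm 45 67 89".
r3 <- gsub(" +([[:alpha:]]+)", "_\\1", r)

# Every line now has exactly four space-separated fields.
dat <- read.table(text = r3, stringsAsFactors = FALSE)

# Optional: restore the spaces in the first column.
dat$V1 <- gsub("_", " ", dat$V1, fixed = TRUE)
dat
```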

If your file is huge it would be better to do this outside R with command-line tools such as sed ...




Answer 3:


Assuming you are on Linux or OS X and the file to be read is called test:

read.table(pipe('perl -pe "s/(\\D+) (\\d+) (\\d+) (\\d+)/\\1\t\\2\t\\3\t\\4/" test'), sep='\t')

You can also write a more general function that uses the same approach to read any typed input:

read_typed = function(file, types, sep=' ', ...){

  all_types = c('character' = '([\\w ]+)', 'integer' = '(\\d+)', 'numeric' = '([\\d.eE\\-+]+)', 'logical' = '([TF]|TRUE|FALSE)')
  command = paste0('perl -pe "s/', paste0(all_types[types], collapse=sep),
                   '/',
                   paste0('\\', seq_along(types), collapse='\t'),
                   '/" ', file)
  read.table(sep='\t', pipe(command), ...)
}
read_typed('test', c("character", 'integer', 'integer', 'integer'))
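To see what read_typed hands to the shell, here is a sketch that builds the same command string without running perl. `build_typed_command` is a hypothetical helper introduced only for inspection; its body copies the string-building part of read_typed above.

```r
# Hypothetical helper: returns the shell command instead of piping it
# into read.table(), so the generated perl substitution can be inspected.
build_typed_command <- function(file, types, sep = " ") {
  all_types <- c(character = "([\\w ]+)",
                 integer   = "(\\d+)",
                 numeric   = "([\\d.eE\\-+]+)",
                 logical   = "([TF]|TRUE|FALSE)")
  paste0('perl -pe "s/', paste0(all_types[types], collapse = sep),
         "/",
         # Back-references \1, \2, ... joined by literal tab characters.
         paste0("\\", seq_along(types), collapse = "\t"),
         '/" ', file)
}

cmd <- build_typed_command("test",
                           c("character", "integer", "integer", "integer"))
cat(cmd, "\n")
```

The search pattern is one capture group per column joined by the input separator, and the replacement is the same groups joined by tabs, which is why read.table is then called with sep='\t'.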


Source: https://stackoverflow.com/questions/20806811/reading-a-space-delimited-text-file-where-first-column-also-has-spaces
