Question
I'm trying to read a text file into R that looks like this:
Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87
Etc.
My problem is that the file is space-delimited, but the first variable has some entries that contain a space, so reading it into R fails because different rows appear to have different numbers of columns. Does anyone know a way to read this in?
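For illustration, a minimal sketch of the failure, assuming the data above is saved as animals.txt (the file name is just a placeholder):

read.table("animals.txt")
# Error in scan(file = file, what = what, ...) :
#   line 2 did not have 5 elements   (exact message may vary)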
Answer 1:
Ben's approach works great, but here is another approach using strapplyc, gsubfn, or strapply from the gsubfn package.
First read in the data and set col.names, the separator and the pattern to use:
r <- readLines(textConnection(
"Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87"))
library(gsubfn)
col.names <- c("group", "x1", "x2", "x3")
sep <- "," # if comma can appear in fields use something else
pat <- "^(.*) +(\\d+) +(\\d+) +(\\d+) *$" # everything before the last three numbers, then the three numbers
1) strapplyc
tmp <- sapply(strapplyc(r, pat), paste, collapse = sep)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)
2) gsubfn. Alternately, the same code but with the last two statements replaced with:
tmp <- gsubfn(pat, ... ~ paste(..., sep = sep), r)
read.table(text = tmp, col.names = col.names, as.is = TRUE, sep = sep)
3) strapply. This one and the variation that follows do not require that sep be defined.
library(data.table)
tmp <- strapply(r, pat,
  ~ data.table(
      group = group,
      x1 = as.numeric(x1),
      x2 = as.numeric(x2),
      x3 = as.numeric(x3)
    ))
rbindlist(tmp)
3a) This one involves some extra manipulation, so we might favor one of the other solutions instead, but for completeness here it is. The combine = list prevents the individual outputs from being munged, and simplify = c removes the extra layer that combine = list added. Finally we rbind everything together.
tmp <- strapply(r, pat,
  ~ data.frame(
      group = group,
      x1 = as.numeric(x1),
      x2 = as.numeric(x2),
      x3 = as.numeric(x3),
      stringsAsFactors = FALSE
    ), combine = list, simplify = c)
do.call(rbind, tmp)
4) read.pattern The development version of the gsubfn package has a new function read.pattern that is particularly direct for this type of problem:
library(devtools) # source_url
source_url("https://gsubfn.googlecode.com/svn/trunk/R/read.pattern.R") # from dev repo
read.pattern(text = r, pattern = pat, col.names = col.names, as.is = TRUE)
Note: These approaches have a couple of advantages (though Ben's approach could be modified to handle these cases as well). Because the pattern takes everything before the last three numbers and uses it as the first field, it still works if the first field has three or more words or if one of its "words" is itself a set of digits (e.g. "17 inch ant farm").
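A quick sanity check of that point, reusing pat from above (a sketch; the output is shown as a comment):

strapplyc("17 inch ant farm 45 67 89", pat)[[1]]
# [1] "17 inch ant farm" "45" "67" "89"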
Answer 2:
Read the data set into a character vector (I'm using textConnection() to avoid creating a test file; you can just readLines("your_file.txt")):
r <- readLines(textConnection(
"Ant farm 45 67 89
Cookie 5 43 21
Mouse hole 5 87 32
Ferret 3 56 87"))
Put (single) quotation marks around space-separated words:
r2 <- gsub("([[:alpha:]]+) +([[:alpha:]]+)","'\\1 \\2'",r)
(As @CarlWitthoft suggests below, if you don't mind replacing the spaces with a different separator such as _, you could use gsub(" +([[:alpha:]]+)","_\\1",r) instead.)
Now read the results:
dat <- read.table(textConnection(r2))
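The underscore variant mentioned above works end to end in the same way; a short sketch (the names r2u and dat2 are placeholders, and the underscore stays in the values unless you substitute it back afterwards):

r2u <- gsub(" +([[:alpha:]]+)", "_\\1", r)   # "Ant farm ..." -> "Ant_farm ..."
dat2 <- read.table(textConnection(r2u))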
If your file is huge it would be better to do this outside R with command-line tools such as sed ...
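For instance, something along these lines would do the underscore replacement in sed and feed the result straight into read.table (the pattern and the file name your_file.txt are assumptions, so test against your own data):

dat <- read.table(pipe('sed -E "s/([[:alpha:]]+) +([[:alpha:]]+)/\\1_\\2/" your_file.txt'))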
Answer 3:
Assuming you are on Linux or OS X and the file to be read in is called test:
read.table(pipe('perl -pe "s/(\\D+) (\\d+) (\\d+) (\\d+)/\\1\t\\2\t\\3\t\\4/" test'), sep='\t')
You can also make a more general function using the same approach to read any typed input:
read_typed = function(file, types, sep = ' ', ...) {
  # regular-expression fragment for each supported column type
  all_types = c('character' = '([\\w ]+)', 'integer' = '(\\d+)',
                'numeric' = '([\\d.eE\\-+]+)', 'logical' = '([TF]|TRUE|FALSE)')
  # build a perl command that rewrites each line as tab-separated fields
  command = paste0('perl -pe "s/', paste0(all_types[types], collapse = sep),
                   '/',
                   paste0('\\', seq_along(types), collapse = '\t'),
                   '/" ', file)
  read.table(pipe(command), sep = '\t', ...)
}
read_typed('test', c("character", 'integer', 'integer', 'integer'))
Source: https://stackoverflow.com/questions/20806811/reading-a-space-delimited-text-file-where-first-column-also-has-spaces