I have a file looks like:
a 1,2,3,5
b 4,5,6,7
c 5,6,7,8
...
That the separator between 1st and 2nd is '\t', other separators are comma. How can I read this kind of data set as as dataframe having 5 fields.
I'd probably do this.
read.table(text = gsub(",", "\t", readLines("file.txt")))
V1 V2 V3 V4 V5
1 a 1 2 3 5
2 b 4 5 6 7
3 c 5 6 7 8
Unpacking that just a bit:
readLines()
reads the file into R as a character vector with one element for each line.gsub(",", "\t", ...)
replaces every comma with a tab, so that now we've got lines with just one kind of separating character.- The
text =
argument toread.table()
lets it know you are passing it a character vector to be read directly (rather than the name of a file containing your text data).
"Balanced" data
Judging by the way you've phrased your question, it seems that you know that your data are "balanced" (rectangular).
Are you looking for speedier options? You might want to combine fread
from "data.table" with my experimental concat.split.DT
function.
The solution would look something like (replace " "
with "\t"
for a tab):
concat.split.DT(fread("yourfile.txt", sep = " ", header=FALSE), "V2", ",")
Let's make up some data:
x <- c("a\t1,2,3,5", "b\t4,5,6,7","c\t5,6,7,8")
X <- c(replicate(10000, x))
temp <- tempfile()
writeLines(X, temp, sep="\n") ## Write it to a temporary file
Josh's answer:
system.time(out1 <- read.table(text = gsub(",", "\t", readLines(temp))))
# user system elapsed
# 0.679 0.000 0.676
head(out1)
# V1 V2 V3 V4 V5
# 1 a 1 2 3 5
# 2 b 4 5 6 7
# 3 c 5 6 7 8
# 4 a 1 2 3 5
# 5 b 4 5 6 7
# 6 c 5 6 7 8
dim(out1)
# [1] 30000 5
fread
+ concat.split.DT
(which is like using fread
twice, but is still super fast):
system.time(out2 <- concat.split.DT(fread(temp, sep = "\t", header=FALSE), "V2", ","))
# user system elapsed
# 0.027 0.000 0.028
head(out2)
# V1 V2_1 V2_2 V2_3 V2_4
# 1: a 1 2 3 5
# 2: b 4 5 6 7
# 3: c 5 6 7 8
# 4: a 1 2 3 5
# 5: b 4 5 6 7
# 6: c 5 6 7 8
dim(out2)
# [1] 30000 5
"Unbalanced" data
Although it doesn't apply to your problem, I should mention this for the benefit of others who might need to solve a similar problem:
One limitation of the above is that concat.split.DT
only handles "balanced" data. fread
doesn't have a fill
argument like read.table
does (and I seem to remember reading somewhere that it most likely won't have such an argument).
Here's an example of what I mean by unbalanced:
x2 <- c("a\t1,2,3,5,6,7", "b\t4,5,6,7","c\t5,6,7,8,9,10,11,12,13")
X2 <- c(replicate(10000, x2))
temp2 <- tempfile()
writeLines(X2, temp2, sep="\n")
read.table
can handle that with the fill = TRUE
argument:
system.time(out1b <- read.table(text = gsub(",", "\t", readLines(temp2)), fill=TRUE))
# user system elapsed
# 1.151 0.000 1.152
head(out1b)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 a 1 2 3 5 6 7 NA NA NA
# 2 b 4 5 6 7 NA NA NA NA NA
# 3 c 5 6 7 8 9 10 11 12 13
# 4 a 1 2 3 5 6 7 NA NA NA
# 5 b 4 5 6 7 NA NA NA NA NA
# 6 c 5 6 7 8 9 10 11 12 13
concat.split.DT
will give you a nasty error in such cases, but you can try my cSplit
function instead. It's not nearly as fast, but still performs decently:
system.time(out2b <- cSplit(fread(temp2, sep = "\t", header=FALSE), "V2", ","))
# user system elapsed
# 0.393 0.004 0.399
head(out2b)
# V1 V2_1 V2_2 V2_3 V2_4 V2_5 V2_6 V2_7 V2_8 V2_9
# 1: a 1 2 3 5 6 7 NA NA NA
# 2: b 4 5 6 7 NA NA NA NA NA
# 3: c 5 6 7 8 9 10 11 12 13
# 4: a 1 2 3 5 6 7 NA NA NA
# 5: b 4 5 6 7 NA NA NA NA NA
# 6: c 5 6 7 8 9 10 11 12 13
Scanner scan = new Scanner(file);
while (scan.hasNextLine()) {
String[] a = scan.nextLine().replace("\\t", ",").split(",");
//do something with the array
}
scan.close();
This did:
- create a scanner to process the file (Scanner scan)
- scan in the next file line (scan.nextLine()) for each file line based on hasNextLine()
- replaced tabs with commas (.replace("\t", ",")), so the separators were all the same
- split into an array by commas. Now you can process all the data alike regardless of the length of each line.
- Don't forget to close the scanner when you're done.
来源:https://stackoverflow.com/questions/23568981/how-to-read-data-with-different-separators