问题
I have output data from a piece of equipment. Unfortunately the output data is not organized very well, and I have been writing a code in R to break it down. Essentially the data is a separate list of information (basic descriptive information, and raw data for two different measurements A and B for each time interval) for each subject pasted into one long document. For example:
Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: Subject2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
I have written a code in R that works, but is not very elegant, using split(seq_along), for-loops, and do.call (based primarily on this stack overflow question and this blog post).
# First read text file in as a character vector called ‘example’
scan("example_file.txt", what="character", strip.white=T, sep="\n") -> example
# Separate the header text (before the colon) from the proceeding data
# and make that text name the components of the vector
regmatches(example, regexpr(example, pattern="[[:alnum:]]+:", useBytes = F)) -> names(example)
gsub(example, pattern="[[:print:]]+: ", replacement="", useBytes = F)-> example.2
# Then, split character vector into a list based on how many lines are
# dedicated to each subject (in this example, 11 lines); based on SE
# answer cited above
strsplit(example.2, "([A-Z]:)") -> example.3
split(as.list(example.3), ceiling(seq_along(example.2)/11)) -> example.4
# Use a for-loop to systematically add the data together for subjects 1
# and 2 for time interval 1, using the method detailed from a blog post
# (cited above)
my.list <- list()
for(i in 1:2){
strsplit(as.character(example.4[[i]][5]), split="[[:blank:]]+") -> A
strsplit(as.character(example.4[[i]][9]), split="[[:blank:]]+")-> B
as.vector(c(as.character(example.4[[i]][3]), "A", unlist(A))) -> A_char
as.vector(c(as.character(example.4[[i]][3]), "B", unlist(B))) -> B_char
paste(as.character(example.4[[i]][3]), "Measure_A") -> a_name
paste(as.character(example.4[[i]][3]), "Measure_B") -> b_name
my.list[[a_name]] <- A_char
my.list[[b_name]] <- B_char
}
final.data <- do.call(rbind, my.list)
as.data.frame(final.data) -> final.data
names(final.data) <- c("Subject", "Measure", "V1", "V2", "V3", "V4")
I can extract the data for a single time interval for A and B across all subjects using my code (for example, the lines "1: 1 2 4 1" and "1: 2 3 0 1" above) and put put all the information together in a data frame. Where is gets messy is when I want to do this for all of the time intervals, not just one time interval. I can't figure out how to do this without running separate for-loops for each time interval. I tried doing a for-loop within a for-loop, but that didn't work. I also couldn’t figure out how to do this with the apply()-type functions.
If I only had 3 time intervals, as per this example, this issue wouldn’t be so bad, but my actual data is a lot longer. Any suggestions for a more elegant and concise approach would be appreciated!
P.S. I am aware that the final data frame that the above code gives has redundant row names. However, this is a helpful way of making sure that the final data frame’s subject and measure information lines up with the labels I had applied to earlier R objects.
回答1:
This does everything but the rownames:
lines <- readLines(textConnection("Date: 01/01/2016
Time: 12:00:00
Subject: Subject1
A:
1: 1 2 4 1
2: 2 1 2 3
3: 1 0 2 7
B:
1: 2 3 0 1
2: 4 1 1 2
3: 3 5 2 8
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
Date: 01/01/2016
Time: 12:00:00
Subject: 2
A:
1: 8 2 0 1
2: 9 1 2 7
3: 1 6 2 7
B:
1: 2 3 2 0
2: 6 7 1 2
3: 3 3 2 4
3: 3 3 2 4"))
Some libraries we'll need for the non-base R solution:
library(purrr)
library(tibble)
library(tidyr)
library(dplyr)
Trim whitespace and filter out blank lines:
trimws(lines) %>% discard(`==`, "") -> lines
This makes a vector of the indexes in lines where the records start (which is designated by finding Date: at the beginning of a line):
starts <- which(grepl("^Date:", lines))
Now, we take those starts and look for the next occurrence of Date: (i.e. the next record). It's going to find them all, so we only care about the first one. To calculate that index, we add the start index and subtract 1. In theory there will only be one NA (i.e. the last record) but we lazily use ifelse vs just change he last one.
ends <- map_dbl(starts, function(i) {
which(grepl("^Date:", lines[(i+1):length(lines)]))[1]+i-1
})
ends <- ifelse(is.na(ends), length(lines), ends)
So, now starts contains the indexes of the start of each record and ends contains the indexes of the ends of each record.
The map2_df() is super handy pseudo-wrapper for mapply() & do.call(rbind,…). We use the fact that these are in DCF format (key: value) and use read.dcf(). That makes a matrix and we then re-orient it and turn it into a data.frame.
We then separate the values, add the row names to make a time_interval column, add in the date, time and subject and make sure the columns are the right type.
We also use the fact that map2_df() will use the named list "keys" as a column if we tell it to.
Finally, we reorder the columns.
So, this will iterate over starts and ends and pass each iteration into start and end:
map2_df(starts, ends, function(start, end) {
# now, we extract just the current record into `record` by pulling
# out lines by the indexes.
record <- lines[start:end]
# we then use `read.dcf` to read in the date/subject/time values:
header <- as.data.frame(read.dcf(textConnection(record[1:3])))
# Since we do not have blank lines and you said the records were
# uniform we can use the fact that they'll be at known index
# positions in this `record`. So, we make a list of two vectors
# which are the indexes. Each becomes `i` (two total iterations)
# and we use the value in `i` to extract out the three lines from
# `record` and read those via `read.dcf`.
# But that reads things into a matrix and in an unhelpful order
# so we transpose it into shape and make it a data frame since
# we'll ultimately need that.
# We use `separate` to take the single character space-separated
# `V1` column and turn it into 4 columns. `read.dcf` gave us
# named rows for each time interval so we promote that to a
# full-on column and then add in date/time/subject, ensuring
# they are characters and not factors, then ensure that the
# values we split out from `V1` are numeric and not character or
# factor.
# `map_df` can add in the `A` and `B` from the named list we passed
# in for us and we have it call that column `measure`.
# finally, we put the columns in a better order.
map_df(list(A=5:7, B=9:11), function(i) {
read.dcf(textConnection(record[i])) %>%
t() %>% as_data_frame() %>%
separate(V1, sprintf("V%d", 1:4)) %>%
rownames_to_column("time_interval") %>%
mutate(date=as.character(header$Date),
time=as.character(header$Time),
subject=header$Subject) %>%
mutate_at(vars(starts_with("V")), as.numeric)
}, .id="measure")
}) %>%
select(date, time, subject, measure, time_interval, V1, V2, V3, V4)
That produces the following output:
## # A tibble: 18 x 9
## date time subject measure time_interval V1 V2 V3 V4
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 01/01/2016 12:00:00 Subject1 A 1 1 2 4 1
## 2 01/01/2016 12:00:00 Subject1 A 2 2 1 2 3
## 3 01/01/2016 12:00:00 Subject1 A 3 1 0 2 7
## 4 01/01/2016 12:00:00 Subject1 B 1 2 3 0 1
## 5 01/01/2016 12:00:00 Subject1 B 2 4 1 1 2
## 6 01/01/2016 12:00:00 Subject1 B 3 3 5 2 8
## 7 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 8 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 9 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 10 01/01/2016 12:00:00 2 B 1 2 3 2 0
## 11 01/01/2016 12:00:00 2 B 2 6 7 1 2
## 12 01/01/2016 12:00:00 2 B 3 3 3 2 4
## 13 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 14 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 15 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 16 01/01/2016 12:00:00 2 B 1 2 3 2 0
## 17 01/01/2016 12:00:00 2 B 2 6 7 1 2
## 18 01/01/2016 12:00:00 2 B 3 3 3 2 4
If you really need a base R solution then:
do.call(rbind, mapply(function(start, end) {
record <- lines[start:end]
header <- as.data.frame(read.dcf(textConnection(record[1:3])))
do.call(rbind, lapply(list(A=5:7, B=9:11), function(i) {
mat <- as.data.frame(t(read.dcf(textConnection(record[i]))))
mat <- matrix(unlist(apply(mat, 1, strsplit, split=" "), use.names=FALSE), ncol=4, byrow=TRUE)
mat <- as.data.frame(mat)
mat$time_interval <- 1:3
mat$date <- as.character(header$Date)
mat$time <- as.character(header$Time)
mat$subject <- as.character(header$Subject)
mat
})) -> df
df$measure <- gsub("\\..*$", "", rownames(df))
rownames(df) <- NULL
df
}, starts, ends, SIMPLIFY=FALSE)) -> out_df
out_df[,c("date", "time", "subject", "measure", "time_interval", "V1", "V2", "V3", "V4")]
## date time subject measure time_interval V1 V2 V3 V4
## 1 01/01/2016 12:00:00 Subject1 A 1 1 2 4 1
## 2 01/01/2016 12:00:00 Subject1 A 2 2 1 2 3
## 3 01/01/2016 12:00:00 Subject1 A 3 1 0 2 7
## 4 01/01/2016 12:00:00 Subject1 B 1 1 2 4 1
## 5 01/01/2016 12:00:00 Subject1 B 2 2 1 2 3
## 6 01/01/2016 12:00:00 Subject1 B 3 1 0 2 7
## 7 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 8 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 9 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 10 01/01/2016 12:00:00 2 B 1 8 2 0 1
## 11 01/01/2016 12:00:00 2 B 2 9 1 2 7
## 12 01/01/2016 12:00:00 2 B 3 1 6 2 7
## 13 01/01/2016 12:00:00 2 A 1 8 2 0 1
## 14 01/01/2016 12:00:00 2 A 2 9 1 2 7
## 15 01/01/2016 12:00:00 2 A 3 1 6 2 7
## 16 01/01/2016 12:00:00 2 B 1 8 2 0 1
## 17 01/01/2016 12:00:00 2 B 2 9 1 2 7
## 18 01/01/2016 12:00:00 2 B 3 1 6 2 7
回答2:
It is not clear that a data frame is the most convenient way to represent this data. The following shows three alternate outputs:
three arrays -- a matrix with one row per subject having Date time and subject columns, an
Aarray such thatA[,,i]is the A matrix for the ith subject and aBarray such thatB[,,i]is the B matrix for the ith subject. No packages are used.a data frame in wide form
a data frame in long form
No packages are used.
For all three, read the file into a character vector Lines. Then remove any blank lines using grep -- we could omit this step if we knew there were no blank lines. Then split Lines into subject groups s. Then lapply over the subject groups and within each grab the date, time and subject from the first three lines and the two matrices from lines 5:7 and 9:11 in a list with one component per subject. The key code to produce a list L easily reworked into different formats is just this:
Lines <- readLines("example_file.txt")
Lines <- grep("^\\s*$", Lines, value = TRUE, invert = TRUE)
s <- split(Lines, cumsum(grepl("^Date:", Lines)))
L <- lapply(s, function(x) list(read.dcf(textConnection(x[1:3])),
A = as.matrix(read.table(text = sub(":", "", x[5:7]), row.names = 1)),
B = as.matrix(read.table(text = sub(":", "", x[9:11]), row.names = 1))))
names(L) <- sapply(L, function(x) x[[1]][, "Subject"])
Given L we can easily create various output formats by using lapply over it. Each of the three formats is shown in a separate section below. The output is shown at the end so as not to break up the code.
three arrays
We could use L as is but may be more convenient to to convert L to three arrays: (1) ident which is a 3 column matrix having as many rows as subjects with the Date, time and Subject for each, (2) A which is a 3d array such that A[,,i] is the A matrix for the ith subject, (3) B which is a 3d array such that B[,,i] is the B matrix for the ith subject.
ident <- do.call(rbind, lapply(L, "[[", 1))
A <- simplify2array(lapply(L, "[[", 2))
B <- simplify2array(lapply(L, "[[", 3))
data.frame - wide form
DF <- do.call(rbind, lapply(L, function(x) data.frame(x[[1]], x[[2]], x[[3]])))
names(DF)[4:7] <- "A"
names(DF)[8:11] <- "B"
rownames(DF) <- NULL
data.frame - long form
DF2 <- do.call(rbind, lapply(L, function(x)
data.frame(x[[1]], rbind(cbind(AB = "A", x[[2]]), cbind(AB = "B", x[[3]])))))
rownames(DF2) <- NULL
output -- three arrays
> ident
Date Time Subject
[1,] "01/01/2016" "12:00:00" "Subject1"
[2,] "01/01/2016" "12:00:00" "2"
> A
, , Subject1
V2 V3 V4 V5
1 1 2 4 1
2 2 1 2 3
3 1 0 2 7
, , 2
V2 V3 V4 V5
1 8 2 0 1
2 9 1 2 7
3 1 6 2 7
> B
, , Subject1
V2 V3 V4 V5
1 2 3 0 1
2 4 1 1 2
3 3 5 2 8
, , 2
V2 V3 V4 V5
1 2 3 2 0
2 6 7 1 2
3 3 3 2 4
output -- data frame wide form
> DF
Date Time Subject A A A A B B B B
1 01/01/2016 12:00:00 Subject1 1 2 4 1 2 3 0 1
2 01/01/2016 12:00:00 Subject1 2 1 2 3 4 1 1 2
3 01/01/2016 12:00:00 Subject1 1 0 2 7 3 5 2 8
4 01/01/2016 12:00:00 2 8 2 0 1 2 3 2 0
5 01/01/2016 12:00:00 2 9 1 2 7 6 7 1 2
6 01/01/2016 12:00:00 2 1 6 2 7 3 3 2 4
output - data frame long form
> DF2
Date Time Subject AB V2 V3 V4 V5
1 01/01/2016 12:00:00 Subject1 A 1 2 4 1
2 01/01/2016 12:00:00 Subject1 A 2 1 2 3
3 01/01/2016 12:00:00 Subject1 A 1 0 2 7
4 01/01/2016 12:00:00 Subject1 B 2 3 0 1
5 01/01/2016 12:00:00 Subject1 B 4 1 1 2
6 01/01/2016 12:00:00 Subject1 B 3 5 2 8
7 01/01/2016 12:00:00 2 A 8 2 0 1
8 01/01/2016 12:00:00 2 A 9 1 2 7
9 01/01/2016 12:00:00 2 A 1 6 2 7
10 01/01/2016 12:00:00 2 B 2 3 2 0
11 01/01/2016 12:00:00 2 B 6 7 1 2
12 01/01/2016 12:00:00 2 B 3 3 2 4
来源:https://stackoverflow.com/questions/39085902/import-unusually-formatted-text-data-using-r