Question
Let's say that we have this dataframe:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")
Column ID indicates subject ID.
Column Visit indicates the visit number within each subject's series of visits.
Column Time indicates the time that has passed to reach a certain "State".
Column State indicates severity of a certain disease, where 5 means death. That means that you can fluctuate from worse states to better states, but you can never improve from category 5, since you are dead.
I would like to identify only those subjects that improved from category 5 to a better one, since these are errors in the data frame (i.e. rows 13 and 16).
Additionally, I would like to remove those rows where a subject seems to have died more than once (i.e. row 18).
I asked a similar question before, but it was very general and implied that all fluctuations to a better state should be removed from the dataset, which is not what I actually want.
Answer 1:
Answer to modified question
The OP has modified the question substantially: all rows that appear after the first occurrence of State 5 (death) are now to be considered erroneous. This includes false recoveries (as in rows 13 and 16) as well as "duplicated deaths" (as in rows 17 and 18).
An answer to this requires a completely different approach. One possibility is to use a non-equi join:
library(data.table)
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]
    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE
The number of the first visit with State 5 is returned by
x[, first(Visit[State == 5]), by = ID]
   ID V1
1:  C  4
2:  D  2
In the subsequent non-equi join only those rows are marked which appear after the first State 5 event.
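For readers without data.table, the same idea (flag every row that comes after a subject's first State 5) can be sketched in base R. This is only an illustrative alternative, not the method used above: the names is_dead, deaths_before and cleaned are made up for this sketch, and it assumes the rows are already in chronological order within each ID, whereas the join above compares Visit numbers.

```r
# Sample data as defined in the Data section of this answer
x <- data.frame(
  ID    = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time  = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# Per ID, count the State 5 rows that occur strictly before each row.
# Assumption: rows are already in chronological order within each ID.
is_dead <- as.integer(x$State == 5)
deaths_before <- ave(is_dead, x$ID, FUN = function(s) cumsum(s) - s)

# A row is erroneous if the subject had already reached State 5 earlier.
x$error <- deaths_before > 0

# Drop the erroneous rows to obtain the cleaned data.
cleaned <- x[!x$error, ]
```

Unlike the join, this sketch marks unaffected rows with FALSE rather than NA. On the sample data it flags the same rows, 13 and 16 to 18.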
Data
x <- data.frame(
ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
Answer 2:
Answer to the original question
The OP has requested to identify errors in the data frame where State 5 is followed by any State < 5 for each ID. In the sample data set rows 13 and 16 should be marked.
The answer of Hardik gupta points in the right direction but does not return the expected result: rows 12 and 15 are marked instead of rows 13 and 16, and a false alarm is raised for row 17.
Three changes are required. Two of them concern the call to shift(): (1) use lag instead of lead and (2) supply a fill value. The third concerns the sample data set (see Data):
library(data.table)
setDT(x)[, error := State < 5 & shift(State, fill = 0) == 5, by = ID][]
    ID Visit Time State error
 1:  A     1 10.0     1 FALSE
 2:  A     2 12.5     3 FALSE
 3:  A     3 15.0     4 FALSE
 4:  B     1  2.0     1 FALSE
 5:  B     2  3.4     2 FALSE
 6:  B     3  5.7     3 FALSE
 7:  B     2  8.0     4 FALSE
 8:  B     3  9.5     3 FALSE
 9:  C     1  1.0     2 FALSE
10:  C     2  5.6     2 FALSE
11:  C     3  8.9     3 FALSE
12:  C     4 10.0     5 FALSE
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3 FALSE
15:  D     2  3.4     5 FALSE
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5 FALSE
18:  D     5 10.5     5 FALSE
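To see why lag and the fill value matter, here is a small sketch on the State values of subject D alone. The variable names s, lagged, led and flagged are made up for illustration. The arguments are passed by name because, as far as I know, the third positional argument of shift() is fill, not type.

```r
library(data.table)

s <- c(3, 5, 4, 5, 5)              # State values of subject D, in visit order

# Lag: each position sees the previous State; the first one gets the fill value
lagged <- shift(s, n = 1, fill = 0, type = "lag")   # 0 3 5 4 5

# Lead: each position sees the next State; the last one gets NA by default
led <- shift(s, n = 1, type = "lead")               # 5 4 5 5 NA

# A false recovery is a State below 5 directly preceded by State 5
flagged <- s < 5 & lagged == 5                      # FALSE FALSE TRUE FALSE FALSE
```

With fill = 0 the first visit of each subject is compared against 0 instead of NA, so the result stays FALSE there rather than turning into NA.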
Data
The third change is required for creating the sample data set.
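cbind() returns a matrix, and a matrix can hold only one type, so the numbers are coerced to character alongside the IDs. as.data.frame() then turns every column into a factor (R versions before 4.0.0) or leaves it as character (R 4.0.0 and later). Either way, the columns consisting of numbers are no longer numeric. To avoid this, the sample data set needs to be defined as: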
x <- data.frame(
ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
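The type problem can be checked on a tiny example. This is a sketch with made-up objects m, bad and good, not part of the original data.

```r
# cbind() on a character vector and a numeric vector yields a character matrix
m <- cbind(c("A", "B"), c(1, 2.5))
is.character(m)      # TRUE: the numbers have become strings

# as.data.frame() keeps them non-numeric
# (factor before R 4.0.0, character from R 4.0.0 on)
bad <- as.data.frame(m)
sapply(bad, class)

# data.frame() with one vector per column preserves each column's type
good <- data.frame(ID = c("A", "B"), Time = c(1, 2.5))
sapply(good, class)
```

Comparisons such as State < 5 only behave as intended on the second form, where the numeric columns are really numeric.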
Answer 3:
You can use data.table and shift like this:
library(data.table)
setDT(x)[, status := ((State == 5) & (shift(State,1,"lead") != 5)), by = ID]
x
ID Visit Time State status
1: A 1 10 1 FALSE
2: A 2 12.5 3 FALSE
3: A 3 15 4 FALSE
4: B 1 2 1 FALSE
5: B 2 3.4 2 FALSE
6: B 3 5.7 3 FALSE
7: B 2 8 4 FALSE
8: B 3 9.5 3 FALSE
9: C 1 1 2 FALSE
10: C 2 5.6 2 FALSE
11: C 3 8.9 3 FALSE
12: C 4 10 5 TRUE
13: C 5 11 2 FALSE
14: D 1 2 3 FALSE
15: D 2 3.4 5 TRUE
16: D 3 6 4 FALSE
17: D 4 8 5 TRUE
18: D 5 10.5 5 FALSE
Answer 4:
I'm still unclear about what you'd like to do. Aren't rows 12, 15 and 17 the erroneous ones that should be removed?
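# Note: tmp is not defined in this answer; it is presumably x split into a list by ID.
do.call(rbind.data.frame, lapply(tmp, function(w) {
  # Flag each State 5 row that is followed by an equal or lower State.
  # Pad with FALSE so idx has one entry per row (the last row has no successor).
  idx <- c(diff(w$State) <= 0 & w$State[-length(w$State)] == 5, FALSE)
  w[!idx, ]
}))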
# ID Visit Time State
#A.1 A 1 10 1
#A.2 A 2 12.5 3
#A.3 A 3 15 4
#B.4 B 1 2 1
#B.5 B 2 3.4 2
#B.7 B 2 8 4
#B.6 B 3 5.7 3
#B.8 B 3 9.5 3
#C.9 C 1 1 2
#C.10 C 2 5.6 2
#C.11 C 3 8.9 3
#C.13 C 5 11 2
#D.14 D 1 2 3
#D.16 D 3 6 4
#D.18 D 5 10.5 5
Source: https://stackoverflow.com/questions/47746656/how-to-remove-inconsistencies-from-dataframe-time-series