How to remove inconsistencies from dataframe (time series)

别来无恙 submitted on 2019-12-11 10:33:28

Question


Let's say that we have this dataframe:

x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                        c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
                        c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
                        c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5)))
colnames(x)<- c("ID", "Visit", "Time", "State")

Column ID indicates subject ID.

Column Visit indicates a series of visits.

Column Time indicates the time that has passed to reach a certain "State".

Column State indicates severity of a certain disease, where 5 means death. That means that you can fluctuate from worse states to better states, but you can never improve from category 5, since you are dead.

I would like to identify only those subjects that improved from category 5 to a better one, since these are errors in the data frame (i.e. rows 13 and 16).

Additionally, I would like to remove those rows where a subject seems to have died more than once (i.e. row 18).

I asked a similar question before, but it was very general and implied that all fluctuations to a better state should be removed from the dataset, which is not what I actually want.


Answer 1:


Answer to modified question

The OP has modified the question substantially by requesting that all rows appearing after the first occurrence of State 5 (death) be considered erroneous. This includes false recoveries (as in rows 13 and 16) as well as "duplicated deaths" (as in rows 17 and 18).

An answer to this requires a completely different approach. One possibility is to use a non-equi join:

library(data.table)
setDT(x)[x[, first(Visit[State == 5]), by = ID], on = .(ID, Visit > V1), error := TRUE][]
    ID Visit Time State error
 1:  A     1 10.0     1    NA
 2:  A     2 12.5     3    NA
 3:  A     3 15.0     4    NA
 4:  B     1  2.0     1    NA
 5:  B     2  3.4     2    NA
 6:  B     3  5.7     3    NA
 7:  B     2  8.0     4    NA
 8:  B     3  9.5     3    NA
 9:  C     1  1.0     2    NA
10:  C     2  5.6     2    NA
11:  C     3  8.9     3    NA
12:  C     4 10.0     5    NA
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3    NA
15:  D     2  3.4     5    NA
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5  TRUE
18:  D     5 10.5     5  TRUE

The number of the first visit with State 5 is returned by

x[, first(Visit[State == 5]), by = ID]
   ID V1
1:  C  4
2:  D  2

In the subsequent non-equi join, only those rows that appear after the first State 5 event are marked as TRUE; all other rows keep NA in the error column.
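The question also asks for the flagged rows to be removed, which the join above does not do by itself. As a rough illustration only, the following base R sketch flags every row after the first State 5 row per ID and then drops those rows. It is not the method used in this answer: the helper `after_first_death()` and the object `x_clean` are hypothetical names, and the sketch assumes that rows are already in chronological order within each ID (it looks at row order, not at the Visit value).

```r
# Sample data as defined in the Data section of this answer
x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))

# Hypothetical helper: TRUE for every row that follows the first State 5 row.
# Assumes the rows of one subject are passed in chronological order.
after_first_death <- function(state) {
  dead_so_far <- cumsum(state == 5) > 0   # TRUE from the first death onwards
  c(FALSE, head(dead_so_far, -1))         # shift down by one row
}

# ave() applies the helper per ID and returns 0/1, hence as.logical()
x$error <- as.logical(ave(x$State, x$ID, FUN = after_first_death))

# Keep only the rows that are not flagged
x_clean <- x[!x$error, ]

which(x$error)   # flagged row numbers
nrow(x_clean)    # number of remaining rows
```

Unlike the join, this variant yields FALSE rather than NA for unflagged rows, which makes the subsequent subsetting step straightforward.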

Data

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))



Answer 2:


Answer to the original question

The OP has requested to identify errors in the data frame where State 5 is followed by any State < 5 for each ID. In the sample data set rows 13 and 16 should be marked.

Hardik gupta's answer points in the right direction but does not return the expected result: rows 12 and 15 are marked instead of rows 13 and 16. Furthermore, there is a false alarm for row 17.

Three essential changes are required. The third concerns the sample data set and is described under Data below. The first two concern the code: (1) use lag instead of lead and (2) supply a fill value to shift():

library(data.table)
setDT(x)[, error := State < 5 & shift(State, fill = 0) == 5, by = ID][]
    ID Visit Time State error
 1:  A     1 10.0     1 FALSE
 2:  A     2 12.5     3 FALSE
 3:  A     3 15.0     4 FALSE
 4:  B     1  2.0     1 FALSE
 5:  B     2  3.4     2 FALSE
 6:  B     3  5.7     3 FALSE
 7:  B     2  8.0     4 FALSE
 8:  B     3  9.5     3 FALSE
 9:  C     1  1.0     2 FALSE
10:  C     2  5.6     2 FALSE
11:  C     3  8.9     3 FALSE
12:  C     4 10.0     5 FALSE
13:  C     5 11.0     2  TRUE
14:  D     1  2.0     3 FALSE
15:  D     2  3.4     5 FALSE
16:  D     3  6.0     4  TRUE
17:  D     4  8.0     5 FALSE
18:  D     5 10.5     5 FALSE
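To see why the lag direction and the fill value matter, here is a small base R sketch on the State values of subject D alone. It imitates a one-step lag by hand instead of calling shift(), so the variable names (`lag_fill`, `lag_na`, and so on) are illustrative only and not part of the answer's code.

```r
# State values of subject D, in visit order
state <- c(3, 5, 4, 5, 5)

# Hand-made lag: previous row's state, with 0 as fill for the first row
lag_fill <- c(0, head(state, -1))
# Same lag, but with NA for the first row (no fill value supplied)
lag_na <- c(NA, head(state, -1))

# Current state below 5 although the previous state was 5
with_fill    <- state < 5 & lag_fill == 5
without_fill <- state < 5 & lag_na == 5

with_fill      # FALSE FALSE  TRUE FALSE FALSE
without_fill   #    NA FALSE  TRUE FALSE FALSE
```

With the lag, the flag lands on the row that shows the impossible recovery (the third row of D, i.e. row 16 of the full data set). Without a fill value, the first row of a subject can end up as NA instead of FALSE, because there is no previous state to compare against.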

Data

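The third change is required for creating the sample data set.

cbind() returns a matrix, which forces all columns into a single type (character in this case). as.data.frame() then converts these character columns to factor (the default behaviour before R 4.0.0; from R 4.0.0 onwards they stay character). Either way, the columns consisting of numbers are no longer numeric. To avoid this, the sample data set needs to be defined as: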

x <- data.frame(
  ID = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
  Visit = c(1,2,3,1,2,3,2,3,1,2,3,4,5,1,2,3,4,5),
  Time = c(10,12.5,15,2,3.4,5.7,8,9.5,1,5.6,8.9,10,11,2,3.4,6,8,10.5),
  State = c(1,3,4,1,2,3,4,3,2,2,3,5,2,3,5,4,5,5))
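If a data frame has already been built through cbind() and cannot easily be recreated, the numeric columns can be converted back afterwards. The following is a minimal sketch on a shortened, made-up three-row example; the names `m`, `bad` and `num_cols` are illustrative. Going through as.character() first is what makes the conversion safe for factor columns as well as character columns.

```r
# cbind() on mixed vectors yields a character matrix
m <- cbind(c("A", "A", "B"), c(1, 2, 1), c(10, 12.5, 2))
is.character(m)   # TRUE

bad <- as.data.frame(m)
colnames(bad) <- c("ID", "Visit", "Time")

# Convert the affected columns back to numeric.
# as.character() first, so that factor columns yield their labels
# rather than their internal integer codes.
num_cols <- c("Visit", "Time")
bad[num_cols] <- lapply(bad[num_cols], function(col) as.numeric(as.character(col)))

str(bad)
```

Calling as.numeric() directly on a factor would return the level codes, not the original numbers, which is a common source of silent errors.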



Answer 3:


You can use data.table and shift like this:

library(data.table)
setDT(x)[, status := ((State == 5) & (shift(State,1,"lead") != 5)), by = ID]
x
    ID Visit Time State status
 1:  A     1   10     1  FALSE
 2:  A     2 12.5     3  FALSE
 3:  A     3   15     4  FALSE
 4:  B     1    2     1  FALSE
 5:  B     2  3.4     2  FALSE
 6:  B     3  5.7     3  FALSE
 7:  B     2    8     4  FALSE
 8:  B     3  9.5     3  FALSE
 9:  C     1    1     2  FALSE
10:  C     2  5.6     2  FALSE
11:  C     3  8.9     3  FALSE
12:  C     4   10     5   TRUE
13:  C     5   11     2  FALSE
14:  D     1    2     3  FALSE
15:  D     2  3.4     5   TRUE
16:  D     3    6     4  FALSE
17:  D     4    8     5   TRUE
18:  D     5 10.5     5  FALSE



Answer 4:


I'm still unclear what you'd like to do. Aren't rows 12, 15 and 17 the erroneous ones and should be removed?

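# Assumption: tmp is not defined in the post; a per-ID split of x (numeric State, sorted by ID and Visit) is assumed
xs  <- x[order(x$ID, x$Visit), ]
tmp <- split(xs, xs$ID)
do.call(rbind.data.frame, lapply(tmp, function(w) {
    # TRUE for a State 5 row whose next row has an equal or lower State; padded with FALSE for the last row
    idx <- c(diff(w$State) <= 0 & w$State[-nrow(w)] == 5, FALSE)
    w[!idx, ]
}))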
#     ID Visit Time State
#A.1   A     1   10     1
#A.2   A     2 12.5     3
#A.3   A     3   15     4
#B.4   B     1    2     1
#B.5   B     2  3.4     2
#B.7   B     2    8     4
#B.6   B     3  5.7     3
#B.8   B     3  9.5     3
#C.9   C     1    1     2
#C.10  C     2  5.6     2
#C.11  C     3  8.9     3
#C.13  C     5   11     2
#D.14  D     1    2     3
#D.16  D     3    6     4
#D.18  D     5 10.5     5


Source: https://stackoverflow.com/questions/47746656/how-to-remove-inconsistencies-from-dataframe-time-series
