Check time series incongruencies

Submitted by 此生再无相见时 on 2019-12-11 06:18:41

Question


Let's say that we have the following data frame:

# Build the data frame directly; cbind() would coerce every column to character.
x <- data.frame(ID    = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                Visit = c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                Age   = c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42))

The first column holds the subject ID, the second the visit number, and the third the subject's age at each of these consecutive visits.

What would be the easiest way to find visits where the age is inconsistent with the age at the previous visit? (For example, in row 13, subject C is 66 years old, although at the previous visit he was already 84; in row 16, subject D is 32 years old, although at the previous visit he was already 38.)

And how could I highlight the potential errors and remove rows 13 and 16?

I have tried to aggregate by ID and look at the differences between ages across visits, but I find it hard because the error could occur at any visit.


Answer 1:


How about this in base R?

# Per subject: sort by visit, then keep the first row and rows where Age rose.
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w) {
    w <- w[order(w$Visit), ]
    w[c(1, which(diff(w$Age) > 0) + 1), ]
}))
df
#    ID Visit Age
#A.1   A     1  14
#A.2   A     2  28
#A.3   A     3  42
#B.4   B     1  14
#B.5   B     2  46
#B.6   B     3  64
#B.7   B     4  71
#B.8   B     5  85
#C.9   C     1  14
#C.10  C     2  28
#C.11  C     3  51
#C.12  C     4  84
#D.14  D     1  22
#D.15  D     2  38
#D.17  D     4  40
#D.18  D     5  42    

Explanation: We split the data frame on column ID, order each subset by Visit, calculate the differences between successive Age values, and keep the first row plus only those rows where the difference is > 0 (i.e. Age is strictly increasing); rbind-ing the pieces gives the final data frame.
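To highlight the suspicious visits rather than drop them, one option is to add a logical flag column. The following is a minimal base R sketch, assuming Age and Visit are stored as numeric columns; the column name suspect is made up for illustration and is not part of the original answer.

```r
# Assumption: same data as in the question, built with numeric Visit and Age.
x <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1:3, 1:5, 1:5, 1:5),
  Age   = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 84, 66, 22, 38, 32, 40, 42)
)

# Sort so that diff() compares consecutive visits of the same subject.
x <- x[order(x$ID, x$Visit), ]

# ave() applies the function per ID and returns a vector aligned with the rows.
# The logical result is stored as 0/1, so compare with 1 to get TRUE/FALSE back.
# "suspect" is a hypothetical column name.
x$suspect <- ave(x$Age, x$ID, FUN = function(a) c(0, diff(a)) < 0) == 1

# Show only the flagged visits.
x[x$suspect, ]
```

Once the flagged rows have been reviewed, x[!x$suspect, ] would give the cleaned data.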




Answer 2:


You could do it by filtering out, within each ID, the rows where diff(Age) is negative, i.e. where the age is lower than in the previous row. Using the dplyr package:

library(dplyr)

x %>%
  group_by(ID) %>%
  filter(c(0, diff(Age)) >= 0)
# A tibble: 16 x 3
# Groups:   ID [4]
#        ID  Visit    Age
#    <fctr> <fctr> <fctr>
#  1      A      1     14
#  2      A      2     28
#  3      A      3     42
#  4      B      1     14
#  5      B      2     46
#  6      B      3     64
#  7      B      4     71
#  8      B      5     85
#  9      C      1     14
# 10      C      2     28
# 11      C      3     51
# 12      C      4     84
# 13      D      1     22
# 14      D      2     38
# 15      D      4     40
# 16      D      5     42
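A check based on diff() compares each age only with the row immediately before it, so a value that follows a dip is judged against the dip rather than against the last plausible age. As a hedged illustration, the sketch below contrasts that pairwise check with a running-maximum check on a made-up age vector (not taken from the question). It uses base R so that it runs without extra packages.

```r
# Hypothetical ages for one subject: a dip at visit 3 and a partial recovery at visit 4.
age <- c(22, 38, 32, 35, 42)

# Pairwise check: each age is compared with the previous row only.
pairwise_ok <- c(0, diff(age)) >= 0

# Running-maximum check: each age is compared with the highest age seen so far.
running_ok <- age >= cummax(age)

data.frame(age, pairwise_ok, running_ok)
```

Here the pairwise check accepts 35 because it exceeds 32, while the running-maximum check rejects it because it is below 38. Which behaviour is appropriate depends on whether the dip or the earlier high value is considered the error. To apply the running-maximum check per subject, something like ave(x$Age, x$ID, FUN = function(a) a >= cummax(a)) == 1 could be used, assuming a numeric Age column.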



Answer 3:


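The aggregate() approach is pretty concise. It assumes that the rows of x are already sorted by ID (and by Visit within each ID), because the logical vector comes back in group order.
Removing bad rows: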

good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)

x[good,]
#    ID Visit Age
# 1   A     1  14
# 2   A     2  28
# 3   A     3  42
# 4   B     1  14
# 5   B     2  46
# 6   B     3  64
# 7   B     4  71
# 8   B     5  85
# 9   C     1  14
# 10  C     2  28
# 11  C     3  51
# 12  C     4  84
# 14  D     1  22
# 15  D     2  38
# 17  D     4  40
# 18  D     5  42

This will only highlight which groups have an inconsistency:

aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
#   ID   Age
# 1  A  TRUE
# 2  B  TRUE
# 3  C FALSE
# 4  D FALSE
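To see which visits cause a group to be marked FALSE, it can help to put the previous age next to each offending row. The following is a rough base R sketch, assuming numeric Age and Visit columns; PrevAge and report are illustrative names that do not appear in the original answers.

```r
# Assumption: same data as in the question, built with numeric Visit and Age.
x <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1:3, 1:5, 1:5, 1:5),
  Age   = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 84, 66, 22, 38, 32, 40, 42)
)
x <- x[order(x$ID, x$Visit), ]

# Age at the previous visit of the same subject (NA for each subject's first visit).
prev_age <- ave(x$Age, x$ID, FUN = function(a) c(NA, head(a, -1)))

# Keep only visits whose age is lower than at the previous visit.
report <- cbind(x, PrevAge = prev_age)[!is.na(prev_age) & x$Age < prev_age, ]
report
```

For the example data this lists row 13 (subject C, age 66 after 84) and row 16 (subject D, age 32 after 38), matching the rows named in the question.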


Source: https://stackoverflow.com/questions/47723890/check-time-series-incongruencies
