Select last observation from longitudinal data

小蘑菇 2020-12-28 09:19

I have a dataset with several assessments over time for each participant. I want to select the last assessment for each participant. My dataset looks like this:

         


        
6 Answers
  •  难免孤独
    2020-12-28 10:02

    I've been trying to use split and tapply a bit more to become better acquainted with them. I know this question has been answered already, but I thought I'd add another solution using split (pardon the ugliness; I'm more than open to feedback for improvement; I thought there might be a way to use tapply to shorten the code):

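    # Split the data frame into one data frame per participant
    sdf <- with(df, split(df, ID))
    # Within each participant, find the row position of the largest week
    max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
    # Pull that row from each piece and bind the results together
    data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))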
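
A minimal, self-contained sketch of the same split idea, for readers who want to run it. The toy data here is hypothetical (the question's dataset is not shown); only the column names ID, week and outcome are taken from the code in this thread. It uses lapply and do.call(rbind, ...) instead of mapply, which is a tidier variant and not the code that was benchmarked.

```r
# Hypothetical toy data; column names follow the code in this thread.
df <- data.frame(
  ID      = c(1, 1, 1, 2, 2, 3),
  week    = c(1, 2, 3, 1, 4, 2),
  outcome = c(10, 12, 15, 7, 9, 20)
)

# Split into one data frame per participant.
sdf <- split(df, df$ID)

# Keep the row with the largest week within each participant.
last_rows <- lapply(sdf, function(g) g[which.max(g$week), , drop = FALSE])

# Stack the one-row data frames back into a single data frame.
last_obs <- do.call(rbind, last_rows)
rownames(last_obs) <- NULL
last_obs
```

Because each piece stays a regular data frame, the result keeps ordinary atomic columns, which may be easier to work with than the transposed mapply output.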
    

    I also figured that, with 7 answers here, the question was ripe for a benchmark. The results may surprise you (using rbenchmark with R 2.14.1 on a Windows 7 machine):

    # library(rbenchmark)
    # benchmark(
    #     DATA.TABLE= {dt <- data.table(df, key="ID")
    #         dt[, .SD[which.max(outcome),], by=ID]},
    #     DO.CALL={do.call("rbind", 
    #         by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))},
    #     PLYR=ddply(df, .(ID), function(X) X[which.max(X$week), ]),
    #     SPLIT={sdf <-with(df, split(df, ID))
    #         max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
    #         data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))},
    #     MATCH.INDEX=df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ],
    #     AGGREGATE=df[cumsum(aggregate(week ~ ID, df, which.max)$week), ],
    #     #WHICH.MAX.INDEX=df[sapply(unique(df$ID), function(x) which.max(x==df$ID)), ],
    #     BRYANS.INDEX = df[cumsum(as.numeric(lapply(split(df$week, df$ID), 
    #         which.max))), ],
    #     SPLIT2={sdf <-with(df, split(df, ID))
    #         df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))),
    #         ]},
    #     TAPPLY=df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),],
    # columns = c( "test", "replications", "elapsed", "relative", "user.self","sys.self"), 
    # order = "test", replications = 1000, environment = parent.frame())
    
              test replications elapsed  relative user.self sys.self
    6    AGGREGATE         1000    4.49  7.610169      2.84     0.05
    7 BRYANS.INDEX         1000    0.59  1.000000      0.20     0.00
    1   DATA.TABLE         1000   20.28 34.372881     11.98     0.00
    2      DO.CALL         1000    4.67  7.915254      2.95     0.03
    5  MATCH.INDEX         1000    1.07  1.813559      0.51     0.00
    3         PLYR         1000   10.61 17.983051      5.07     0.00
    4        SPLIT         1000    3.12  5.288136      1.81     0.00
    8       SPLIT2         1000    1.56  2.644068      1.28     0.00
    9       TAPPLY         1000    1.08  1.830508      0.88     0.00
    

    Edit 1: I omitted the WHICH.MAX.INDEX solution because it does not return the correct results. I also included an AGGREGATE solution that I wanted to use (compliments of Bryan Goodrich) and an updated version of split, SPLIT2, that uses cumsum (I liked that move).
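
One caveat about the cumsum-based entries (AGGREGATE, BRYANS.INDEX, SPLIT2): they accumulate within-group positions to get global row numbers, so they appear to rely on the rows being sorted by ID with the largest week as the last row of each participant. The sketch below uses hypothetical toy data that breaks that assumption, and compares the cumsum index with an order-independent alternative (idx_safe is an illustrative name, not from the thread).

```r
# Hypothetical toy data, deliberately NOT sorted by week within ID.
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  week = c(3, 1, 2, 1, 2)
)

# cumsum approach (as in BRYANS.INDEX): within-group position of the
# largest week, accumulated across groups.
pos <- as.numeric(lapply(split(df$week, df$ID), which.max))
idx_cumsum <- cumsum(pos)

# Order-independent alternative: split the global row numbers by ID and
# pick the row number whose week is largest within each group.
idx_safe <- unlist(
  lapply(split(seq_len(nrow(df)), df$ID),
         function(rows) rows[which.max(df$week[rows])]),
  use.names = FALSE
)

df[idx_cumsum, ]  # second row picked here belongs to ID 1, not ID 2
df[idx_safe, ]    # one row per ID, each with that ID's largest week
```

If the data is already sorted by ID and week, both indices agree, which is presumably the situation the benchmark was run in.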

    Edit 2: Dason also chimed in with a tapply solution, which I added to the test; it fared pretty well too.
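
The tapply entry takes the last row position of each ID, so it returns the latest assessment only if the rows are ordered by week within each participant. A minimal sketch, assuming hypothetical toy data in arbitrary order and an explicit sort step that is not part of the benchmarked code:

```r
# Hypothetical toy data with rows in arbitrary order.
df <- data.frame(
  ID      = c(2, 1, 1, 2, 1),
  week    = c(4, 2, 1, 1, 3),
  outcome = c(9, 12, 10, 7, 15)
)

# Sort by participant, then by week, so that the last row of each ID
# is that participant's latest assessment.
df <- df[order(df$ID, df$week), ]

# For each ID, take the position of its last row (same idea as TAPPLY).
last_idx <- tapply(seq_along(df$ID), df$ID, function(x) tail(x, 1))
last_obs <- df[last_idx, ]
last_obs
```

Sorting first costs extra time, so it is only worth adding when the input order is not guaranteed.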
