Subtract previous year's value from each grouped row in data frame

遥遥无期 2020-12-29 15:46

I am trying to calculate the lagged difference (or actual increase) for data that has been inadvertently aggregated. Each successive year in the data includes values from t

3 answers
  •  一向 (original poster)
    2020-12-29 16:11

    I think this will work for you. When you run into the diff problem just lengthen the vector by putting 0 in as the first number.

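    df <- df[order(df$id, df$year), ]        # sort by id, then by year
    sdf <- split(df, df$id)                  # one data frame per id
    # per-id differences with a leading 0 (column 2 is value; assumes equal-sized groups)
    df$actual <- as.vector(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][, 2]))))
    df[order(as.numeric(rownames(df))), ]    # restore the original row order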
    

    There are lots of ways to do this, but this one is fairly fast and uses only base R.
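
    Prepending 0 matters because diff() returns one element fewer than its input, so its result cannot be assigned straight back to a group's rows. A minimal sketch, using the three cumulative values that id 1 has in the output further down (the vector name cumulative is only illustrative):

    ```r
    # Cumulative totals for one id, ordered by year
    cumulative <- c(6, 16, 21)

    # Plain diff() yields 2 values for 3 inputs
    short <- diff(cumulative)

    # Prepending 0 keeps the length; the first year counts as its own increase
    diff2 <- function(x) diff(c(0, x))
    actual <- diff2(cumulative)
    actual
    ```

    Here short has length 2, while actual has the same length as the input and can be stored next to it.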

    Here are a second and a third way of approaching this problem, using aggregate and by, followed by a fourth using plyr:

    aggregate:

    df <- df[order(df$id, df$year), ]
    diff2 <- function(x) diff(c(0, x))
    df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
    df[order(as.numeric(rownames(df))),]
    

    by:

    df <- df[order(df$id, df$year), ]
    diff2 <- function(x) diff(c(0, x))
    df$actual <- unlist(by(df$value, df$id, diff2))
    df[order(as.numeric(rownames(df))),]
    

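    plyr:

    library(plyr)
    df <- df[order(df$id, df$year), ]
    df <- data.frame(temp=1:nrow(df), df)    # temp records the sorted row position
    diff2 <- function(x) diff(c(0, x))
    df <- ddply(df, .(id), transform, actual=diff2(value))
    df[order(-df$year, df$temp),][, -1]      # reorder, then drop temp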
    

    It gives you the final product of:

    > df[order(as.numeric(rownames(df))),]
       id value year actual
    1   1    21    3      5
    2   2    26    3     16
    3   3    26    3     14
    4   4    26    3     10
    5   5    29    3     14
    6   1    16    2     10
    7   2    10    2      5
    8   3    12    2     10
    9   4    16    2      7
    10  5    15    2     13
    11  1     6    1      6
    12  2     5    1      5
    13  3     2    1      2
    14  4     9    1      9
    15  5     2    1      2
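
    To reproduce this table, the sketch below rebuilds the data frame from the values printed above and recomputes actual. It uses base R's ave(), which is a swapped-in alternative and not one of the approaches in this answer; ave() puts each group's result back in the rows' existing positions. It assumes the default numeric row names.

    ```r
    # Data copied from the printed output (rows 1-15, without 'actual')
    df <- data.frame(
      id    = rep(1:5, times = 3),
      value = c(21, 26, 26, 26, 29,
                16, 10, 12, 16, 15,
                 6,  5,  2,  9,  2),
      year  = rep(3:1, each = 5)
    )

    # Sort so values are in chronological order within each id
    df <- df[order(df$id, df$year), ]

    # Per-id difference with a leading 0
    df$actual <- ave(df$value, df$id, FUN = function(x) diff(c(0, x)))

    # Restore the original row order
    df <- df[order(as.numeric(rownames(df))), ]
    df
    ```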
    

    EDIT: Avoiding the Loop

    May I suggest avoiding the loop: turn what I gave you into a function (the by solution is the easiest one for me to work with) and sapply it over the two columns you want.

    set.seed(1234)  # make new data with another numeric column
    x <- data.frame(id=1:5, value=sample(20:30, 5, replace=TRUE), year=3)
    y <- data.frame(id=1:5, value=sample(10:19, 5, replace=TRUE), year=2)
    z <- data.frame(id=1:5, value=sample(0:9, 5, replace=TRUE), year=1)
    df <- rbind(x, y, z)
    df <- df.rep <- data.frame(df[, 1:2], new.var=df[, 2]+sample(1:5, nrow(df),
              replace=TRUE), year=df[, 3])
    
    
    df <- df[order(df$id, df$year), ]
    diff2 <- function(x) diff(c(0, x))                   #function one
    group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
    df <- data.frame(df, sapply(df[, 2:3], group.diff))  #apply group.diff to col 2:3
    df[order(as.numeric(rownames(df))),]                 #reorder it
    

    Of course you'd have to rename the new columns unless you used transform, as in:

    df <- df[order(df$id, df$year), ]
    diff2 <- function(x) diff(c(0, x))                   #function one
    group.diff<- function(x) unlist(by(x, df$id, diff2)) #answer turned function
    df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))   
    df[order(as.numeric(rownames(df))),]
    

    Which of the two is more convenient depends on how many variables you are doing this to.
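
    If there are many such columns, the same idea can be wrapped in a helper that takes the column names as an argument. This is only a sketch: add_group_diffs, its arguments, the ".actual" suffix and the toy data are all hypothetical, it uses ave() rather than by, and it assumes the default numeric row names.

    ```r
    # Hypothetical helper (not from the original answer): adds one
    # "<column>.actual" column for every name in cols
    add_group_diffs <- function(data, cols, id = "id", year = "year") {
      diff2 <- function(x) diff(c(0, x))
      # chronological order within each id
      data <- data[order(data[[id]], data[[year]]), ]
      data[paste0(cols, ".actual")] <-
        lapply(data[cols], function(v) ave(v, data[[id]], FUN = diff2))
      # back to the original row order
      data[order(as.numeric(rownames(data))), ]
    }

    # Made-up example: two ids, two cumulative columns
    toy <- data.frame(id = c(1, 2, 1, 2), year = c(2, 2, 1, 1),
                      value = c(16, 10, 6, 5), new.var = c(20, 13, 8, 6))
    result <- add_group_diffs(toy, c("value", "new.var"))
    result
    ```

    The new columns get predictable names, so no renaming step is needed afterwards.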
