sum multiple columns by group with tapply

北荒 2020-12-15 23:03

I wanted to sum individual columns by group and my first thought was to use tapply. However, I cannot get tapply to work. Can tapply

3 Answers
  • 2020-12-15 23:45

    I looked at the source code for by, as EDi suggested. That code was substantially more complex than my one-line change to tapply. I have since found that my.tapply does not handle the more complex scenario below, where apples and cherries are summed by state and county. If I get my.tapply to work for this case, I can post the code here later:

    df.2 <- read.table(text = '
    
        state   county   apples   cherries   plums
           AA        1        1          2       3
           AA        1        1          2       3
           AA        2       10         20      30
           AA        2       10         20      30
           AA        3      100        200     300
           AA        3      100        200     300
    
           BB        7       -1         -2      -3
           BB        7       -1         -2      -3
           BB        8      -10        -20     -30
           BB        8      -10        -20     -30
           BB        9     -100       -200    -300
           BB        9     -100       -200    -300
    
    ', header = TRUE, stringsAsFactors = FALSE)
    
    # Single column (apples): my.tapply gives the same result as tapply
    
       tapply(df.2$apples  , list(df.2$state, df.2$county), function(x) {sum(x)})
    my.tapply(df.2$apples  , list(df.2$state, df.2$county), function(x) {sum(x)})
    
    # Single column (cherries): my.tapply gives the same result as tapply
    
       tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})
    my.tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})
    
    # Two columns at once (apples and cherries): my.tapply does not work
    
    my.tapply(df.2[,3:4], list(df.2$state, df.2$county), function(x) {colSums(x)})
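
    One possible way around the two-column case, offered as a sketch and not as a fix to my.tapply (whose code is not shown here), is to keep tapply on single columns and loop over the columns with lapply. Base R's aggregate with a cbind formula is a different technique that sums several columns by state and county in one call. The names per.col and agg are made up for illustration.

    ```r
    # Same data as df.2 above
    df.2 <- read.table(text = '
        state   county   apples   cherries   plums
           AA        1        1          2       3
           AA        1        1          2       3
           AA        2       10         20      30
           AA        2       10         20      30
           AA        3      100        200     300
           AA        3      100        200     300
           BB        7       -1         -2      -3
           BB        7       -1         -2      -3
           BB        8      -10        -20     -30
           BB        8      -10        -20     -30
           BB        9     -100       -200    -300
           BB        9     -100       -200    -300
    ', header = TRUE, stringsAsFactors = FALSE)

    # Option 1: one tapply call per column; the result is a list holding
    # one state-by-county table per column (NA where a combination is absent)
    per.col <- lapply(df.2[, c("apples", "cherries")], function(col) {
      tapply(col, list(df.2$state, df.2$county), sum)
    })
    per.col$apples

    # Option 2: aggregate() returns one row per observed state/county pair
    agg <- aggregate(cbind(apples, cherries) ~ state + county,
                     data = df.2, FUN = sum)
    agg
    ```

    Option 1 keeps the wide state-by-county layout that tapply produces, while Option 2 gives a long data.frame that is usually easier to merge or plot.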
    
  • 2020-12-15 23:49

    You're looking for by. It uses the INDEX in the way that you assumed tapply would, by row.

    by(df.1, df.1$state, function(x) colSums(x[,3:5]))
    

    The problem with your use of tapply is that it indexed the data.frame by column, because a data.frame is really just a list of columns. So tapply complained that your index didn't match the length of your data.frame, which is 5 (its number of columns, not its number of rows).
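
    To make the length mismatch concrete, here is a small sketch. The original df.1 is not shown in this thread, so the data frame df below is a hypothetical stand-in with the same five column names; its values are assumptions chosen only for illustration.

    ```r
    # Hypothetical stand-in for df.1 (assumed values, not the original data)
    df <- data.frame(
      state    = c("AA", "AA", "AA", "BB", "BB", "BB"),
      county   = c(1, 2, 3, 7, 8, 9),
      apples   = c(1, 10, 100, -1, -10, -100),
      cherries = c(2, 20, 200, -2, -20, -200),
      plums    = c(3, 30, 300, -3, -30, -300),
      stringsAsFactors = FALSE
    )

    length(df)        # 5: the length of a data.frame is its number of columns
    nrow(df)          # 6: the number of rows
    length(df$state)  # 6: the grouping vector has one entry per row

    # by() splits the rows by state and applies the function to each piece
    by.state <- by(df[, 3:5], df$state, colSums)

    # Bind the per-state results into one matrix (one row per state)
    sums <- do.call(rbind, by.state)
    sums
    ```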

  • 2020-12-16 00:06

    tapply works on a vector. For a data.frame you can use by (which is a wrapper for tapply; take a look at the code):

    > by(df.1[,c(3:5)], df.1$state, FUN=colSums)
    df.1$state: AA
      apples cherries    plums 
         111      222      333 
    ------------------------------------------------------------------------------------- 
    df.1$state: BB
      apples cherries    plums 
        -111     -222     -333 
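
    If you would still like to stay with tapply, two sketches that respect its vector-only input are shown below. The first applies tapply to each column in turn; the second passes a vector of row numbers to tapply and subsets the data.frame inside the function, which is roughly the idea behind by. The data frame df is a hypothetical stand-in for df.1 with assumed values, and col.sums and row.idx are names made up for illustration.

    ```r
    # Hypothetical stand-in for df.1 (assumed values, not the original data)
    df <- data.frame(
      state    = c("AA", "AA", "AA", "BB", "BB", "BB"),
      county   = c(1, 2, 3, 7, 8, 9),
      apples   = c(1, 10, 100, -1, -10, -100),
      cherries = c(2, 20, 200, -2, -20, -200),
      plums    = c(3, 30, 300, -3, -30, -300),
      stringsAsFactors = FALSE
    )

    # Sketch 1: one tapply call per column, combined by sapply into a
    # matrix with one row per state and one column per fruit
    col.sums <- sapply(df[, 3:5], function(col) tapply(col, df$state, sum))
    col.sums

    # Sketch 2: group the row numbers, then sum the matching rows
    row.idx <- tapply(seq_len(nrow(df)), df$state,
                      function(i) colSums(df[i, 3:5]))
    row.idx[["AA"]]
    ```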
    