For each group, summarise means for all variables in a dataframe (ddply? split?)

难免孤独 2020-12-13 16:00

A week ago I would have done this manually: subset the dataframe by group into new dataframes, compute the means of each variable for each of those dataframes, then rbind the results. Very clunky ...

6 Answers
  •  执念已碎
    2020-12-13 16:36

    You can do this with by(). First set up some data:

    R> set.seed(42)
    R> testdf <- data.frame(var1=rnorm(100), var2=rnorm(100,2), var3=rnorm(100,3),
                            group=as.factor(sample(letters[1:10],100,replace=TRUE)),
                            year=as.factor(sample(c(2007,2009),100,replace=TRUE)))
    R> summary(testdf)
          var1              var2              var3          group      year   
     Min.   :-2.9931   Min.   :-0.0247   Min.   :0.30   e      :15   2007:50  
     1st Qu.:-0.6167   1st Qu.: 1.4085   1st Qu.:2.29   c      :14   2009:50  
     Median : 0.0898   Median : 1.9307   Median :2.98   f      :12            
     Mean   : 0.0325   Mean   : 1.9125   Mean   :2.99   h      :12            
     3rd Qu.: 0.6616   3rd Qu.: 2.4618   3rd Qu.:3.65   d      :11            
     Max.   : 2.2866   Max.   : 4.7019   Max.   :5.46   b      :10            
                                                        (Other):26  
    

    Use by():

    R> by(testdf[,1:3], testdf$year, colMeans)  # colMeans() gives one mean per column
    testdf$year: 2007
       var1    var2    var3 
    0.04681 1.77638 3.00122 
    --------------------------------------------------------------------- 
    testdf$year: 2009
       var1    var2    var3 
    0.01822 2.04865 2.97805 
    R> by(testdf[,1:3], list(testdf$group, testdf$year), colMeans)
    ## longer answer by group and year suppressed
    

    You still need to reshape this into a table, but it gives you the gist of the answer in one line.
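
A minimal sketch of the same call in current R, under stated assumptions. Since R 3.0.0, mean() applied to a data frame returns NA with a warning, so the sketch passes colMeans() as the function instead. The data frame `toy` is a hypothetical, deterministic stand-in for testdf (five rows for every group/year combination), not the data from this answer.

```r
# Hypothetical toy data: 10 groups x 2 years, five rows per combination.
set.seed(1)
toy <- data.frame(var1 = rnorm(100), var2 = rnorm(100, 2), var3 = rnorm(100, 3),
                  group = factor(rep(letters[1:10], each = 10)),
                  year  = factor(rep(c(2007, 2009), times = 50)))

# by() splits the three numeric columns by year and applies colMeans()
# to each piece, giving one named vector of means per year.
by_year <- by(toy[, 1:3], toy$year, colMeans)
by_year
```

Each element of `by_year` is a named numeric vector (var1, var2, var3), which is what lets the pieces be stacked with rbind later on.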

    Edit: The result can be processed further:

    R> foo <- by(testdf[,1:3], list(testdf$group, testdf$year), colMeans)
    R> do.call(rbind, foo)
              var1   var2  var3
     [1,]  0.62352 0.2549 3.157
     [2,]  0.08867 1.8313 3.607
     [3,] -0.69093 2.5431 3.094
     [4,]  0.02792 2.8068 3.181
     [5,] -0.26423 1.3269 2.781
     [6,]  0.07119 1.9453 3.284
     [7,] -0.10438 2.1181 3.783
     [8,]  0.21147 1.6345 2.470
     [9,]  1.17986 1.6518 2.362
    [10,] -0.42708 1.5683 3.144
    [11,] -0.82681 1.9528 2.740
    [12,] -0.27191 1.8333 3.090
    [13,]  0.15854 2.2830 2.949
    [14,]  0.16438 2.2455 3.100
    [15,]  0.07489 2.1798 2.451
    [16,] -0.03479 1.6800 3.099
    [17,]  0.48082 1.8883 2.569
    [18,]  0.32381 2.4015 3.332
    [19,] -0.47319 1.5016 2.903
    [20,]  0.11743 2.2645 3.452
    R> do.call(rbind, dimnames(foo))
         [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]   [,10] 
    [1,] "a"    "b"    "c"    "d"    "e"    "f"    "g"    "h"    "i"    "j"   
    [2,] "2007" "2009" "2007" "2009" "2007" "2009" "2007" "2009" "2007" "2009"
    

    You can play with the dimnames some more:

    R> expand.grid(dimnames(foo))
       Var1 Var2
    1     a 2007
    2     b 2007
    3     c 2007
    4     d 2007
    5     e 2007
    6     f 2007
    7     g 2007
    8     h 2007
    9     i 2007
    10    j 2007
    11    a 2009
    12    b 2009
    13    c 2009
    14    d 2009
    15    e 2009
    16    f 2009
    17    g 2009
    18    h 2009
    19    i 2009
    20    j 2009
    R> 
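
The pairing of expand.grid(dimnames(...)) with do.call(rbind, ...) relies on both running through the group/year cells in the same order, with the first factor varying fastest. The sketch below spot-checks that on hypothetical data (`toy`, `cell_means`, `keys` and `means` are illustrative names, and the named list passed to by() is my addition so the key columns come out as group and year rather than Var1 and Var2). One caveat, as far as I can tell: if a group/year combination has no rows, by() leaves that cell NULL and rbind skips it, so the two pieces would no longer line up.

```r
# Hypothetical toy data: every group/year combination has five rows.
set.seed(1)
toy <- data.frame(var1 = rnorm(100), var2 = rnorm(100, 2), var3 = rnorm(100, 3),
                  group = factor(rep(letters[1:10], each = 10)),
                  year  = factor(rep(c(2007, 2009), times = 50)))

cell_means <- by(toy[, 1:3], list(group = toy$group, year = toy$year), colMeans)
keys  <- expand.grid(dimnames(cell_means))  # one row per cell, group varies fastest
means <- do.call(rbind, cell_means)         # one row of means per cell, same order

# Spot-check a single row against a direct computation on the matching subset.
i <- 13
rows <- toy$group == as.character(keys$group[i]) &
        toy$year  == as.character(keys$year[i])
all.equal(means[i, ], colMeans(toy[rows, 1:3]))
```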
    

    Edit: And with that, we can build a data.frame for the result using only base R, with no external packages:

    R> data.frame(cbind(expand.grid(dimnames(foo)), do.call(rbind, foo)))
       Var1 Var2     var1   var2  var3
    1     a 2007  0.62352 0.2549 3.157
    2     b 2007  0.08867 1.8313 3.607
    3     c 2007 -0.69093 2.5431 3.094
    4     d 2007  0.02792 2.8068 3.181
    5     e 2007 -0.26423 1.3269 2.781
    6     f 2007  0.07119 1.9453 3.284
    7     g 2007 -0.10438 2.1181 3.783
    8     h 2007  0.21147 1.6345 2.470
    9     i 2007  1.17986 1.6518 2.362
    10    j 2007 -0.42708 1.5683 3.144
    11    a 2009 -0.82681 1.9528 2.740
    12    b 2009 -0.27191 1.8333 3.090
    13    c 2009  0.15854 2.2830 2.949
    14    d 2009  0.16438 2.2455 3.100
    15    e 2009  0.07489 2.1798 2.451
    16    f 2009 -0.03479 1.6800 3.099
    17    g 2009  0.48082 1.8883 2.569
    18    h 2009  0.32381 2.4015 3.332
    19    i 2009 -0.47319 1.5016 2.903
    20    j 2009  0.11743 2.2645 3.452
    R> 
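
As an alternative to the by() route, and not part of the steps above, base R's aggregate() can return the grouped means as a data frame in a single call. A hedged sketch on hypothetical data (`toy` and `agg` are illustrative names):

```r
# Hypothetical toy data: 10 groups x 2 years, five rows per combination.
set.seed(1)
toy <- data.frame(var1 = rnorm(100), var2 = rnorm(100, 2), var3 = rnorm(100, 3),
                  group = factor(rep(letters[1:10], each = 10)),
                  year  = factor(rep(c(2007, 2009), times = 50)))

# Formula interface: the columns on the left are averaged within each
# combination of the grouping variables on the right.
agg <- aggregate(cbind(var1, var2, var3) ~ group + year, data = toy, FUN = mean)
head(agg)
```

The result keeps the grouping columns under their own names (group, year) and only contains combinations that actually occur in the data, so no separate step is needed to attach the keys.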
    
