How to calculate mean/median per group in a data frame in R [duplicate]

This question already has an answer here:

Mean per group in a data.frame [duplicate] 8 answers

I have a data frame recording in detail how much money each customer spends, like the following:

custid, value
1,  1
1,  3
1,  2
1,  5
1,  4
1,  1
2,  1
2,  10
3,  1
3,  2
3,  5

How can I calculate per-customer characteristics such as mean, max, min, median, std, etc., like the following? Should I use one of the apply functions? If so, how?

custid, mean, max,min,median,std
1,  ....
2,....
3,....
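
For anyone who wants to reproduce the answers below, here is a minimal sketch that builds the sample data. The names mydf and dat are assumptions borrowed from the answers (which use both); the question itself does not name the data frame.

```r
# Sample data from the question: one row per purchase.
mydf <- data.frame(
  custid = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  value  = c(1, 3, 2, 5, 4, 1, 1, 10, 1, 2, 5)
)

# Some answers call the same data `dat`; build it as a separate object
# so that by-reference tools such as data.table::setDT() leave mydf alone.
dat <- data.frame(custid = mydf$custid, value = mydf$value)
```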

To add to the alternatives, here's summaryBy from the "doBy" package, with which you can specify a list of functions to apply.

library(doBy)
summaryBy(value ~ custid, data = mydf, 
          FUN = list(mean, max, min, median, sd))
#   custid value.mean value.max value.min value.median value.sd
# 1      1   2.666667         5         1          2.5 1.632993
# 2      2   5.500000        10         1          5.5 6.363961
# 3      3   2.666667         5         1          2.0 2.081666

Of course, you can also stick with base R:

myFun <- function(x) {
  c(min = min(x), max = max(x), 
    mean = mean(x), median = median(x), 
    std = sd(x))
}

tapply(mydf$value, mydf$custid, myFun)
# $`1`
#      min      max     mean   median      std 
# 1.000000 5.000000 2.666667 2.500000 1.632993 
# 
# $`2`
#       min       max      mean    median       std 
#  1.000000 10.000000  5.500000  5.500000  6.363961 
# 
# $`3`
#      min      max     mean   median      std 
# 1.000000 5.000000 2.666667 2.000000 2.081666 

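# tapply() returns groups in sorted order, so sort the ids to keep
# the labels aligned even if mydf is not ordered by custid
cbind(custid = sort(unique(mydf$custid)), 
      do.call(rbind, tapply(mydf$value, mydf$custid, myFun)))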
#   custid min max     mean median      std
# 1      1   1   5 2.666667    2.5 1.632993
# 2      2   1  10 5.500000    5.5 6.363961
# 3      3   1   5 2.666667    2.0 2.081666

library(dplyr)
dat %>%
  group_by(custid) %>%
  summarise(Mean = mean(value), Max = max(value), Min = min(value),
            Median = median(value), Std = sd(value))
#  custid     Mean Max Min Median      Std
#1      1 2.666667   5   1    2.5 1.632993
#2      2 5.500000  10   1    5.5 6.363961
#3      3 2.666667   5   1    2.0 2.081666

For bigger datasets, data.table is usually faster:

library(data.table)
setDT(dat)[, list(Mean = mean(value), Max = max(value), Min = min(value),
                  Median = as.numeric(median(value)), Std = sd(value)),
           by = custid]
#   custid     Mean Max Min Median      Std
#1:      1 2.666667   5   1    2.5 1.632993
#2:      2 5.500000  10   1    5.5 6.363961
#3:      3 2.666667   5   1    2.0 2.081666

If you want to apply several functions to one or more columns with dplyr, I recommend summarise_each or mutate_each:

require(dplyr)
dat %>%
  group_by(custid) %>%
  summarise_each(funs(max, min, mean, median, sd), value)
#Source: local data frame [3 x 6]
#
#  custid max min     mean median       sd
#1      1   5   1 2.666667    2.5 1.632993
#2      2  10   1 5.500000    5.5 6.363961
#3      3   5   1 2.666667    2.0 2.081666
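
In recent dplyr releases summarise_each() and funs() are deprecated, and across() is the documented replacement. Here is a minimal sketch of the same summary, assuming dplyr 1.0.0 or later; the sample data is rebuilt as dat so the snippet runs on its own.

```r
library(dplyr)

# Sample data from the question, rebuilt so the snippet is self-contained.
dat <- data.frame(
  custid = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  value  = c(1, 3, 2, 5, 4, 1, 1, 10, 1, 2, 5)
)

# Pass a named list of functions to across(); by default the result
# columns are named <column>_<function name>, e.g. value_max.
res <- dat %>%
  group_by(custid) %>%
  summarise(across(value, list(max = max, min = min, mean = mean,
                               median = median, sd = sd)))
res
```

To summarise more columns, widen the first argument of across(), for example across(c(value, other_col), ...), where other_col is a hypothetical second column.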

Or another option with base R's aggregate:

aggregate(value ~ custid, data = dat, summary)
#  custid value.Min. value.1st Qu. value.Median value.Mean value.3rd Qu. value.Max.
#1      1      1.000         1.250        2.500      2.667         3.750      5.000
#2      2      1.000         3.250        5.500      5.500         7.750     10.000
#3      3      1.000         1.500        2.000      2.667         3.500      5.000

(This doesn't include standard deviation but I think it's a nice approach for the other descriptive stats.)
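
If the standard deviation is needed as well, one option is to pass a custom function to aggregate() instead of summary. This is a sketch, with the sample data rebuilt as dat and a helper named stats that is purely illustrative. aggregate() stores the results in a single matrix column, so do.call(data.frame, ...) is used to flatten it into ordinary columns.

```r
# Sample data from the question, rebuilt so the snippet is self-contained.
dat <- data.frame(
  custid = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  value  = c(1, 3, 2, 5, 4, 1, 1, 10, 1, 2, 5)
)

# Illustrative helper: returns a named vector of the statistics we want.
stats <- function(x) c(mean = mean(x), median = median(x), sd = sd(x))

# One row per custid; `value` is a matrix column with one column per statistic.
res <- aggregate(value ~ custid, data = dat, FUN = stats)

# Flatten the matrix column into regular columns (value.mean, value.median, value.sd).
flat <- do.call(data.frame, res)
flat
```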

I like describeBy() from the psych package, like this:

df <- structure(list(custid. = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L), value = c(1L, 3L, 2L, 5L, 4L, 1L, 1L, 10L, 1L, 2L, 5L
)), .Names = c("custid.", "value"), class = "data.frame", row.names = c(NA, 
-11L))
df
       custid. value
1        1     1
2        1     3
3        1     2
4        1     5
5        1     4
6        1     1
7        2     1
8        2    10
9        3     1
10       3     2
11       3     5
# install.packages(c("psych"), dependencies = TRUE)
require(psych)

 describeBy(df$value, df$custid.)
group: 1
  vars n mean   sd median trimmed  mad min max range skew kurtosis   se
1    1 6 2.67 1.63    2.5    2.67 2.22   1   5     4 0.21    -1.86 0.67
----------------------------------------------------------------------- 
group: 2
  vars n mean   sd median trimmed  mad min max range skew kurtosis  se
1    1 2  5.5 6.36    5.5     5.5 6.67   1  10     9    0    -2.75 4.5
----------------------------------------------------------------------- 
group: 3
  vars n mean   sd median trimmed  mad min max range skew kurtosis  se
1    1 3 2.67 2.08      2    2.67 1.48   1   5     4 0.29    -2.33 1.2

Or get it as a matrix, if you prefer:

 describeBy(df$value, df$custid., mat = TRUE, skew = FALSE)
   item group1 vars n     mean       sd median min max range        se
11    1      1    1 6 2.666667 1.632993    2.5   1   5     4 0.6666667
12    2      2    1 2 5.500000 6.363961    5.5   1  10     9 4.5000000
13    3      3    1 3 2.666667 2.081666    2.0   1   5     4 1.2018504

You can use the plyr package, which implements the split-apply-combine strategy. The general form is:

ddply(dataframe, .(groupcol), function)

In your case:

library(plyr)
ddply(dataframe, .(custid), summarize, mean = mean(value), median = median(value))

Take a look at the help page for ddply (?ddply); it has a good example.

Source: https://stackoverflow.com/questions/25198442/how-to-calculate-mean-median-per-group-in-a-dataframe-in-r

Tags: mean, median