I have a data frame recording, in detail, how much money each customer spends, like the following:
custid, value
1, 1
1, 3
1, 2
1, 5
1, 4
1, 1
2, 1
2, 10
3, 1
3, 2
3, 5
How can I calculate per-customer characteristics such as mean, max, median, std, etc., like the following? Should I use some apply function? If so, how?
custid, mean, max,min,median,std
1, ....
2,....
3,....
To add to the alternatives, here's summaryBy from the "doBy" package, with which you can specify a list of functions to apply.
library(doBy)
summaryBy(value ~ custid, data = mydf,
FUN = list(mean, max, min, median, sd))
# custid value.mean value.max value.min value.median value.sd
# 1 1 2.666667 5 1 2.5 1.632993
# 2 2 5.500000 10 1 5.5 6.363961
# 3 3 2.666667 5 1 2.0 2.081666
Of course, you can also stick with base R:
myFun <- function(x) {
c(min = min(x), max = max(x),
mean = mean(x), median = median(x),
std = sd(x))
}
tapply(mydf$value, mydf$custid, myFun)
# $`1`
# min max mean median std
# 1.000000 5.000000 2.666667 2.500000 1.632993
#
# $`2`
# min max mean median std
# 1.000000 10.000000 5.500000 5.500000 6.363961
#
# $`3`
# min max mean median std
# 1.000000 5.000000 2.666667 2.000000 2.081666
cbind(custid = unique(mydf$custid),
do.call(rbind, tapply(mydf$value, mydf$custid, myFun)))
# custid min max mean median std
# 1 1 1 5 2.666667 2.5 1.632993
# 2 2 1 10 5.500000 5.5 6.363961
# 3 3 1 5 2.666667 2.0 2.081666
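One caveat with the cbind() call above: tapply() orders its result by the sorted group values, while unique() returns ids in order of first appearance, so the two only line up when the data is already sorted by custid. A minimal sketch that takes the ids from the row names of the combined result instead (the shuffled data frame and the name `out` are made up for illustration):

```r
# Same helper as above, repeated so the snippet is self-contained
myFun <- function(x) {
  c(min = min(x), max = max(x),
    mean = mean(x), median = median(x),
    std = sd(x))
}

# Illustrative data: the question's values, but with custid 3 listed first
mydf <- data.frame(
  custid = c(3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2),
  value  = c(1, 2, 5, 1, 3, 2, 5, 4, 1, 1, 10)
)

# One row per group; the row names carry the group ids in sorted order
res <- do.call(rbind, tapply(mydf$value, mydf$custid, myFun))

# Take custid from the row names rather than from unique(mydf$custid)
out <- data.frame(custid = as.numeric(rownames(res)), res)
out
```

Here unique(mydf$custid) would give 3, 1, 2, which would label the rows incorrectly; the row names give 1, 2, 3.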
library(dplyr)
dat%>%
group_by(custid)%>%
summarise(Mean=mean(value), Max=max(value), Min=min(value), Median=median(value), Std=sd(value))
# custid Mean Max Min Median Std
#1 1 2.666667 5 1 2.5 1.632993
#2 2 5.500000 10 1 5.5 6.363961
#3 3 2.666667 5 1 2.0 2.081666
For bigger datasets, data.table would be faster:
library(data.table)
setDT(dat)[, list(Mean=mean(value), Max=max(value), Min=min(value), Median=as.numeric(median(value)), Std=sd(value)), by=custid]
# custid Mean Max Min Median Std
#1: 1 2.666667 5 1 2.5 1.632993
#2: 2 5.500000 10 1 5.5 6.363961
#3: 3 2.666667 5 1 2.0 2.081666
If you want to apply a larger number of functions to all or the same column(s) with dplyr, I recommend summarise_each or mutate_each:
require(dplyr)
dat %>%
group_by(custid) %>%
summarise_each(funs(max, min, mean, median, sd), value)
#Source: local data frame [3 x 6]
#
# custid max min mean median sd
#1 1 5 1 2.666667 2.5 1.632993
#2 2 10 1 5.500000 5.5 6.363961
#3 3 5 1 2.666667 2.0 2.081666
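Note that summarise_each() and funs() are deprecated in current dplyr releases. A rough equivalent, assuming dplyr 1.0.0 or later is installed, uses across() with a named list of functions (the name `res` is only for illustration):

```r
library(dplyr)

# Example data from the question
dat <- data.frame(
  custid = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  value  = c(1, 3, 2, 5, 4, 1, 1, 10, 1, 2, 5)
)

# across() applies every function in the named list to the selected column(s);
# by default the result columns are named <column>_<function>
res <- dat %>%
  group_by(custid) %>%
  summarise(across(value, list(max = max, min = min, mean = mean,
                               median = median, sd = sd)))
res
```

The result columns come out as value_max, value_min, and so on; selecting more columns in across() applies the same list of functions to each of them.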
Or another option with base R's aggregate:
aggregate(value ~ custid, data = dat, summary)
# custid value.Min. value.1st Qu. value.Median value.Mean value.3rd Qu. value.Max.
#1 1 1.000 1.250 2.500 2.667 3.750 5.000
#2 2 1.000 3.250 5.500 5.500 7.750 10.000
#3 3 1.000 1.500 2.000 2.667 3.500 5.000
(This doesn't include standard deviation but I think it's a nice approach for the other descriptive stats.)
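If the standard deviation is needed as well, one possibility is to pass aggregate() a custom function instead of summary. A minimal sketch, where the helper `stats` and the name `res` are made up for illustration:

```r
# Example data from the question
dat <- data.frame(
  custid = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  value  = c(1, 3, 2, 5, 4, 1, 1, 10, 1, 2, 5)
)

# Hypothetical helper returning a named vector of the wanted statistics
stats <- function(x) {
  c(mean = mean(x), max = max(x), min = min(x),
    median = median(x), sd = sd(x))
}

res <- aggregate(value ~ custid, data = dat, FUN = stats)

# aggregate() stores the statistics as a single matrix column;
# do.call(data.frame, ...) spreads it into ordinary columns
res <- do.call(data.frame, res)
res
```

After flattening, the columns are named value.mean, value.max, value.min, value.median and value.sd.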
I like describeBy() from the psych package. Like this:
df <- structure(list(custid. = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L), value = c(1L, 3L, 2L, 5L, 4L, 1L, 1L, 10L, 1L, 2L, 5L
)), .Names = c("custid.", "value"), class = "data.frame", row.names = c(NA,
-11L))
df
custid. value
1 1 1
2 1 3
3 1 2
4 1 5
5 1 4
6 1 1
7 2 1
8 2 10
9 3 1
10 3 2
11 3 5
# install.packages(c("psych"), dependencies = TRUE)
require(psych)
describeBy(df$value, df$custid.)
group: 1
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 6 2.67 1.63 2.5 2.67 2.22 1 5 4 0.21 -1.86 0.67
-----------------------------------------------------------------------
group: 2
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 2 5.5 6.36 5.5 5.5 6.67 1 10 9 0 -2.75 4.5
-----------------------------------------------------------------------
group: 3
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 3 2.67 2.08 2 2.67 1.48 1 5 4 0.29 -2.33 1.2
Or get it as a matrix if you prefer that:
describeBy(df$value, df$custid., mat = TRUE, skew = FALSE)
item group1 vars n mean sd median min max range se
11 1 1 1 6 2.666667 1.632993 2.5 1 5 4 0.6666667
12 2 2 1 2 5.500000 6.363961 5.5 1 10 9 4.5000000
13 3 3 1 3 2.666667 2.081666 2.0 1 5 4 1.2018504
You can use the plyr package, which follows the split-apply-combine strategy. The general form is:
ddply(dataframe, .(groupcol), function)
In your case:
library(plyr)
ddply(dataframe, .(custid), summarize, mean = mean(value), median = median(value))
Take a look at the help page for ddply (?ddply); it has a good example.
Source: https://stackoverflow.com/questions/25198442/how-to-calculate-mean-median-per-group-in-a-dataframe-in-r