Question:
I have a dataframe recording how much money a customer spends in detail, like the following:
    custid, value
    1, 1
    1, 3
    1, 2
    1, 5
    1, 4
    1, 1
    2, 1
    2, 10
    3, 1
    3, 2
    3, 5
How can I calculate the characteristics using mean, max, median, std, etc., like the following? Should I use some apply function? And how?
    custid, mean, max, min, median, std
    1, ....
    2, ....
    3, ....
Answer 1:
To add to the alternatives, here's summaryBy from the "doBy" package, with which you can specify a list of functions to apply.

    library(doBy)
    summaryBy(value ~ custid, data = mydf,
              FUN = list(mean, max, min, median, sd))
    #   custid value.mean value.max value.min value.median value.sd
    # 1      1   2.666667         5         1          2.5 1.632993
    # 2      2   5.500000        10         1          5.5 6.363961
    # 3      3   2.666667         5         1          2.0 2.081666
Of course, you can also stick with base R:
myFun
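The base R code for this answer is incomplete here, so the following is only a minimal sketch of one way it could be done in base R, not necessarily what the answer's author wrote. It assumes the question's data is stored in a data frame called `mydf` (the name used above); the helper name `summary_stats` is hypothetical.

```r
# Sample data reconstructed from the question.
mydf <- data.frame(
  custid = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  value  = c(1, 3, 2, 5, 4, 1, 1, 10, 1, 2, 5)
)

# Hypothetical helper: returns a named vector of summary statistics.
summary_stats <- function(x) {
  c(mean = mean(x), max = max(x), min = min(x),
    median = median(x), sd = sd(x))
}

# split() groups the values by customer, lapply() applies the helper to
# each group, and do.call(rbind, ...) stacks the results into a matrix
# with one row per custid and one column per statistic.
stats_by_cust <- do.call(
  rbind,
  lapply(split(mydf$value, mydf$custid), summary_stats)
)
stats_by_cust
```

The result is a numeric matrix whose row names are the customer IDs; wrapping it in `as.data.frame()` should give a data frame if that is more convenient.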
Answer 2:

    library(dplyr)
    dat %>%
      group_by(custid) %>%
      summarise(Mean = mean(value), Max = max(value), Min = min(value),
                Median = median(value), Std = sd(value))
    #  custid     Mean Max Min Median      Std
    #1      1 2.666667   5   1    2.5 1.632993
    #2      2 5.500000  10   1    5.5 6.363961
    #3      3 2.666667   5   1    2.0 2.081666
For bigger datasets, data.table would be faster:

    library(data.table)
    setDT(dat)[, list(Mean = mean(value), Max = max(value), Min = min(value),
                      Median = as.numeric(median(value)), Std = sd(value)),
               by = custid]
    #   custid     Mean Max Min Median      Std
    #1:      1 2.666667   5   1    2.5 1.632993
    #2:      2 5.500000  10   1    5.5 6.363961
    #3:      3 2.666667   5   1    2.0 2.081666
Answer 3:

If you want to apply a larger number of functions to all or the same column(s) with dplyr, I recommend summarise_each or mutate_each:

    require(dplyr)
    dat %>%
      group_by(custid) %>%
      summarise_each(funs(max, min, mean, median, sd), value)
    #Source: local data frame [3 x 6]
    #
    #  custid max min     mean median       sd
    #1      1   5   1 2.666667    2.5 1.632993
    #2      2  10   1 5.500000    5.5 6.363961
    #3      3   5   1 2.666667    2.0 2.081666
Or another option with base R's aggregate:

    aggregate(value ~ custid, data = dat, summary)
    #  custid value.Min. value.1st Qu. value.Median value.Mean value.3rd Qu. value.Max.
    #1      1      1.000         1.250        2.500      2.667         3.750      5.000
    #2      2      1.000         3.250        5.500      5.500         7.750     10.000
    #3      3      1.000         1.500        2.000      2.667         3.500      5.000
(This doesn't include standard deviation but I think it's a nice approach for the other descriptive stats.)
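If the standard deviation is needed as well, aggregate can also take a custom function in place of summary. The following is a rough sketch under that assumption; the data frame is rebuilt from the question's sample data, and the name `result` is only illustrative.

```r
# Sample data reconstructed from the question.
dat <- data.frame(
  custid = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3),
  value  = c(1, 3, 2, 5, 4, 1, 1, 10, 1, 2, 5)
)

# Pass an anonymous function that returns a named vector of statistics.
agg <- aggregate(
  value ~ custid, data = dat,
  FUN = function(x) c(mean = mean(x), max = max(x), min = min(x),
                      median = median(x), sd = sd(x))
)

# aggregate() stores the statistics as a single matrix column named
# "value"; do.call(data.frame, ...) spreads it into ordinary columns
# (value.mean, value.max, ...).
result <- do.call(data.frame, agg)
result
```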
Answer 4:

I like describeBy() from the psych package. Like this:
df
Or get it as a matrix if you prefer that:

    describeBy(df$value, df$custid., mat = TRUE, skew = FALSE)
       item group1 vars n     mean       sd median min max range        se
    11    1      1    1 6 2.666667 1.632993    2.5   1   5     4 0.6666667
    12    2      2    1 2 5.500000 6.363961    5.5   1  10     9 4.5000000
    13    3      3    1 3 2.666667 2.081666    2.0   1   5     4 1.2018504
Answer 5:

You can use the plyr package, which follows the split-apply-combine strategy. The general form is:

    ddply(dataframe, .(groupcol), function)

In your case:

    library(plyr)
    ddply(dataframe, .(custid), summarize,
          "mean" = mean(value), "median" = median(value))

Take a look at the help page for ddply (?ddply); it has a good example.