问题
Issues regarding the command by
and weighted.mean
already exist but none was able to help solving my problem. I am new to R and am more used to data mining language than programming.
I have a data frame with for each individual (observation/row) the income, education level and sample weight. I want to calculate the weighted mean of income by education level, and I want the result to be associated to each individual in a new column of my original data frame, like this:
obs income education weight incomegroup
1. 1000 A 10 --> display weighted mean of income for education level A
2. 2000 B 1 --> display weighted mean of income for education level B
3. 1500 B 5 --> display weighted mean of income for education level B
4. 2000 A 2 --> display weighted mean of income for education level A
I tried:
data$incomegroup=by(data$education, function(x) weighted.mean(data$income, data$weight))
It does not work. The weighted mean is calculated somehow and appears in column "incomegroup" but for the whole set instead of by group or for one group only, I don't know. I read things regarding packages plyr
or aggregate
but it does not seem to do what I am interested in.
The ave{stats}
command gives exactly what I am looking for but only for simple mean:
data$incomegroup=ave(data$income,data$education,FUN = mean)
It cannot be used with weights.
Thanking you in advance for your help!
回答1:
If we use mutate
, then we can avoid the left_join
library(dplyr)
df %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight))
# obs income education weight weighted_income
# <int> <int> <fctr> <int> <dbl>
#1 1 1000 A 10 1166.667
#2 2 2000 B 1 1583.333
#3 3 1500 B 5 1583.333
#4 4 2000 A 2 1166.667
回答2:
Try using the dplyr package as follows:
df <- read.table(text = 'obs income education weight
1 1000 A 10
2 2000 B 1
3 1500 B 5
4 2000 A 2',
header = TRUE)
library(dplyr)
df_summary <-
df %>%
group_by(education) %>%
summarise(weighted_income = weighted.mean(income, weight))
df_summary
# education weighted_income
# A 1166.667
# B 1583.333
df_final <- left_join(df, df_summary, by = 'education')
df_final
# obs income education weight weighted_income
# 1 1000 A 10 1166.667
# 2 2000 B 1 1583.333
# 3 1500 B 5 1583.333
# 4 2000 A 2 1166.667
回答3:
There is a function weighted.mean
in base R. Unfortunately, it work easily with ave
. One solution is to use data.table
library(data.table)
setDT(data)
data[, incomeGroup := weighted.mean(income, weight), by=education]
data
income education weight incomeGroup
1: 1000 A 10 1166.667
2: 2000 B 1 1583.333
3: 1500 B 5 1583.333
4: 2000 A 2 1166.667
A bizarre method that does work with ave
is
ave(df[c("income", "weight")], df$education,
FUN=function(x) weighted.mean(x$income, x$weight))[[1]]
[1] 1166.667 1583.333 1583.333 1166.667
You feed the subset data.frame to the function and then group by your grouping variable. The FUN argument creates a function that takes a data.frame and applies weighted.mean
to the result. As the final output is a data.frame, the [[1]]
returns a vector with the desired result.
Note that this is just a proof that this is possible -- I wouldn't recommend this method, the data.frame
technique is much cleaner and will be much faster on data sets larger than 1000 observations.
来源:https://stackoverflow.com/questions/38509139/display-weighted-mean-by-group-in-the-data-frame