Question
I have a data frame similar to the one generated below. Some individuals have more than one observation for a particular variable, and each variable has an associated standard error (SE) for the estimate. I would like to create a new data frame that contains only a single row for each individual. For individuals with more than one observation, such as Kim or Bob, I need to calculate a precision-weighted average based on the standard errors of the estimates, along with a variance for the newly calculated weighted mean. For example, for Bob's var1, I would want the value in the new data frame to be:
weighted.mean(c(example$var1[2], example$var1[10]),
c(1/example$SE1[2], 1/example$SE1[10]))
and for Bob's new SE1, which would be the variance of the weighted mean, to be:
1/sum(1/example$SE1[2] + 1/example$SE1[10])
I have tried using the aggregate function and am able to calculate the arithmetic mean of the values, but this simple call neither uses the standard errors nor deals with the NAs.
aggregate(example[,1:4], by = list(example[,5]), mean)
I would appreciate any help developing code to work through this problem. Here is the example dataset.
set.seed(1562)
example=data.frame(rnorm(10,8,2))
colnames(example)[1]=("var1")
example$SE1=rnorm(10,2,1)
example$var2=rnorm(10,8,2)
example$SE2=rnorm(10,2,1)
example$id=c("Kim","Bob","Joe","Sam","Kim","Kim","Joe","Sara","Jeff","Bob")
example$SE1[5]=NA
example$var1[5]=NA
example$SE2[10]=NA
example$var2[10]=NA
example
        var1      SE1      var2        SE2   id
1   9.777769 2.451406  6.363250  2.2739566  Kim
2   8.753078 2.174308  6.219770  1.4978380  Bob
3   7.977356 2.107739  6.835998  2.1647437  Joe
4  11.113048 2.713242 11.091650  1.7018666  Sam
5         NA       NA 11.769884 -0.1310218  Kim
6   5.271308 1.831475  6.818854  3.0294338  Kim
7   7.770062 2.094850  6.387607  0.2272348  Joe
8   9.837612 1.956486  8.517445  3.5126378 Sara
9   4.637518 2.516896  7.173460  2.0292454 Jeff
10  9.004425 1.592312        NA         NA  Bob
Answer 1:
I like the plyr package for these sorts of problems. It should be functionally equivalent to aggregate, but I think it is nice and convenient to use. There are lots of examples and a great ~20-page intro to plyr on the website. For this problem, since the data starts as a data.frame and you want another data.frame on the other end, we use ddply():
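library(plyr)
# f1(): the plyr version, wrapped in a function so the benchmark below can call it
f1 <- function() {
  ddply(example, "id", summarize,
        newMean = weighted.mean(x = var1, w = 1/SE1, na.rm = TRUE),
        newSE = 1/sum(1/SE1, na.rm = TRUE))
}
f1()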
Which returns:
    id newMean   newSE
1  Bob  8.8982 0.91917
2 Jeff  4.6375 2.51690
3  Joe  7.8734 1.05064
4  Kim  7.1984 1.04829
5  Sam 11.1130 2.71324
6 Sara  9.8376 1.95649
Also check out ?summarize and ?transform for some other good background. You can also pass an anonymous function to the plyr functions if necessary for more complicated tasks.
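As a rough sketch of the anonymous-function route, the snippet below collapses both variable/SE pairs in one ddply() call. The helper name pw is hypothetical, and dropping rows whose SE is missing or non-positive is an assumption made here (the simulated data contain one negative SE2); the weights are 1/SE, as in the question. The data frame holds only the Bob and Kim rows copied from the printed example.

```r
library(plyr)

# Bob and Kim rows copied from the printed example data
example <- data.frame(
  var1 = c(9.777769, 8.753078, NA, 5.271308, 9.004425),
  SE1  = c(2.451406, 2.174308, NA, 1.831475, 1.592312),
  var2 = c(6.363250, 6.219770, 11.769884, 6.818854, NA),
  SE2  = c(2.2739566, 1.4978380, -0.1310218, 3.0294338, NA),
  id   = c("Kim", "Bob", "Kim", "Kim", "Bob")
)

# Hypothetical helper: weighted mean and its "variance" with weights 1/SE
pw <- function(x, se) {
  # Assumption: ignore rows with a missing value, a missing SE, or SE <= 0
  ok <- !is.na(x) & !is.na(se) & se > 0
  if (!any(ok)) return(c(mean = NA_real_, se = NA_real_))
  w <- 1 / se[ok]
  c(mean = sum(w * x[ok]) / sum(w), se = 1 / sum(w))
}

# The anonymous function receives one sub-data.frame per id
# and returns a one-row data.frame
res <- ddply(example, "id", function(d) {
  v1 <- pw(d$var1, d$SE1)
  v2 <- pw(d$var2, d$SE2)
  data.frame(var1 = v1[["mean"]], SE1 = v1[["se"]],
             var2 = v2[["mean"]], SE2 = v2[["se"]])
})
res
```

For var1 this should agree with the summarize version above; the anonymous function is mainly useful once the per-group logic no longer fits in a single expression.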
Or use the data.table package, which can prove faster for some tasks:
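library(data.table)
dt <- data.table(example, key = "id")
# f2(): the data.table version, wrapped in a function so the benchmark below can call it
f2 <- function() {
  dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
            newSE = 1/sum(1/SE1, na.rm = TRUE)),
     by = "id"]
}
f2()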
A quick benchmark:
library(rbenchmark)
#f1 = plyr, #f2 = data.table
benchmark(f1(), f2(),
replications = 1000,
order = "elapsed",
columns = c("test", "elapsed", "relative"))
  test elapsed relative
2 f2()   3.580   1.0000
1 f1()   6.398   1.7872
So data.table is ~1.8x faster than plyr for this dataset on my simple laptop.
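If neither package is available, the same grouping can be sketched in base R with split() and lapply(). This is only an illustration, not part of the original answer: the object names by_id, rows and result are made up here, and the data frame holds just the Bob, Joe and Kim rows for var1 copied from the printed example.

```r
# Bob, Joe and Kim rows (var1 and SE1 only) copied from the printed example data
example <- data.frame(
  var1 = c(9.777769, 8.753078, 7.977356, NA, 5.271308, 7.770062, 9.004425),
  SE1  = c(2.451406, 2.174308, 2.107739, NA, 1.831475, 2.094850, 1.592312),
  id   = c("Kim", "Bob", "Joe", "Kim", "Kim", "Joe", "Bob")
)

# One sub-data.frame per individual
by_id <- split(example, example$id)

rows <- lapply(names(by_id), function(nm) {
  d <- by_id[[nm]]
  keep <- !is.na(d$var1) & !is.na(d$SE1)   # drop incomplete observations
  # Weights as in the question; conventional inverse-variance
  # weights would be 1 / SE^2 instead
  w <- 1 / d$SE1[keep]
  data.frame(id = nm,
             newMean = weighted.mean(d$var1[keep], w),
             newSE = 1 / sum(w))
})

# Stack the one-row results into a single data.frame
result <- do.call(rbind, rows)
result
```

The values for Bob, Joe and Kim should match the ddply() output above, since the weights and NA handling are the same.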
Source: https://stackoverflow.com/questions/10407486/produce-a-precision-weighted-average-among-rows-with-repeated-observations