Produce a precision weighted average among rows with repeated observations

Question

I have a dataframe similar to the one generated below. Some individuals have more than one observation for a particular variable and each variable has an associated standard error (SE) for the estimate. I would like to create a new dataframe that contains only a single row for each individual. For individuals with more than one observation, such as Kim or Bob, I need to calculate a precision weighted average based on the standard errors of the estimates along with a variance for the newly calculated weighted mean. For example, for Bob, for var1, this means that I would want his var1 value in the new dataframe to be:

weighted.mean(c(example$var1[2], example$var1[10]), 
   c(1/example$SE1[2], 1/example$SE1[10]))

and for Bob's new SE1, which would be the variance of the weighted mean, to be:

1/sum(1/example$SE1[2] + 1/example$SE1[10])

I have tried the aggregate function and can calculate the arithmetic mean of the values, but this simple call neither uses the standard errors nor handles the NAs:

aggregate(example[,1:4], by = list(example[,5]), mean)

I would appreciate any help developing code for this problem. Here is the example dataset:

set.seed(1562)
example <- data.frame(var1 = rnorm(10, 8, 2))
example$SE1 <- rnorm(10, 2, 1)
example$var2 <- rnorm(10, 8, 2)
example$SE2 <- rnorm(10, 2, 1)
example$id <- c("Kim", "Bob", "Joe", "Sam", "Kim", "Kim", "Joe", "Sara", "Jeff", "Bob")
example$SE1[5] <- NA
example$var1[5] <- NA
example$SE2[10] <- NA
example$var2[10] <- NA
example

       var1      SE1      var2        SE2   id
1   9.777769 2.451406  6.363250  2.2739566  Kim
2   8.753078 2.174308  6.219770  1.4978380  Bob
3   7.977356 2.107739  6.835998  2.1647437  Joe
4  11.113048 2.713242 11.091650  1.7018666  Sam
5         NA       NA 11.769884 -0.1310218  Kim
6   5.271308 1.831475  6.818854  3.0294338  Kim
7   7.770062 2.094850  6.387607  0.2272348  Joe
8   9.837612 1.956486  8.517445  3.5126378 Sara
9   4.637518 2.516896  7.173460  2.0292454 Jeff
10  9.004425 1.592312        NA         NA  Bob

Answer 1:

I like the plyr package for these sorts of problems. It should be functionally equivalent to aggregate, but I find it more convenient to use. There are lots of examples and a great ~20 page intro to plyr on the website. For this problem, since the data starts as a data.frame and you want another data.frame on the other end, we use ddply():

library(plyr)
# Wrapped in f1() so it can be reused in the benchmark below
f1 <- function() {
  ddply(example, "id", summarize,
        newMean = weighted.mean(x = var1, w = 1/SE1, na.rm = TRUE),
        newSE = 1/sum(1/SE1, na.rm = TRUE))
}
f1()

Which returns:

    id newMean   newSE
1  Bob  8.8982 0.91917
2 Jeff  4.6375 2.51690
3  Joe  7.8734 1.05064
4  Kim  7.1984 1.04829
5  Sam 11.1130 2.71324
6 Sara  9.8376 1.95649

Also check out ?summarize and ?transform for some other good background. You can also pass an anonymous function to the plyr functions if necessary for more complicated tasks.
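As an illustration of the anonymous-function route, here is a minimal sketch that summarizes both var1 and var2 in one pass. The helper name pw is hypothetical (it is not part of plyr). It keeps the question's 1/SE weights and drops any value/SE pair in which either element is NA. Note that the simulated SE2 contains a negative value (row 5), which is not a valid standard error; this sketch does not guard against that.

```r
library(plyr)

# Recreate the example data from the question
set.seed(1562)
example <- data.frame(var1 = rnorm(10, 8, 2))
example$SE1 <- rnorm(10, 2, 1)
example$var2 <- rnorm(10, 8, 2)
example$SE2 <- rnorm(10, 2, 1)
example$id <- c("Kim", "Bob", "Joe", "Sam", "Kim", "Kim", "Joe", "Sara", "Jeff", "Bob")
example$SE1[5] <- NA; example$var1[5] <- NA
example$SE2[10] <- NA; example$var2[10] <- NA

# Hypothetical helper: weighted mean and the question's 1/sum(weights)
# formula for one value/SE pair of columns, with weights 1/SE
pw <- function(x, se) {
  ok <- !is.na(x) & !is.na(se)   # keep only complete value/SE pairs
  w <- 1 / se[ok]
  c(mean = sum(w * x[ok]) / sum(w), se = 1 / sum(w))
}

# The anonymous function receives the rows for one id as a data.frame
# and returns a one-row data.frame; ddply() binds the rows together
result <- ddply(example, "id", function(d) {
  v1 <- pw(d$var1, d$SE1)
  v2 <- pw(d$var2, d$SE2)
  data.frame(var1 = v1[["mean"]], SE1 = v1[["se"]],
             var2 = v2[["mean"]], SE2 = v2[["se"]])
})
result
```

The result has one row per id, with the same column names as the input, so it can stand in for the original data.frame downstream.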

Alternatively, use the data.table package, which can be faster for some tasks:

library(data.table)
dt <- data.table(example, key = "id")
# Wrapped in f2() so it can be reused in the benchmark below
f2 <- function() {
  dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
            newSE = 1/sum(1/SE1, na.rm = TRUE)),
     by = "id"]
}
f2()

A quick benchmark:

library(rbenchmark)
# f1 = plyr, f2 = data.table
benchmark(f1(), f2(), 
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))

      test elapsed relative
    2 f2()   3.580   1.0000
    1 f1()   6.398   1.7872

So data.table is ~1.8x faster than plyr for this dataset on my simple laptop.
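If neither package is available, the same per-id computation can be sketched in base R with split() and lapply(). This is only an illustration under the question's 1/SE weighting; the helper name pw_one is hypothetical, and it handles var1 only.

```r
# Recreate the example data from the question
set.seed(1562)
example <- data.frame(var1 = rnorm(10, 8, 2))
example$SE1 <- rnorm(10, 2, 1)
example$var2 <- rnorm(10, 8, 2)
example$SE2 <- rnorm(10, 2, 1)
example$id <- c("Kim", "Bob", "Joe", "Sam", "Kim", "Kim", "Joe", "Sara", "Jeff", "Bob")
example$SE1[5] <- NA; example$var1[5] <- NA
example$SE2[10] <- NA; example$var2[10] <- NA

# Hypothetical helper: summarize the rows belonging to a single id
pw_one <- function(d) {
  ok <- !is.na(d$var1) & !is.na(d$SE1)   # drop incomplete value/SE pairs
  w <- 1 / d$SE1[ok]                     # weights as defined in the question
  data.frame(id = d$id[1],
             newMean = sum(w * d$var1[ok]) / sum(w),
             newSE = 1 / sum(w))
}

# Split by id, summarize each piece, and stack the one-row results
base_result <- do.call(rbind, lapply(split(example, example$id), pw_one))
rownames(base_result) <- NULL
base_result
```

split() orders the groups by id, so the rows should line up with the plyr output shown earlier.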

Source: https://stackoverflow.com/questions/10407486/produce-a-precision-weighted-average-among-rows-with-repeated-observations

Tags

plyr

data.table