How do I create a column of means of specific columns in a data.frame?

北城以北 submitted on 2019-12-24 12:31:14

Question


Thanks all for your responses and answers. I can see I've unintentionally left out some important details that may help you understand my problem better. I was trying to keep it simple and generic, but that didn't actually help. Here's an updated version with more information.

I have a data.frame with many columns that came from a NetLogo model generated by BehaviorSpace. Each column is a time series that represents a reported value under different experimental conditions with repetitions represented by the run number and time step number. For example (sorry this is long but I'm trying to give you a flavor for the data):

# Start by building a fake data.frame that models some of the characteristics of mine:
df <- data.frame(run = c(rep(1,5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5), rep(7,5), rep(8,5)))
df2 <- expand.grid(step = 1:5, fac.a = c(10,1000), fac.b = c(0.5,2.0))
df <- data.frame(run = df$run, rep = c(rep(1,20), rep(2,20)), step = df2$step, fac.a = df2$fac.a, fac.b = df2$fac.b)
log_growth <- function (a, b, x) {(1/(1+a*exp(-b*x))) + rnorm(1,0,0.2)}
set.seed(11)
df$treatment1 <- log_growth(df$fac.a, df$fac.b, df$step)
df$treatment2 <- log_growth(df$fac.a / 2, df$fac.b * 2, df$step)

This puts the following into df:

> df
   run rep step fac.a fac.b  treatment1  treatment2
1    1   1    1    10   0.5  0.05288201 0.356176584
2    1   1    2    10   0.5  0.12507561 0.600407158
3    1   1    3    10   0.5  0.22081815 0.804671117
4    1   1    4    10   0.5  0.33627099 0.920093934
5    1   1    5    10   0.5  0.46053940 0.971397427
6    2   1    1  1000   0.5 -0.08700866 0.009396323
7    2   1    2  1000   0.5 -0.08594375 0.018552055
8    2   1    3  1000   0.5 -0.08419297 0.042608835
9    2   1    4  1000   0.5 -0.08131981 0.102435481
10   2   1    5  1000   0.5 -0.07661880 0.232875872
11   3   1    1    10   2.0  0.33627099 0.920093934
12   3   1    2    10   2.0  0.75654214 1.002314651
13   3   1    3    10   2.0  0.88715737 1.003958435
14   3   1    4    10   2.0  0.90800192 1.003988593
15   3   1    5    10   2.0  0.91089154 1.003989145
16   4   1    1  1000   2.0 -0.08131981 0.102435481
17   4   1    2  1000   2.0 -0.03688314 0.860350536
18   4   1    3  1000   2.0  0.19880473 1.000926458
19   4   1    4  1000   2.0  0.66014952 1.003932891
20   4   1    5  1000   2.0  0.86791705 1.003988125
21   5   2    1    10   0.5  0.05288201 0.356176584
22   5   2    2    10   0.5  0.12507561 0.600407158
23   5   2    3    10   0.5  0.22081815 0.804671117
24   5   2    4    10   0.5  0.33627099 0.920093934
25   5   2    5    10   0.5  0.46053940 0.971397427
26   6   2    1  1000   0.5 -0.08700866 0.009396323
27   6   2    2  1000   0.5 -0.08594375 0.018552055
28   6   2    3  1000   0.5 -0.08419297 0.042608835
29   6   2    4  1000   0.5 -0.08131981 0.102435481
30   6   2    5  1000   0.5 -0.07661880 0.232875872
31   7   2    1    10   2.0  0.33627099 0.920093934
32   7   2    2    10   2.0  0.75654214 1.002314651
33   7   2    3    10   2.0  0.88715737 1.003958435
34   7   2    4    10   2.0  0.90800192 1.003988593
35   7   2    5    10   2.0  0.91089154 1.003989145
36   8   2    1  1000   2.0 -0.08131981 0.102435481
37   8   2    2  1000   2.0 -0.03688314 0.860350536
38   8   2    3  1000   2.0  0.19880473 1.000926458
39   8   2    4  1000   2.0  0.66014952 1.003932891
40   8   2    5  1000   2.0  0.86791705 1.003988125

Previously I split the data frame with by() and tried to obtain the mean and standard deviation for every step (it is a time series) and each combination of factors.

After having looked at all your answers and having reconsidered my problem, I think what I'm trying to do would be better handled during the conversion process of by. I'm not exactly sure how to do that... What I want the output to look like is a summary of sorts:

> df
   run fac.a fac.b  mean.treatment1  mean.treatment2 sd.treatment1 sd.treatment2
1    1    10   0.5        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1    10   2.0        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1  1000   0.5        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1  1000   2.0        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx

Is this a job for aggregate? Thanks for your patience and help. -- Glenn


Original question:

I have a data.frame with many columns, each of which represents a specific experimental condition with repetitions.

> df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5), mean.a = 0, mean.b = 0, sd.a = 0, sd.b = 0)
> df
        a.1       b.1       a.2       b.2 mean.a mean.b   sd.a   sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827      0      0      0      0
2 0.4171254 0.4883140 0.8282384 0.1215129      0      0      0      0
3 0.2291582 0.9419946 0.4089008 0.5665242      0      0      0      0
4 0.3807868 0.1889066 0.8271075 0.4022014      0      0      0      0
5 0.5863078 0.4991847 0.4082745 0.5637367      0      0      0      0

I want to find means and standard deviations for each condition and repetition. So far the most direct way seems to be:

for (i in c("a.1", "a.2")) {df$mean.a <- df$mean.a + df[[i]]}
df$mean.a <- df$mean.a / 2

But I have a lot of columns, and they are all over the place, so this seems really labor intensive and manual. A little nicer method is to use ave():

df$mean.a <- with (df, ave(a.1, a.2))

But if I want to do sd() instead, I mysteriously get NAs:

df$sd.a <- with (df, ave(a.1, a.2, FUN = sd))
> df
        a.1       b.1       a.2       b.2    mean.a mean.b   sd.a   sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827 0.9209433      0     NA      0
2 0.4171254 0.4883140 0.8282384 0.1215129 0.4171254      0     NA      0
3 0.2291582 0.9419946 0.4089008 0.5665242 0.2291582      0     NA      0
4 0.3807868 0.1889066 0.8271075 0.4022014 0.3807868      0     NA      0
5 0.5863078 0.4991847 0.4082745 0.5637367 0.5863078      0     NA      0

I would prefer not to use external packages if possible, but it seems like I'm missing something basic. This question was similar, but had to do with data.tables, not data.frames.

Another was even closer, but with ave() it is tedious to specify, for instance, columns 1-12, 15-17, and 26 as the subject columns, and, mysteriously, sd() produces those NAs. It seems like there should be a straightforward way to do this. Almost makes me wish for Excel. :-)


Answer 1:


Let us first bring your data into an acceptable format. Note that this solution does, against your initial requirements, indeed rely on external libraries, but they are very common and true timesavers today! (plyr and reshape2 by Hadley Wickham, who is a phenomenon in the R community)

# Note how I only used the data columns, initially, there is no mean and sd column in the data frame used at this stage.
df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))

df$repetition = c(1:nrow(df))
library(reshape2)
tmp = melt(df, id.vars = "repetition")
names(tmp)[2] = "condition"

tmp$treatment = substring(tmp$condition,1,1)

This yields:

> head(tmp)
  repetition condition     value treatment
1          1       a.1 0.6668952         a
2          2       a.1 0.1248151         a
3          3       a.1 0.7082199         a
4          4       a.1 0.9840956         a
5          5       a.1 0.4479190         a
6          1       b.1 0.9381539         b

Now the rest is easy; we rely on the popular plyr package:

library(plyr)
results = ddply(tmp, .(repetition, treatment), summarize, mean = mean(value), sd = sd(value) )

The final result is

> head(results)
  repetition treatment      mean         sd
1          1         a 0.6777342 0.01532853
2          1         b 0.6734955 0.37428353
3          2         a 0.4533126 0.46456561
4          2         b 0.8441925 0.07260509
5          3         a 0.3967338 0.44050779
6          3         b 0.5886821 0.42635902

Let's hope this is what you were looking for.

One more addition: if you do not want to distinguish between repetitions and would rather summarize at the treatment level:

# addition
results = ddply(tmp, .( treatment), summarize, mean = mean(value), sd = sd(value) )

and the result:

> head(results)
  treatment      mean        sd
1         a 0.5817867 0.2954151
2         b 0.6212537 0.3219035



Answer 2:


Ignoring the "base-only" requirement to whip the data into shape, using tidyr and the pipe operator from magrittr:

library(tidyr)     # gather(), separate()
library(magrittr)  # %>%
set.seed(42)
df  <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))
df2 <- df %>%
  gather(treatment, value) %>%
  separate(treatment, c("treatment", "repetition"))
head(df2)
#    treatment repetition      value
# 1          a          1 0.13871017
# 2          a          1 0.98889173
# 3          a          1 0.94666823
# 4          a          1 0.08243756
# 5          a          1 0.51421178
# 6          b          1 0.39020347

Now, I'm not sure what exactly you're trying to get the average and standard deviation of, but one easy option is aggregate() from base R. Simply pass the function you'd like through the FUN parameter:

# calculate mean on treatment (a or b)
aggregate(df2$value, by = list(treatment = df2$treatment), FUN = mean)
#   treatment         x
# 1         a 0.5392117
# 2         b 0.5429444

# calculate mean on treatment and repetition
aggregate(df2$value, by = list(treatment = df2$treatment, repetition = df2$repetition), FUN = mean)
#   treatment repetition         x
# 1         a          1 0.5341839
# 2         b          1 0.6633022
# 3         a          2 0.5442395
# 4         b          2 0.4225865
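
For the updated data frame in the question (columns fac.a, fac.b, treatment1, treatment2), the same aggregate() idea can probably return the mean and the standard deviation in one call. What follows is only a minimal sketch: it rebuilds a small stand-in data frame with made-up rnorm() values rather than the real BehaviorSpace output, and the helper name mean_sd is hypothetical.

```r
# Stand-in for the question's data: 2 x 2 factor combinations, 5 steps, 2 reps.
# The treatment values are random placeholders, not the question's numbers.
set.seed(11)
df <- expand.grid(step = 1:5, fac.a = c(10, 1000), fac.b = c(0.5, 2.0), rep = 1:2)
df$treatment1 <- rnorm(nrow(df))
df$treatment2 <- rnorm(nrow(df))

# Hypothetical helper: returns both statistics for one group of values.
mean_sd <- function(x) c(mean = mean(x), sd = sd(x))

# One row per fac.a/fac.b combination; add `step` to the right-hand side
# of the formula to summarize each time step separately.
res <- aggregate(cbind(treatment1, treatment2) ~ fac.a + fac.b,
                 data = df, FUN = mean_sd)

# aggregate() stores each result as a matrix column; flatten to plain columns.
res <- do.call(data.frame, res)
res
```

The formula interface keeps the grouping columns in the result, so the output already has the fac.a and fac.b columns asked for in the question.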



Answer 3:


Based on the code you showed, maybe this base R method would help:

 set.seed(42)
 df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))
   do.call(cbind,
     lapply(split(seq_along(df),gsub("\\..*", "",colnames(df))), function(x) {
        x1 <- df[,x]
        data.frame(Means=rowMeans(x1, na.rm=TRUE), SD=apply(x1, 1, sd, na.rm=TRUE))}))
  #  a.Means      a.SD   b.Means       b.SD
  #1 0.6862739 0.3231932 0.7295552 0.29763438
  #2 0.8280938 0.1541232 0.8574074 0.17086395
  #3 0.6104059 0.4585819 0.1260770 0.01214755
  #4 0.5429382 0.4065997 0.5659947 0.12869005
  #5 0.5520192 0.1268922 0.6326988 0.10234101

Using your code, I get the same result for the a columns:

  vec1 <- vector("numeric", length=5)
  for(i in c("a.1", "a.2")) {vec1 <- vec1+df[[i]]}
  vec1/2
  #[1] 0.6862739 0.8280938 0.6104059 0.5429382 0.5520192
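
As for the NA values in the question: ave(x, ...) treats every argument after the first as a grouping variable, so ave(a.1, a.2, FUN = sd) groups a.1 by the distinct values of a.2. With continuous data each group holds a single value, whose mean is the value itself (which is why mean.a simply repeated a.1) and whose sd is NA. A row-wise sketch that selects the columns explicitly, by name or by position, might look like this; the vector name cols is only illustrative.

```r
set.seed(42)
df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))

# ave() groups a.1 by a.2, so every group has one value and sd() gives NA.
with(df, ave(a.1, a.2, FUN = sd))

# Pick the columns to combine; positions such as c(1:12, 15:17, 26)
# would work the same way on a wider data frame.
cols <- c("a.1", "a.2")
df$mean.a <- rowMeans(df[, cols])
df$sd.a   <- apply(df[, cols], 1, sd)
df
```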


Source: https://stackoverflow.com/questions/25900795/how-do-i-create-a-column-of-means-of-specific-columns-in-a-data-frame
