Adding a non-aggregated column to an aggregated data set based on the aggregation of another column

Submitted by …衆ロ難τιáo~ on 2019-12-10 20:11:48

Question


Is it possible to use the aggregate function to add another column from the original data frame, without actually using that column to aggregate the data?

This is a very simplified version of the data that will help illustrate my question (let's call it data):

name      result.1    result.2    replicate    day     data.for.mean
"obj.1"   1           "good"      1            1        5
"obj.1"   1           "good"      2            1        7
"obj.1"   1           "great"     1            2        6
"obj.1"   1           "good"      2            2        9
"obj.1"   2           "bad"       1            1        10
"obj.1"   2           "not good"  2            1        6
"obj.1"   2           "bad"       1            2        5
"obj.1"   2           "not good"  2            2        3

"obj.2"   1           "excellent" 1            1        14
"obj.2"   1           "good"      2            1        10
"obj.2"   1           "good"      1            2        11
"obj.2"   1           "not bad"   2            2        7
"obj.2"   2           "bad"       1            1        4
"obj.2"   2           "bad"       2            1        3
"obj.2"   2           "horrible"  1            2        2
"obj.2"   2           "dismal"    2            2        1

You'll notice that result.1 and result.2 are tied: if result.1 == 1, result.2 is some variant of good (good/great/excellent/not bad), and if result.1 == 2, result.2 is some variant of bad (bad/not good/horrible/dismal). I need both of these columns in the aggregated data set, and it doesn't matter which value of result.2 is picked when the data is aggregated; I just need enough information to tell whether a given value of result.1 means good or bad. So every row with result.1 == 2 could, for example, show "dismal".

The problem is that, since result.2 uses different names to identify good/bad, I cannot use it as a column to aggregate by.
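
For anyone who wants to reproduce this, the mock data above can be rebuilt roughly as follows. This is only a sketch: the column types (character labels, numeric values) are an assumption, since the table does not state them.

```r
# Rebuild the mock data set from the table above.
# Assumption: result.2 is stored as character, not as a factor.
data <- data.frame(
  name = rep(c("obj.1", "obj.2"), each = 8),
  result.1 = rep(rep(1:2, each = 4), times = 2),
  result.2 = c("good", "good", "great", "good",
               "bad", "not good", "bad", "not good",
               "excellent", "good", "good", "not bad",
               "bad", "bad", "horrible", "dismal"),
  replicate = rep(1:2, times = 8),
  day = rep(rep(1:2, each = 2), times = 4),
  data.for.mean = c(5, 7, 6, 9, 10, 6, 5, 3,
                    14, 10, 11, 7, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Quick look at the structure
str(data)
```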

Currently my aggregate function looks like this...

# Average data.for.mean within each name / result.1 / day group
aggregated.data <- aggregate(data[c("data.for.mean")],
            by = data[c("name", "result.1", "day")],
            FUN = mean)

which would give one line of output such as this...

name     result.1    day    data.for.mean
"obj.1"  1           1      6

(All of the replicates for obj.1 with result.1 == 1 on day 1 have been averaged. They had values of 5 and 7 and were the first two rows in my mock data set.)

What I would like would produce a line of output such as this

name     result.1    result.2    day    data.for.mean
"obj.1"  1           "good"      1      6

Again, "good" could be replaced with "great", "not bad", "excellent", for all values which correspond to result.1's value of '1'.

What would be the best method of capturing information from result.2 and adding it to aggregated.data (the output of the aggregate function)?

Thank you.


Answer 1:


Here's a solution in base R, which uses merge followed by another aggregate:

agg.2 <- merge(aggregated.data, data[,names(data) != 'data.for.mean'])
aggregate(result.2 ~ name+result.1+day+data.for.mean, data=agg.2, FUN=sample, size=1)
##    name result.1 day data.for.mean  result.2
## 1 obj.2        2   2           1.5    dismal
## 2 obj.2        2   1           3.5       bad
## 3 obj.1        2   2           4.0       bad
## 4 obj.1        1   1           6.0      good
## 5 obj.1        1   2           7.5     great
## 6 obj.1        2   1           8.0  not good
## 7 obj.2        1   2           9.0   not bad
## 8 obj.2        1   1          12.0 excellent

Here's how this works:

The merge adds in the result.2 values, but will create multiple rows where there are multiple such values. Then aggregate is used to select one of these rows.
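
To make the row duplication concrete, here is a small sketch on a toy subset of the question's data (only obj.1 with result.1 == 1; the subset itself is an illustration, not part of the original answer). It compares the row counts before and after the merge:

```r
# Toy subset of the question's data: obj.1, result.1 == 1, both days
data <- data.frame(
  name = "obj.1",
  result.1 = 1,
  result.2 = c("good", "good", "great", "good"),
  replicate = c(1, 2, 1, 2),
  day = c(1, 1, 2, 2),
  data.for.mean = c(5, 7, 6, 9),
  stringsAsFactors = FALSE
)

# One row per name / result.1 / day group
aggregated.data <- aggregate(data["data.for.mean"],
                             by = data[c("name", "result.1", "day")],
                             FUN = mean)

# merge joins on the shared columns (name, result.1, day), so every
# original row reappears, now carrying its group mean
agg.2 <- merge(aggregated.data, data[, names(data) != "data.for.mean"])

nrow(aggregated.data)  # 2 groups
nrow(agg.2)            # 4 rows, one per original observation
```

The second aggregate in the answer then collapses these repeated rows back to one per group.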

As you say you don't care which of the relevant result.2 labels you get, I'm getting one at random with sample.

To return the first result.2 label, use head with n=1 instead:

aggregate(result.2 ~ name+result.1+day+data.for.mean, data=agg.2, FUN=head, n=1)

Similarly, to get the last such label, use tail with n=1.
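
As a rough illustration of the difference between the two, the sketch below applies head and tail to a hand-built stand-in for agg.2 containing a single group with two labels. The stand-in data frame and the names first.label and last.label are made up for this example.

```r
# Hand-built stand-in for agg.2: one group that carries two result.2 labels
agg.2 <- data.frame(
  name = "obj.1",
  result.1 = 1,
  day = 2,
  data.for.mean = 7.5,
  result.2 = c("great", "good"),
  stringsAsFactors = FALSE
)

# Keep the first label seen in each group
first.label <- aggregate(result.2 ~ name + result.1 + day + data.for.mean,
                         data = agg.2, FUN = head, n = 1)

# Keep the last label seen in each group
last.label <- aggregate(result.2 ~ name + result.1 + day + data.for.mean,
                        data = agg.2, FUN = tail, n = 1)

first.label$result.2  # "great"
last.label$result.2   # "good"
```

Note that "first" and "last" refer to the row order of agg.2 after the merge, which is not guaranteed to match the row order of the original data.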




Answer 2:


How about this with dplyr:

library(dplyr)
# Take the group mean and keep the first result.2 label in each group
group_by(data, name, result.1, day) %>% summarise(mean = mean(data.for.mean), result.2 = result.2[1])


#Source: local data frame [8 x 5]
#Groups: name, result.1

#   name result.1 day mean  result.2
#1 obj.2        1   2  9.0      good
#2 obj.2        1   1 12.0 excellent
#3 obj.1        1   1  6.0      good
#4 obj.1        1   2  7.5     great
#5 obj.1        2   2  4.0       bad
#6 obj.1        2   1  8.0       bad
#7 obj.2        2   2  1.5  horrible
#8 obj.2        2   1  3.5       bad


Source: https://stackoverflow.com/questions/21397058/adding-a-non-aggregated-column-to-an-aggregated-data-set-based-on-the-aggregatio
