Must ddply use all possible combinations of the splitting variable(s), or only observed?

若如初见. Submitted on 2019-12-11 12:55:53

Question


I have a data frame called thetas containing about 2.7 million observations.

> str(thetas)
'data.frame':   2700000 obs. of  8 variables:
 $ rho_cnd   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pct_cnd   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ sx        : num  1 2 3 4 5 6 7 8 9 10 ...
 $ model     : Factor w/ 7 levels "dN.mN","dN.mL",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ estTheta  : num  -1.58 -1.716 0.504 -2.296 0.98 ...
 $ trueTheta : num  0.0962 -3.3913 3.6006 -0.1971 2.1906 ...
 $ estError  : num  -1.68 1.68 -3.1 -2.1 -1.21 ...
 $ trueAberSx: num  0 0 0 0 0 0 0 0 0 0 ...

I would like to use ddply, or some similar function, to sum the error of estimation (the column estError in my data frame), but where the sums are within each condition of my simulation. The problem is, I don't have a simple way to combine values from the other columns of this data frame to uniquely identify all those conditions. To be more specific: the column model contains 7 possible values. Three of these possible values are only matched up with one possible value in each of rho_cnd and pct_cnd, while the other four possible values of model are matched up with 6 possible pairings of values in rho_cnd and pct_cnd.

The obvious solution, I know, would be to go back and make a variable that uniquely identifies all the conditions that I would need to identify here, so that the following code would work:

> sums <- ddply(thetas, .(condition1, condition2), summarise, sumError = sum(estError))  # plus any further condition columns

But I would rather not go back and change how this data frame is built. Right now I have two data frames, created with two separate calls to expand.grid, that are then combined with rbind and sorted to create a data frame listing all valid conditions. Even if I kept those few lines of code in, I'm not sure how to reference them with ddply. I would rather not use this solution at all, but I will if necessary.

> conditions 
   models rhos pcts
1   dN.mN  0.0 0.00
2   dN.mL  0.0 0.00
3   dN.mH  0.0 0.00
4   dL.mN  0.1 0.01
12  dL.mN  0.1 0.02
20  dL.mN  0.1 0.10
8   dL.mN  0.2 0.01
16  dL.mN  0.2 0.02
24  dL.mN  0.2 0.10
5   dL.mL  0.1 0.01
13  dL.mL  0.1 0.02
21  dL.mL  0.1 0.10
9   dL.mL  0.2 0.01
17  dL.mL  0.2 0.02
25  dL.mL  0.2 0.10
6   dH.mN  0.1 0.01
14  dH.mN  0.1 0.02
22  dH.mN  0.1 0.10
10  dH.mN  0.2 0.01
18  dH.mN  0.2 0.02
26  dH.mN  0.2 0.10
7   dH.mH  0.1 0.01
15  dH.mH  0.1 0.02
23  dH.mH  0.1 0.10
11  dH.mH  0.2 0.01
19  dH.mH  0.2 0.02
27  dH.mH  0.2 0.10

Any advice for better code and/or more efficiency? Thanks!


Answer 1:


I agree with the comment that ddply(thetas, .(model, rho_cnd, pct_cnd), ...) should work. If certain combinations of those variables never occur in the data, ddply's default, .drop=TRUE, ensures that the unobserved combinations don't show up in the result.

However, if you wanted to avoid ddply looking through some of the nonexistent combinations, you could try something like the following:
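To make the point about unobserved combinations concrete, here is a minimal sketch on a toy data frame. The data are made up for illustration (toy, its values, and the sumError column name are assumptions, not the asker's data), and it assumes the plyr package is installed.

```r
library(plyr)

# Hypothetical stand-in for thetas: model "A" occurs with a single
# (rho_cnd, pct_cnd) pair, model "B" with two pairs.
toy <- data.frame(
  model    = factor(c("A", "A", "B", "B", "B", "B")),
  rho_cnd  = c(0, 0, 0.1, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01, 0.01),
  estError = c(1, 2, 3, 4, 5, 6)
)

# .drop = TRUE is already the default; it is spelled out only for emphasis.
sums <- ddply(toy, .(model, rho_cnd, pct_cnd), summarise,
              sumError = sum(estError), .drop = TRUE)
print(sums)
```

A full crossing of the three columns would have 2 x 3 x 2 = 12 cells, but the result should contain only the three combinations that actually occur in toy.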

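# Row-wise alternative, likely slower on a data frame this size:
# newCond <- apply(thetas[, c("model", "rho_cnd", "pct_cnd")], 1, paste, collapse = "_")
# As suggested by baptiste; sep must be part of the argument list given to do.call().
newCond <- do.call(paste, c(thetas[, c("model", "rho_cnd", "pct_cnd")], sep = "_"))
thetas2 <- cbind(thetas, newCond)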

I admit, the above code might run slowly on 2.7 million rows, so I'm not sure it's what you want. But from there you should be able to call ddply() on thetas2 with .variables="newCond".

Furthermore, because you're returning only a single number for each subset of the data, you could just use aggregate, if you wanted.

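# Single-bracket indexing keeps both arguments as data frames, which is what aggregate() expects.
sums <- aggregate(thetas2["estError"], by = thetas2["newCond"], FUN = sum)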
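As a variation on the aggregate idea, the formula interface can group on the three original columns directly, so no pasted key is needed. This is only a sketch on made-up data (toy and its values are assumptions for illustration), using base R alone.

```r
# Hypothetical stand-in for thetas: model "A" occurs with a single
# (rho_cnd, pct_cnd) pair, model "B" with two pairs.
toy <- data.frame(
  model    = factor(c("A", "A", "B", "B", "B", "B")),
  rho_cnd  = c(0, 0, 0.1, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01, 0.01),
  estError = c(1, 2, 3, 4, 5, 6)
)

# Sum estError within each observed (model, rho_cnd, pct_cnd) combination.
sums <- aggregate(estError ~ model + rho_cnd + pct_cnd, data = toy, FUN = sum)
print(sums)
```

Only combinations that occur in the data get a row, so the invalid model/condition pairings never appear in the output.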

I hope this helps.



Source: https://stackoverflow.com/questions/16363834/must-ddply-use-all-possible-combinations-of-the-splitting-variables-or-only-o
