Question
I recently started using the data.table package in R, and I find it very convenient for transforming and aggregating data. One thing I am missing is how to transform data that are defined across multiple rows. Do I need to reshape the data.frame/data.table into wide format first?
Say you have the following data table:
library(data.table)
dt <- data.table(group = c("a", "a", "a", "b", "b", "b"),
                 subg = c("f1", "f2", "f3", "f1", "f2", "f3"),
                 counts = c(3, 4, 5, 8, 9, 10))
and for each group you want to calculate the relative frequency of each subgroup (c1/(c1+c2+c3)) and other properties as functions of c1, c2, c3 (where c1, c2, c3 are the counts associated with f1, f2 and f3).
I can see how to transform the data table into wide format and then apply the transformation. Is there any way to calculate this directly in long format (ideally using data.table)?
In general, the group and the subgroup could each be defined by multiple factors.
Answer 1:
If I understand the OP correctly, you want something like this:
dt[, {bigN = .N; .SD[, .N / bigN, by = subg]}, by = group]
or maybe (and very similarly) this:
dt[, {counts.sum = sum(counts); .SD[, counts / counts.sum, by = subg]},
by = group]
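As a complement to the two snippets above, here is a minimal sketch of the same calculation using data.table's `:=` operator, which adds the column in place while keeping the long format. The column name rel.count and the property ratio_f1_f3 (c1/c3) are illustrative assumptions, not something from the question.

```r
library(data.table)

dt <- data.table(group  = c("a", "a", "a", "b", "b", "b"),
                 subg   = c("f1", "f2", "f3", "f1", "f2", "f3"),
                 counts = c(3, 4, 5, 8, 9, 10))

# Add the relative frequency by reference; sum(counts) is evaluated
# separately within each group, so no reshaping to wide format is needed.
dt[, rel.count := counts / sum(counts), by = group]

# Other functions of c1, c2, c3 can pick counts by subgroup label.
# ratio_f1_f3 is a hypothetical property (c1 / c3), one row per group.
res <- dt[, .(ratio_f1_f3 = counts[subg == "f1"] / counts[subg == "f3"]),
          by = group]
```

If the group is defined by several factors, the same pattern should work with a multi-column grouping such as by = .(group1, group2), where group1 and group2 are placeholder column names.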
Answer 2:
If you are using a data.frame, you can use ddply from the plyr package (a two-step approach):
library(plyr)
dt1 <- ddply(dt, .(group), transform, gcount = sum(counts))  # gcount = sum of counts within each group
> dt1
  group subg counts gcount
1     a   f1      3     12
2     a   f2      4     12
3     a   f3      5     12
4     b   f1      8     27
5     b   f2      9     27
6     b   f3     10     27
dt2 <- ddply(dt1, .(group, subg), transform, rel.count = counts / gcount)  # rel.count = relative frequency
> dt2
  group subg counts gcount rel.count
1     a   f1      3     12 0.2500000
2     a   f2      4     12 0.3333333
3     a   f3      5     12 0.4166667
4     b   f1      8     27 0.2962963
5     b   f2      9     27 0.3333333
6     b   f3     10     27 0.3703704
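The two steps above can probably be collapsed into one, since transform is evaluated on each group's rows separately and sum(counts) is therefore already the group total. A minimal sketch, assuming a plain data.frame named df (a name introduced here for illustration) with the same data as in the question:

```r
library(plyr)

df <- data.frame(group  = c("a", "a", "a", "b", "b", "b"),
                 subg   = c("f1", "f2", "f3", "f1", "f2", "f3"),
                 counts = c(3, 4, 5, 8, 9, 10))

# Split by group, then divide each count by that group's total in one call.
res <- ddply(df, .(group), transform, rel.count = counts / sum(counts))
```

This skips the intermediate gcount column; keep the two-step version if the group totals are needed later.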
Source: https://stackoverflow.com/questions/18110933/how-to-integrate-properties-defined-on-multiple-rows-using-a-data-frame-or-data