Easier way to calculate conditioned proportions with R data.table?

烈酒焚心 提交于 2021-01-23 08:20:48

问题


Let's say we have this toy example:

library(data.table)
temp <- data.table(first=c("A", "A","A", "A","B","C","C"), 
   sec=c("X", "X","X", "Y","X",  "Z","Z"), stringsAsFactors = T))

first sec
  A   X
  A   X
  A   X
  A   Y
  B   X
  C   Z
  C   Z

I would like to get a third column stating the proportion of times that combination happens among the occurrences of the first column.

I got to do it with data.table in the following way:

temp[,N1:=.N,by=.(first, sec)]
temp[,N2:=.N,by=first]
temp[, prop := N1/N2]
temp[,c("N1","N2"):=NULL]


first sec prop
   A   X 0.75
   A   X 0.75
   A   X 0.75
   A   Y 0.25
   B   X 1.00
   C   Z 1.00
   C   Z 1.00

That means that A,X happens 3 times. A happens 4 times, thus AX happens the 75% of the time the first letter is A.

Couldn't that be done easier?

Kind of

temp[,.N(first,sec)/.N(first)]

回答1:


What about doing grouping inside each first-level group with ave:

temp[, prop := ave(as.numeric(sec), sec, FUN = length) / .N, by = .(first)]
# > temp
#    first sec prop
# 1:     A   X 0.75
# 2:     A   X 0.75
# 3:     A   X 0.75
# 4:     A   Y 0.25
# 5:     B   X 1.00
# 6:     C   Z 1.00
# 7:     C   Z 1.00

Inspired by: https://stackoverflow.com/a/32003058/3926543

Another way: chain commands:

temp[,N:=.N,by=.(first, sec)][, prop := N / .N, by = .(first)][, N := NULL]

when first and sec have many levels and temp has many more rows, chaining commands or the solution in OP's question will be faster than ave solution.




回答2:


Here is an alternative. It is not simpler, in my mind, but sort of embodies the idea that you have. The idea is to count the separate values of sec within first using table, return the proportion and the corresponding value of sec, drop counts that are 0, and then join the returned data.table onto the original.

temp[temp[, {cnt=table(sec); .(sec=names(cnt), prop=c(cnt) / .N)}, by=first][prop > 0],
     on=c("first", "sec")]
   first sec prop
1:     A   X 0.75
2:     A   X 0.75
3:     A   X 0.75
4:     A   Y 0.25
5:     B   X 1.00
6:     C   Z 1.00
7:     C   Z 1.00


来源:https://stackoverflow.com/questions/44801588/easier-way-to-calculate-conditioned-proportions-with-r-data-table

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!