Sum by group with multiple logical conditions while omitting values from sum R data.table

不问归期 submitted on 2019-12-11 09:28:24

Question


I am having trouble figuring out how to sum rows in a data.table while omitting the values of a certain group in the process.

Let's say I have a data.table of the following form:

library(data.table)
dt <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003), 
               name = c("Tom", "Tom", "Tom", "Tom", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Ann", "Ann", "Ann", "Ann"),
               g1 = c(1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1),
               g2 = c(1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
               g3 = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1),
               g4 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1))

setkey(dt, name, year)

where g1-g4 are indicator variables for games in which the players in name participated at time year.

What I want to calculate, for each game, is the number of players (NPg1-NPg4) who participated in the focal game together with the focal player, but only if the two also played against each other in another game in the same year. The sum should exclude the player for whom it is being calculated.

I get close using this code, modified from "how to cumulatively add values in one vector in R", e.g. for NPg1:

dtg1 <- dt[,.SD[(g1==1) & (g2==1 | g3==1 | g4==1)][, NPg1:= sum(g1)], by=year]

This subsets dt on my conditions and creates the sum; however, the sum includes the focal player. For example, NPg1 in year==2000 is 1 for Tom, but it should be 0: even though he played in g1, he did not play another g1 player in any other game that year. Once I get the sum right, I can do this for each game and combine the results back into a data.table. The main question is: how can I get the correct sum with these conditions?

The result for NPg1 should look like this:

dtg1$NPg1result <- c(0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3)

Any ideas would be greatly appreciated.

After @Mike.Gahan's comment:

This is only the sub-result for g1; maybe that was not very clear from my post. Once I have that right, I could easily join it back to the full data.table using:

library(plyr)
dt <- join(dt, dtg1)

or other merge/join operations, but since my question is mainly concerned with the sub-result, I did not want to bother everyone with the rest.

EDIT after @Ricardo Saporta's solution

The full desired result with all the games looks as follows:

dtresult <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003), 
                   name = c("Ann", "Ann", "Ann", "Ann", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Tom", "Tom", "Tom", "Tom"), 
                   NPg1 = c(0, 1, 3, 3, 0, 0, 3, 3, 0, 0, 3, 3, 0, 1, 3, 3), 
                   NPg2 = c(0, 0, 2, 3, 0, 0, 2, 3, 1, 0, 0, 3, 1, 0, 2, 3), 
                   NPg3 = c(0, 0, 3, 2, 0, 2, 3, 0, 1, 2, 3, 2, 1, 2, 3, 2), 
                   NPg4 = c(0, 0, 2, 2, 0, 1, 0, 0, 0, 1, 2, 2, 0, 0, 2, 2))

Answer 1:


One approach is to do a cartesian join of the table with itself on year, so that every player is paired with every player of the same year.

Then on the new table, you can "ignore the rows" [see note at bottom] that do not qualify -- namely, players playing against themselves, and those player-combinations that only shared one game.

games <- paste0("g", 1:4)
## pair every player with every player of the same year
dt.merged <- merge(dt, dt, by="year", allow.cartesian=TRUE, suffixes=c("", ".y"))
## a game counts for a pair only if both players took part in it
for (g in games) set(dt.merged, j=g, value=dt.merged[[g]] * dt.merged[[paste0(g, ".y")]])
## ignore players playing against themselves
dt.merged[name == name.y, (games) := 0 ]
## ignore player combinations that only shared one game
dt.merged[ (rowSums(dt.merged[, games, with=FALSE]) <= 1) , (games) := 0 ]
## now just sum it up
results <- dt.merged[, lapply(.SD, sum), keyby=list(year, name), .SDcols=games]
## clean up the names
setnames(results, games, paste0("NP", games))

Which yields

results

    year name NPg1 NPg2 NPg3 NPg4
 1: 2000  Ann    0    0    0    0
 2: 2000 Fred    0    0    0    0
 3: 2000 Gill    0    1    1    0
 4: 2000  Tom    0    1    1    0
 5: 2001  Ann    0    0    0    0
 6: 2001 Fred    0    0    1    1
 7: 2001 Gill    0    0    1    1
 8: 2001  Tom    0    0    0    0
 9: 2002  Ann    3    2    3    2
10: 2002 Fred    3    2    3    0
11: 2002 Gill    3    0    3    2
12: 2002  Tom    3    2    3    2
13: 2003  Ann    3    3    2    2
14: 2003 Fred    3    3    0    0
15: 2003 Gill    3    3    2    2
16: 2003  Tom    3    3    2    2
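The pairwise rule in the question's text and its dtresult table seem to differ for 2001: Ann and Tom share only g1 that year, yet dtresult gives both NPg1 = 1. dtresult appears to follow a looser rule: a player qualifies for a game if they played it plus at least one other game that year, and is credited with the number of other qualifying players. That reading is an assumption, not something the question states outright. Under it, no join is needed. A minimal sketch (the names res, n_played and qual are illustrative):

```r
library(data.table)

# Same data as in the question, written compactly
dt <- data.table(
  year = rep(2000:2003, times = 4),
  name = rep(c("Tom", "Fred", "Gill", "Ann"), each = 4),
  g1 = c(1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1),
  g2 = c(1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
  g3 = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1),
  g4 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1)
)
setkey(dt, name, year)
games <- paste0("g", 1:4)

res <- copy(dt)
# Number of games each player took part in that year
n_played <- rowSums(as.matrix(res[, games, with = FALSE]))

for (g in games) {
  # 1 if the player played game g and at least one other game that year
  res[, qual := as.integer(res[[g]] == 1 & n_played >= 2)]
  # Qualifying players in the year minus the focal player; 0 for non-qualifiers
  res[, (paste0("NP", g)) := qual * (sum(qual) - 1L), by = year]
}
res[, qual := NULL]      # drop the helper column
res[, (games) := NULL]   # keep only year, name and the NP columns
```

Sorted by name and year, res should line up with dtresult column by column. If the stricter pairwise rule is what is wanted, this shortcut does not apply, because it never checks which specific players met each other.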

Note that you have two options to "ignore the row":

If you want to preserve the "0" count for the year-player, then use

dt.merged[ <filter>,  (games) := 0 ]

If you do not care for the "0" count for the year-player, then use

dt.merged <- dt.merged[ ! <filter> ]
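The difference between the two options only shows up in the grouped summary. A small sketch on a made-up toy table (toy, drop_row and the column values are hypothetical; drop_row stands in for <filter>):

```r
library(data.table)

# Hypothetical toy table: Ann's only row fails the filter, Tom's rows pass
toy <- data.table(year = c(2000, 2000, 2000),
                  name = c("Ann", "Tom", "Tom"),
                  g1   = c(1, 1, 1))
drop_row <- toy$name == "Ann"   # stand-in for <filter>

# Option 1: zero the values but keep the row, so Ann still appears with a 0
zeroed <- copy(toy)
zeroed[drop_row, g1 := 0]
with_zero <- zeroed[, .(NPg1 = sum(g1)), keyby = .(year, name)]

# Option 2: drop the row, so Ann disappears from the summary altogether
without_zero <- toy[!drop_row][, .(NPg1 = sum(g1)), keyby = .(year, name)]
```

Here with_zero has one row per player-year, including Ann with NPg1 = 0, while without_zero contains only Tom. The first form is the one to use if the result has to be joined back to the full table without creating missing values.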


Source: https://stackoverflow.com/questions/25516701/sum-by-group-with-multiple-logical-conditions-while-omitting-values-from-sum-r-d
