问题
I am having trouble figuring out how to sum rows in a data.table
while omitting the values of a certain group in the process.
Let's say I have a data.table
of the following form:
library(data.table)
dt <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003),
name = c("Tom", "Tom", "Tom", "Tom", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Ann", "Ann", "Ann", "Ann"),
g1 = c(1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1),
g2 = c(1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1),
g3 = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1),
g4 = c(0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1))
setkey(dt, name, year)
where g1
-g4
are indicator variables for games in which the players in name
participated at time year
.
What I want to do is to calculate the number of players for each game NPg1
-NPg4
in which both players participated in the focal game, but only if they also played against each other in another game in the same year and this sum should exclude the player for whom it is being calculated.
I get close using this code modified from how to cumulatively add values in one vector in R e.g for NPg1
:
dtg1 <- dt[,.SD[(g1==1) & (g2==1 | g3==1 | g4==1)][, NPg1:= sum(g1)], by=year]
This subsets the dt
on my conditions and creates the sum, however, the sum includes the focal players. For example NPg1
in year==2000
is 1 for Tom, but it should be 0 because even though he played in g1
he did not play another player in another game in that year. Once I get the sum right, I can then do this for each game and combine the results back into a data.table
. The main question is, how can I get the correct sum with these conditions.
The result for NPg1
should look like this
dtg1$NPg1result <- c(0, 0, 0, 3, 3, 3, 3, 3, 3, 3, 3)
Any ideas would be greatly appreciated.
After @Mike.Gahan's comment:
This is the sub-result for g1
, maybe this does not become very clear form my post. Once I have that correctly I could easily join it back to the full data.table
using:
library(plyr)
dt <- join(dt, dtg1)
or other merge/join operations but since my question is mainly concerned with the sub-result I did not want to bother everyone with the rest.
EDIT after @ Ricardo Saportas solution
The full desired result with all the games looks as follows:
dtresult <- data.table(year = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003, 2000, 2001, 2002, 2003),
name = c("Ann", "Ann", "Ann", "Ann", "Fred", "Fred", "Fred", "Fred", "Gill", "Gill", "Gill", "Gill", "Tom", "Tom", "Tom", "Tom"),
NPg1 = c(0, 1, 3, 3, 0, 0, 3, 3, 0, 0, 3, 3, 0, 1, 3, 3),
NPg2 = c(0, 0, 2, 3, 0, 0, 2, 3, 1, 0, 0, 3, 1, 0, 2, 3),
NPg3 = c(0, 0, 3, 2, 0, 2, 3, 0, 1, 2, 3, 2, 1, 2, 3, 2),
NPg4 = c(0, 0, 2, 2, 0, 1, 0, 0, 0, 1, 2, 2, 0, 0, 2, 2))
回答1:
One approach is to do a cartesian join on the year-g1-g2-..-gn
combinations.
Then on the new table, you can "ignore the rows" [see note at bottom] that do not qualify -- namely, players playing against themselves, and those player-combinations that only played one game.
setkeyv(dt, c("year", games))
dt.merged <- merge(dt, dt, all=TRUE, allow.cartesian=TRUE, suffixes=c("", ".y"))
## ignore players playing against themselves
dt.merged[name != name.y, (games) := 0 ]
## ignore player combinations that only shared one game
dt.merged[ (rowSums(dt.merged[, games, with=FALSE]) <= 1) , (games) := 0 ]
## now just sum itup
results <- dt.merged[, lapply(.SD, sum), keyby=list(year, name), .SDcols=games]
## clean up the names
setnames(results, games, paste0("NP", games))
Which yields
results
year name g1 g2 g3 g4
1: 2000 Ann 0 0 0 0
2: 2000 Fred 0 0 0 0
3: 2000 Gill 0 1 1 1
4: 2000 Tom 1 1 1 0
5: 2001 Ann 1 1 0 0
6: 2001 Fred 0 0 1 1
7: 2001 Gill 0 0 1 1
8: 2001 Tom 1 0 1 0
9: 2002 Ann 1 1 1 1
10: 2002 Fred 1 1 1 0
11: 2002 Gill 1 0 1 1
12: 2002 Tom 1 1 1 1
13: 2003 Ann 1 1 1 1
14: 2003 Fred 1 1 0 0
15: 2003 Gill 1 1 1 1
16: 2003 Tom 1 1 1 1
Note that you have two options to "ignore the row"
If you want to preserve the "0" count for the year-player, then use
dt.merged[ <filter>, (games) := 0 ]
If you do not care for the "0" count for the year-player, then use
dt.merged <- dt.merged[ ! <filter> ]
来源:https://stackoverflow.com/questions/25516701/sum-by-group-with-multiple-logical-conditions-while-omitting-values-from-sum-r-d