Using dplyr for frequency counts of interactions, must include zero counts

蓝咒 提交于 2019-12-02 17:45:43

Here's a simple option, using data.table instead:

library(data.table)

dt = as.data.table(your_df)

setkey(dt, id, date)

# in versions 1.9.3+
dt[CJ(unique(id), unique(date)), .N, by = .EACHI]
#          id       date N
# 1: Andrew13 2006-08-03 0
# 2: Andrew13 2007-09-11 1
# 3: Andrew13 2008-06-12 0
# 4: Andrew13 2008-10-11 0
# 5: Andrew13 2009-07-03 0
# 6:   John12 2006-08-03 1
# 7:   John12 2007-09-11 0
# 8:   John12 2008-06-12 0
# 9:   John12 2008-10-11 0
#10:   John12 2009-07-03 0
#11:  Lisa825 2006-08-03 0
#12:  Lisa825 2007-09-11 0
#13:  Lisa825 2008-06-12 0
#14:  Lisa825 2008-10-11 0
#15:  Lisa825 2009-07-03 1
#16:  Tom2993 2006-08-03 0
#17:  Tom2993 2007-09-11 0
#18:  Tom2993 2008-06-12 1
#19:  Tom2993 2008-10-11 1
#20:  Tom2993 2009-07-03 0

In versions 1.9.2 or before the equivalent expression omits the explicit by:

dt[CJ(unique(id), unique(date)), .N]

The idea is to create all possible pairs of id and date (which is what the CJ part does), and then merge it back, counting occurrences.

docendo discimus

This is how you could do it, although I use dplyr only in part to calculate the frequencies in your original df and for the left_join. As you already suggested in your question, I created a new data.frame and merged it with the existing. I guess if you want to do it exclusively in dplyr that would require you to somehow rbind many rows in the process and I assume this way might be faster than the other.

require(dplyr)

original <- read.table(header=T,text="    id         date
John12     2006-08-03
Tom2993    2008-10-11
Lisa825    2009-07-03
Tom2993    2008-06-12
Andrew13   2007-09-11", stringsAsFactors=F)

original$date <- as.Date(original$date) #convert to date

#get the frequency in original data in new column and summarize in a single row per group
original <- original %>%
  group_by(id, date) %>%
  summarize(count = n())            

#create a sequence of date as you need it
dates <- seq(as.Date("2006-01-01"), as.Date("2009-12-31"), 1)    

#create a new df with expand.grid to get all combinations of date/id
newdf <- expand.grid(id = original$id, date = dates)     

#remove dates
rm(dates)

#join original and newdf to have the frequency counts from original df
newdf <- left_join(newdf, original, by=c("id","date"))   

#replace all NA with 0 for rows which were not in original df
newdf$count[is.na(newdf$count)] <- 0          
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!