How to perform a join based on intervals with dplyr?

徘徊边缘 提交于 2019-12-12 05:43:16

问题


I have a data frame containing two columns: a grouping variable and a interval period over which the grouping variable holds. I have another data frame with a date column and a value column. How can I join these two tables together somewhat efficiently with dplyr+tidyverse functions?

library(dplyr)
library(lubridate)
ty <- data_frame(date = mdy(paste(1, 1 + seq(20), 2017, sep = "/")), 
                 y = c(rnorm(7), rnorm(7, mean = 2), rnorm(6, mean = -1)))
gy <- data_frame(period = interval(mdy(c("01/01/2017", "01/08/2017", "01/15/2017")), 
                                   mdy(c("01/07/2017", "01/14/2017", "01/20/2017"))), 
                          batch = c(1, 2, 3))

I want to build the table that is equivalent to:

ty %>% mutate(batch = c(rep(1, 7), rep(2, 7), rep(3, 6)))

Ideally, this should work reasonably quickly on data sets of up to 1,000,000 rows. Better still if it works on 100,000,000 :).


回答1:


How about:

ty %>% 
  mutate(batch = case_when(
  ty$date %within% gy$period[1] ~gy$batch[1],
  ty$date %within% gy$period[2] ~gy$batch[2],
  ty$date %within% gy$period[3] ~gy$batch[3]))

You would obviously need to define the case_when intervals. How many have you got? I've used cat and paste0 with good effect for that in the past.

Edited to reflect OP's comments. This should take care of the NSE and would allow the generation of the case_when intervals programatically:

ty %>%
  mutate(batch = eval(parse(text = paste0("case_when(",
                                      paste(
                                        paste0(
                                          "ty$date %within% gy$period[",
                                          seq_along(gy$period),
                                          "] ~gy$batch[",
                                          seq_along(gy$period),
                                          "]"
                                        ),
                                        collapse = ", "
                                      ), ")"))))



回答2:


This is the best I could come up with so far:

ty$batch <- unlist(lapply(ty$date, function(d) gy$batch[which(d %within% gy$period)]), recursive = FALSE, use.names = FALSE)

But it doesn't seem very fast.



来源:https://stackoverflow.com/questions/44398318/how-to-perform-a-join-based-on-intervals-with-dplyr

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!