How to most efficiently filter a dataframe conditionnaly of values in another one, in the tidyverse framework?

孤街醉人 提交于 2019-12-13 04:12:01

问题


I have a dataframe df1 with an ID column and a lubridate time interval column, and I want to filter (subsample) a dataframe df2, which has ID and DateTime columns, so that only df2 rows with DateTime fitting the corresponding ID interval in df1 are kept. I want to do so in a tidyverse framework.
It can easily be done using a join (see example below), but I would like to know whether there would be a more direct solution (maybe purrr-based) that would avoid joining and then removing the time-interval data from the second dataframe. Thanks.

The question posted here Merge two dataframes if timestamp of x is within time interval of y is close to the one asked here but proposed solution were similar to the one I developed and not in a tidyverse framework.

A minimal code to show the problem and my current solution:

library(tibble)  
library(lubridate)

df1 <- tribble(
  ~ID, ~Date1, ~Date2,
  "ID1", "2018-04-16", "2018-06-14",
  "ID2", "2018-04-20", "2018-06-25") 
df1 <- mutate(df1,Interval = interval(ymd(Date1),ymd(Date2)))

df2 <- tribble(
  ~ID, ~DateTime,
  "ID1", "2018-04-12",
  "ID1", "2018-05-05",
  "ID2", "2018-04-23",
  "ID2", "2018-07-12")
df2 <- mutate(df2,DateTime=ymd(DateTime)) 

df1 looks like this

> df1
# A tibble: 2 x 4
  ID    Date1      Date2      Interval                      
  <chr> <chr>      <chr>      <S4: Interval>                
1 ID1   2018-04-16 2018-06-14 2018-04-16 UTC--2018-06-14 UTC
2 ID2   2018-04-20 2018-06-25 2018-04-20 UTC--2018-06-25 UTC

and df2 like this:

> df2
# A tibble: 4 x 2
  ID    DateTime  
  <chr> <date>    
1 ID1   2018-04-12
2 ID1   2018-05-05
3 ID2   2018-04-23
4 ID2   2018-07-12

In df2, the second record for ID1 is not within the ID1 interval in df1. The second record for ID2 is also not within the ID2 interval in df1.

My current solution based on joining and the removing the joined column follows:

df_out <- df2 %>%
  left_join(.,df1,by="ID") %>%
  filter(.,DateTime %within% Interval) %>%
  select(.,-Interval)

> df_out
# A tibble: 2 x 4
  ID    DateTime   Date1      Date2     
  <chr> <date>     <chr>      <chr>     
1 ID1   2018-05-05 2018-04-16 2018-06-14
2 ID2   2018-04-23 2018-04-20 2018-06-25

I have the feeling a tidyverse alternative that would avoid joining and then removing the Interval column should exist.


回答1:


There is a package called fuzzyjoin that can do a semi_join based on an interval. Semi join means that it filters the "left" dataframe depending on match to the "right" dataframe. Try:

library(fuzzyjoin)
df2 %>% 
  fuzzy_semi_join(df1, 
                  by=c("DateTime"="Date1", "DateTime"="Date2"),
                  match_fun=list(`>=`, `<=`))

Which gives the result:

# A tibble: 2 x 2
  ID    DateTime  
  <chr> <date>    
1 ID1   2018-05-05
2 ID2   2018-04-23


来源:https://stackoverflow.com/questions/55706637/how-to-most-efficiently-filter-a-dataframe-conditionnaly-of-values-in-another-on

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!