Find which interval row in a data frame each element of a vector belongs in

我们两清 submitted on 2019-11-27 01:47:02
David Arenburg

Here's a possible solution using the new "non-equi" joins in data.table (v>=1.9.8). While I doubt you'll like the syntax, it should be a very efficient solution.

Also, regarding findInterval, this function assumes continuity in your intervals, while this isn't the case here, so I doubt there is a straightforward solution using it.

library(data.table) #v1.10.0
setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
#    phase start end
# 1:     a   0.1 0.1
# 2:     a   0.2 0.2
# 3:     a   0.5 0.5
# 4:    NA   0.9 0.9
# 5:     b   1.1 1.1
# 6:     b   1.9 1.9
# 7:     c   2.1 2.1

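Regarding the above code, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on argument. That's pretty much it. Note that in the result the start and end columns hold the matched values from elements, not the interval bounds.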
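For readers who find the on syntax opaque, the join condition can be spelled out in base R. This is only a sketch of what the join matches, not of how data.table implements it. It assumes the example data from this thread (rebuilt here as a plain data frame) and that the intervals do not overlap, so taking the first hit per element is enough. The names hit, idx and matched_phase are made up for illustration.

```r
# Example data as used in this thread, rebuilt as a plain data frame.
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

# hit[i, j] is TRUE when elements[i] lies in [start[j], end[j]].
hit <- outer(elements, intervals$start, `>=`) &
  outer(elements, intervals$end, `<=`)

# First matching interval per element, or NA when nothing matches
# (assumes non-overlapping intervals).
idx <- apply(hit, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)
matched_phase <- intervals$phase[idx]
matched_phase
```

The logical matrix has one cell per element/interval pair, so this is only reasonable for small inputs; the join above avoids materialising all pairs.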

There is one caveat here, though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
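Coming back to the findInterval remark earlier in this answer: the sketch below shows why the gaps between intervals matter, and how an extra check against end can compensate. It assumes the intervals are sorted by start and do not overlap. The data frame is the thread's example data rebuilt by hand, and the names naive and checked are made up for illustration.

```r
# Example data as used in this thread, rebuilt as a plain data frame.
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

# findInterval on the start values alone treats the intervals as contiguous:
# each element goes to the last interval whose start is <= the element.
idx <- findInterval(elements, intervals$start)
idx[idx == 0] <- NA            # an element below the first start has no interval
naive <- intervals$phase[idx]  # 0.9 lands in "a" although "a" ends at 0.5

# Compensate for the gaps by also requiring element <= end of that interval.
checked <- ifelse(elements <= intervals$end[idx], naive, NA)
checked
```

With this check the intervals are treated as closed on both ends, which matches the non-equi join condition above.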

thelatemail

cut is possibly useful here.

out <- cut(elements, t(intervals[c("start","end")]))
levels(out)[c(FALSE,TRUE)]  <- NA
intervals$phase[out]
#[1] "a" "a" "a" NA  "b" "b" "c"

Ben

David Arenburg's mention of non-equi joins was very helpful for understanding what general kind of problem this is (thanks!). I can see now that it's not implemented for dplyr. Thanks to this answer, I see that there is a fuzzyjoin package that can do it in the same idiom. But it's barely any simpler than my map solution above (though more readable, in my view), and doesn't hold a candle to thelatemail's cut answer for brevity.

For my example above, the fuzzyjoin solution would be

library(fuzzyjoin)
library(tidyverse)

fuzzy_left_join(data.frame(elements), intervals, 
                by = c("elements" = "start", "elements" = "end"), 
                match_fun = list(`>=`, `<=`)) %>% 
  distinct()

Which gives:

    elements phase start end
1      0.1     a     0   0.5
2      0.2     a     0   0.5
3      0.5     a     0   0.5
4      0.9  <NA>    NA    NA
5      1.1     b     1   1.9
6      1.9     b     1   1.9
7      2.1     c     2   2.5

Inspired by @thelatemail's cut solution, here is one using findInterval which still requires a lot of typing:

out <- findInterval(elements, t(intervals[c("start","end")]), left.open = TRUE)
out[!(out %% 2)] <- NA
intervals$phase[out %/% 2L + 1L]
#[1] "a" "a" "a" NA  "b" "b" "c"

Caveat: as used here, cut and findInterval treat the intervals as left-open, so an element equal to a start value is not matched. Therefore, the solutions using cut and findInterval are not equivalent to Ben's using intrval, David's non-equi join using data.table, and my other solution using foverlaps.
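The difference only shows up for elements that sit exactly on a start value. A small sketch, assuming the thread's example intervals (rebuilt by hand) and a made-up test vector boundary:

```r
# Example intervals as used in this thread, rebuilt as a plain data frame.
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)
boundary <- c(0, 1, 2)  # hypothetical test values, each equal to a start

# Closed comparison on both ends: every boundary value is matched.
closed <- vapply(boundary, function(x) {
  p <- intervals$phase[x >= intervals$start & x <= intervals$end]
  if (length(p) == 0) NA_character_ else p[1]
}, character(1))

# cut() with its default right = TRUE builds (lower, upper] bins,
# so a value equal to a start falls into the gap bin (or outside all bins).
out <- cut(boundary, t(intervals[c("start", "end")]))
levels(out)[c(FALSE, TRUE)] <- NA
left_open <- intervals$phase[out]

closed     # "a" "b" "c"
left_open  # NA NA NA
```

Whether this matters depends on whether the data can contain values equal to an interval's start.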

Just lapply works:

l <- lapply(elements, function(x){
    intervals$phase[x >= intervals$start & x <= intervals$end]
})

str(l)
## List of 7
##  $ : chr "a"
##  $ : chr "a"
##  $ : chr "a"
##  $ : chr(0) 
##  $ : chr "b"
##  $ : chr "b"
##  $ : chr "c"

or in purrr, if you purrrfurrr,

elements %>% 
    map(~intervals$phase[.x >= intervals$start & .x <= intervals$end]) %>% 
    # Clean up a bit. Shorter, but less readable: map_chr(~.x[1] %||% NA)
    map_chr(~ifelse(length(.x) == 0, NA_character_, .x))
## [1] "a" "a" "a" NA  "b" "b" "c"

Here is kind of a "one-liner" which (mis-)uses foverlaps from the data.table package, but David's non-equi join is still more concise:

library(data.table) #v1.10.0
foverlaps(data.table(start = elements, end = elements), 
          setDT(intervals, key = c("start", "end")))
#   phase start end i.start i.end
#1:     a     0 0.5     0.1   0.1
#2:     a     0 0.5     0.2   0.2
#3:     a     0 0.5     0.5   0.5
#4:    NA    NA  NA     0.9   0.9
#5:     b     1 1.9     1.1   1.1
#6:     b     1 1.9     1.9   1.9
#7:     c     2 2.5     2.1   2.1

For completeness' sake, here is another way, using the intervals package:

library(tidyverse)
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

intervalsDF <- 
  tribble(  ~phase, ~start, ~end,   # tribble() replaces the deprecated frame_data()
               "a",     0,      0.5,
               "b",     1,      1.9,
               "c",     2,      2.5
  )

library(intervals)
library(rlist)

# tibble() replaces the deprecated data_frame()
interval_overlap(
  Intervals(intervalsDF %>% select(-phase) %>% as.matrix, closed = c(TRUE, TRUE)),
  Intervals(tibble(start = elements, end = elements), closed = c(TRUE, TRUE))
) %>% 
  list.map(tibble(interval_index = .i, element_index = .)) %>% 
  do.call(what = bind_rows)

# A tibble: 6 × 2
#  interval_index element_index
#           <int>         <int>
#1              1             1
#2              1             2
#3              1             3
#4              2             5
#5              2             6
#6              3             7