Find values in a given interval without a vector scan

Question

With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:

>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
   x
1: 3
2: 5
3: 8

Here DT can be a very big table.

Bonus question: is it possible to do the same thing for a set of non-overlapping intervals, such as:

>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
   i min max
1: 1   3  10
2: 2  20  40
> my.data.table.function2(DT,I)
   i  x
1: 1  3
2: 1  5
3: 1  8
4: 2 21
5: 2 34

Here both I and DT can be very big. Thanks a lot.

Answer 1:

First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.

With that out of the way, we need a bigger example to show the difference from the vector scan approach:

require(data.table)

n <- 1e5L
f <- 10L
ni <- n / f

set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni, 
                 min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
                 max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))

DT, the data table, is a not-too-random subset of 1:n. IT, the interval table, holds ni = n / 10 non-overlapping intervals in 1:n. Repeating the vector scan for each of the ni intervals takes a while:

system.time({
  ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
 ##  user  system elapsed 
 ## 84.15    4.48   88.78

One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:

system.time({
  # Save time if DT is already keyed correctly
  if(!identical(key(DT), "x")) setkey(DT, x)

  DT[, row := .I]

  setkey(IT, min)

  target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]

  # Non-overlapping intervals => (sorted by min => sorted by max)
  setattr(IT, "sorted", "max")

  target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]

  target <- target.low[target.high, nomatch = 0]
  target[, len := max - min + 1L]


  rm(target.low, target.high)

  ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
  ans.roll[, row := NULL]
  setcolorder(ans.roll, c("i", "x"))
})
 ## user  system elapsed 
 ## 0.12    0.00    0.12

After giving both results the same row order, identical() confirms that they match:

setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
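
Since vecseq is internal, it may help to see the same index expansion written with exported base R functions only. The following is a minimal sketch, not the code that was timed above: it assumes R >= 4.0.0, where sequence() accepts a from argument, and the small target data frame is a hypothetical stand-in for the target table built in the answer.

```r
# Hypothetical stand-in for the `target` table built above:
# starting row and run length for each interval id.
target <- data.frame(i   = c(1L, 2L, 3L),
                     min = c(4L, 8L, 10L),
                     len = c(3L, 2L, 0L))

# Expand each (min, len) pair into consecutive row indices.
# Here: rows 4:6 and 8:9; the zero-length interval contributes nothing.
rows <- sequence(target$len, from = target$min)

# One interval id per selected row, replacing unlist(mapply(rep, ...)).
ids <- rep(target$i, times = target$len)

print(rows)  # 4 5 6 8 9
print(ids)   # 1 1 1 2 2
```

With these two vectors, the last step of the answer could presumably be written as DT[rows][, i := ids], which avoids relying on a non-exported function.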

Answer 2:

Here is a variation of the code proposed by @user1935457 (see the comment on @user1935457's post):

system.time({

 if(!identical(key(DT), "x")) setkey(DT, x)
 setkey(IT, min)

 #below is the line that differs from @user1935457 
 #Using IT to address the lines of DT creates a smaller intermediate table
 #We can also directly use .I 
 target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
 setattr(IT, "sorted", "max")

 # same here
 target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
 target <- target.low[target.high, nomatch = 0]
 target[, len := max - min + 1L]

 rm(target.low, target.high)
 ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
 setcolorder(ans.roll2, c("i", "x"))
})
#    user  system elapsed 
#    0.07    0.00    0.06 


system.time({ 
 # @user1935457 code
 })
#    user  system elapsed 
#    0.08    0.00    0.08 

identical(ans.roll2, ans.roll)
#[1] TRUE

The performance gain is not huge here, but it should become more noticeable with a larger DT and a smaller IT. Thanks again to @user1935457 for the answer.

Answer 3:

If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table:

DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")

Then you can use %between%:

R> DT[x %between% c(3,10),]
   x
1: 3
2: 5
3: 8

R> DT[x %between% c(3,10) | x %between% c(20,40),]
    x
1:  3
2:  5
3:  8
4: 21
5: 34

EDIT: As @mnel pointed out, %between% still does vector scans. The Note section of the help page says:

Current implementation does not make use of ordered keys.

So this doesn't answer your question.
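
For comparison, one way to locate the interval endpoints by binary search instead of comparing every element is base R's findInterval on a sorted vector. This is a minimal sketch under stated assumptions, not a data.table feature: the helper name rows_in_intervals and the data frame iv are hypothetical, left.open needs R >= 3.3.0, and the from argument of sequence() needs R >= 4.0.0. Note also that findInterval by default first checks that its vector is sorted, which is itself a linear pass; only the lookups are binary searches.

```r
# Hypothetical helper: rows of a sorted vector `x` whose values lie in the
# closed intervals [lo[k], hi[k]], with both endpoints found by binary search.
rows_in_intervals <- function(x, lo, hi) {
  first <- findInterval(lo, x, left.open = TRUE) + 1L  # first index with x >= lo
  last  <- findInterval(hi, x)                         # last index with x <= hi
  len   <- pmax(last - first + 1L, 0L)                 # 0 when nothing falls inside
  data.frame(k   = rep(seq_along(lo), times = len),    # which interval matched
             row = sequence(len, from = first))        # matching row of x
}

x <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89)  # must already be sorted

# Single interval [3, 10]
hit <- rows_in_intervals(x, 3, 10)
print(x[hit$row])  # 3 5 8

# The non-overlapping intervals from the bonus question
iv   <- data.frame(i = c(1, 2), min = c(3, 20), max = c(10, 40))
hits <- rows_in_intervals(x, iv$min, iv$max)
res  <- data.frame(i = iv$i[hits$k], x = x[hits$row])
print(res)
```

Because findInterval is vectorised over the query points, the same two calls handle one interval or a whole table of intervals, which is the same two-endpoint idea as the rolling joins in the first answer.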

Source: https://stackoverflow.com/questions/16666183/find-values-in-a-given-interval-without-a-vector-scan

Tags

data.table

intervals