Find values in a given interval without a vector scan

拈花ヽ惹草 提交于 2019-12-10 18:08:17

问题


With a the R package data.table is it possible to find the values that are in a given interval without a full vector scan of the data. For example

>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
   x
1: 3
2: 5
3: 8

Where DT can be a very big table.

Bonus question: is it possible to do the same thing for a set of non-overlapping intervals such as

>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
   i min max
1: 1   3  10
2: 2  20  40
> my.data.table.function2(DT,I)
   i  x
1: 1  3
2: 1  5
3: 1  8
4: 2 21
5: 2 34

Where both I and DT can be very big. Thanks a lot


回答1:


First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.

That out of the way, we need a bigger example to exhibit difference from vector scan approach:

require(data.table)

n <- 1e5L
f <- 10L
ni <- n / f

set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni, 
                 min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
                 max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))

DT, the Data Table is a not-too-random subset of 1:n. IT, the Interval Table is ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:

system.time({
  ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
 ##  user  system elapsed 
 ## 84.15    4.48   88.78

One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:

system.time({
  # Save time if DT is already keyed correctly
  if(!identical(key(DT), "x")) setkey(DT, x)

  DT[, row := .I]

  setkey(IT, min)

  target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]

  # Non-overlapping intervals => (sorted by min => sorted by max)
  setattr(IT, "sorted", "max")

  target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]

  target <- target.low[target.high, nomatch = 0]
  target[, len := max - min + 1L]


  rm(target.low, target.high)

  ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
  ans.roll[, row := NULL]
  setcolorder(ans.roll, c("i", "x"))
})
 ## user  system elapsed 
 ## 0.12    0.00    0.12

Ensuring the same row order verifies the result:

setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE



回答2:


Here is a variation of the code proposed by @user1935457 (see comment in @user1935457 post)

system.time({

 if(!identical(key(DT), "x")) setkey(DT, x)
 setkey(IT, min)

 #below is the line that differs from @user1935457 
 #Using IT to address the lines of DT creates a smaller intermediate table
 #We can also directly use .I 
 target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
 setattr(IT, "sorted", "max")

 # same here
 target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
 target <- target.low[target.high, nomatch = 0]
 target[, len := max - min + 1L]

 rm(target.low, target.high)
 ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
 setcolorder(ans.roll2, c("i", "x"))
})
#    user  system elapsed 
#    0.07    0.00    0.06 


system.time({ 
 # @user1935457 code
 })
#    user  system elapsed 
#    0.08    0.00    0.08 

identical(ans.roll2, ans.roll)
#[1] TRUE

The performance gain is not huge here, but it shall be more sensitive with larger DT and smaller IT. thanks again to @user1935457 for your answer.




回答3:


If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table :

DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")

Then you can use %between% :

R> DT[x %between% c(3,10),]
   x
1: 3
2: 5
3: 8

R> DT[x %between% c(3,10) | x %between% c(20,40),]
    x
1:  3
2:  5
3:  8
4: 21
5: 34

EDIT : As @mnel pointed out, %between% still does vector scans. The Note section of the help page says :

Current implementation does not make use of ordered keys.

So this doesn't answer your question.



来源:https://stackoverflow.com/questions/16666183/find-values-in-a-given-interval-without-a-vector-scan

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!