Question
With the R package data.table, is it possible to find the values that lie in a given interval without a full vector scan of the data? For example:
>DT<-data.table(x=c(1,1,2,3,5,8,13,21,34,55,89))
>my.data.table.function(DT,min=3,max=10)
x
1: 3
2: 5
3: 8
where DT can be a very big table.
Bonus question: is it possible to do the same thing for a set of non-overlapping intervals such as
>I<-data.table(i=c(1,2),min=c(3,20),max=c(10,40))
>I
i min max
1: 1 3 10
2: 2 20 40
> my.data.table.function2(DT,I)
i x
1: 1 3
2: 1 5
3: 1 8
4: 2 21
5: 2 34
where both I and DT can be very big.
Thanks a lot
Answer 1:
First of all, vecseq isn't exported as a visible function from data.table, so its syntax and/or behavior here could change without warning in future updates to the package. Also, this is untested besides the simple identical check at the end.
With that out of the way, we need a bigger example to show the difference from the vector scan approach:
require(data.table)
n <- 1e5L
f <- 10L
ni <- n / f
set.seed(54321)
DT <- data.table(x = 1:n + sample(-f:f, n, replace = TRUE))
IT <- data.table(i = 1:ni,
min = seq(from = 1L, to = n, by = f) + sample(0:4, ni, replace = TRUE),
max = seq(from = 1L, to = n, by = f) + sample(5:9, ni, replace = TRUE))
DT, the data table, holds values close to 1:n (each element of 1:n shifted by a random offset in -f:f). IT, the interval table, holds ni = n / 10 non-overlapping intervals in 1:n. Doing the repeated vector scan on all ni intervals takes a while:
system.time({
ans.vecscan <- IT[, DT[x >= min & x <= max], by = i]
})
## user system elapsed
## 84.15 4.48 88.78
One can do two rolling joins on the interval endpoints (see the roll argument in ?data.table) to get everything in one swoop:
system.time({
# Save time if DT is already keyed correctly
if(!identical(key(DT), "x")) setkey(DT, x)
DT[, row := .I]
setkey(IT, min)
target.low <- IT[DT, roll = Inf, nomatch = 0][, list(min = row[1]), keyby = i]
# Non-overlapping intervals => (sorted by min => sorted by max)
setattr(IT, "sorted", "max")
target.high <- IT[DT, roll = -Inf, nomatch = 0][, list(max = last(row)), keyby = i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
ans.roll[, row := NULL]
setcolorder(ans.roll, c("i", "x"))
})
## user system elapsed
## 0.12 0.00 0.12
Ensuring the same row order verifies the result:
setkey(ans.vecscan, i, x)
setkey(ans.roll, i, x)
identical(ans.vecscan, ans.roll)
## [1] TRUE
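The idea behind the two rolling joins can be illustrated without data.table: on a sorted vector, each interval endpoint is located by binary search, and the matching rows are the consecutive positions between the two hits. Below is a minimal sketch in base R using findInterval, run on the small example from the question. The helper name find_in_intervals and its arguments are made up for this illustration, and this is not the code timed above. Note that findInterval also checks that its input is sorted, which is itself one linear pass, so the gain is one pass in total rather than one scan per interval.

```r
# Sketch only: base R, no data.table. `find_in_intervals` is a hypothetical helper.
# x must be sorted ascending (findInterval raises an error otherwise).
# Intervals are closed on both sides: lower <= x <= upper.
find_in_intervals <- function(x, i, lower, upper) {
  # Position of the first element with x >= lower, one binary search per endpoint
  lo <- findInterval(lower, x, left.open = TRUE) + 1L
  # Position of the last element with x <= upper
  hi <- findInterval(upper, x)
  # Number of matching rows per interval (0 when an interval is empty)
  len <- pmax(hi - lo + 1L, 0L)
  # Expand each (lo, len) pair into consecutive row positions
  idx <- unlist(Map(function(s, n) s + seq_len(n) - 1L, lo, len))
  data.frame(i = rep(i, times = len), x = x[idx])
}

x <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89)
res <- find_in_intervals(x, i = c(1, 2), lower = c(3, 20), upper = c(10, 40))
print(res)
```

On this input, res holds the five rows expected in the question: x = 3, 5, 8 for interval 1 and x = 21, 34 for interval 2.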
Answer 2:
Here is a variation of the code proposed by @user1935457 (see the comment on @user1935457's post):
system.time({
if(!identical(key(DT), "x")) setkey(DT, x)
setkey(IT, min)
#below is the line that differs from @user1935457
#Using IT to address the lines of DT creates a smaller intermediate table
#We can also directly use .I
target.low<-DT[IT,list(i=i,min=.I),roll=-Inf, nomatch = 0][,list(min=min[1]),keyby=i]
setattr(IT, "sorted", "max")
# same here
target.high<-DT[IT,list(i=i,max=.I),roll=Inf, nomatch = 0][,list(max=last(max)),keyby=i]
target <- target.low[target.high, nomatch = 0]
target[, len := max - min + 1L]
rm(target.low, target.high)
ans.roll2 <- DT[data.table:::vecseq(target$min, target$len, NULL)][, i := unlist(mapply(rep, x = target$i, times = target$len, SIMPLIFY=FALSE))]
setcolorder(ans.roll2, c("i", "x"))
})
# user system elapsed
# 0.07 0.00 0.06
system.time({
# @user1935457 code
})
# user system elapsed
# 0.08 0.00 0.08
identical(ans.roll2, ans.roll)
#[1] TRUE
The performance gain is not huge here, but it should become more noticeable with a larger DT and a smaller IT. Thanks again to @user1935457 for your answer.
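Both versions above rely on the unexported data.table:::vecseq. As it is used here, it appears to expand (start, length) pairs into concatenated runs of row positions. If depending on an internal function is a concern, the same expansion can be sketched in base R. The helper name expand_runs is hypothetical, the toy vectors only stand in for target$min, target$len and target$i, and this version is likely slower than the internal one on very large inputs.

```r
# Hypothetical helper (not part of data.table): expand (start, length) pairs
# into the concatenated positions start, start + 1, ..., start + length - 1.
expand_runs <- function(start, len) {
  unlist(Map(function(s, n) s + seq_len(n) - 1L, start, len))
}

# Toy values standing in for target$min, target$len and target$i
start <- c(4L, 8L)
len   <- c(3L, 2L)
id    <- c(1L, 2L)

rows <- expand_runs(start, len)  # row positions to pull from DT
grp  <- rep(id, times = len)     # interval id repeated once per matching row
```

Since rep is vectorised over times, rep(id, times = len) builds the same vector as unlist(mapply(rep, x = id, times = len, SIMPLIFY = FALSE)) in a single call.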
Answer 3:
If you don't want to do a full vector scan, you should first declare your variable as a key for your data.table:
DT <- data.table(x=c(1,1,2,3,5,8,13,21,34,55,89),key="x")
Then you can use %between%:
R> DT[x %between% c(3,10),]
x
1: 3
2: 5
3: 8
R> DT[x %between% c(3,10) | x %between% c(20,40),]
x
1: 3
2: 5
3: 8
4: 21
5: 34
EDIT: As @mnel pointed out, %between% still does vector scans. The Note section of the help page says:
Current implementation does not make use of ordered keys.
So this doesn't answer your question.
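For what it's worth, more recent data.table releases support non-equi joins in the on= argument, which let the interval table drive the lookup directly. The sketch below assumes a data.table version with that feature and reuses the small tables from the question. Whether it avoids a scan entirely depends on the package internals, so treat it as something to benchmark rather than a guaranteed answer.

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89))
I  <- data.table(i = c(1, 2), min = c(3, 20), max = c(10, 40))

# Non-equi join: for each row of I, pick the rows of DT with min <= x <= max.
# In j, x.x refers to DT's own x column; the join columns in the result
# would otherwise carry the values of min and max from I.
res <- DT[I, on = .(x >= min, x <= max), .(i, x = x.x), nomatch = 0L]
print(res)
```

With nomatch = 0L, intervals that contain no value of x are dropped from the result, which mirrors the nomatch = 0 used in the rolling-join answers.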
Source: https://stackoverflow.com/questions/16666183/find-values-in-a-given-interval-without-a-vector-scan