Question
Consider two vectors: a reference vector/list that contains all values of interest, and a sample vector that could contain any possible value. Now I want to find matches of my sample inside the reference list with a certain tolerance, which is not fixed but depends on the values being compared:
matches: abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5
Rounding both vectors is not an option!
For example, consider:
referencelist <- read.table(header=TRUE, text="value name
154.00312 A
154.07685 B
154.21452 C
154.49545 D
156.77310 E
156.83991 F
159.02992 G
159.65553 H
159.93843 I")
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
So I get the result:
  name     value reference
1    A 154.00315 154.00312
2    G 159.02991 159.02992
3    B 154.07688 154.07685
4    E 156.77312 156.77310
What I can do is use e.g. the outer() function, like
myDist <- outer(referencelist$value, sample, FUN = function(x, y) abs(((x - y) / y) * 10^6))
matches <- which(myDist < 0.5, arr.ind = TRUE)
data.frame(name = referencelist$name[matches[, 1]], value = sample[matches[, 2]])
or I could use a for() loop.
But my particular problem is that the reference vector has around 1*10^12 entries and my sample vector around 1*10^7, so by using outer() I easily blow through all working-space limits, and a for() loop (or nested for() loops) would take days or weeks to finish.
Does anybody have an idea how to do this fast in R, still precise, but working on a computer using at most 64 GB of RAM?
Thanks for any help!
Best wishes
Answer 1:
Your match condition
abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5
can be re-written (for positive sample values) as
sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)
with eps = 0.5E-6.
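As a quick sanity check, both formulations agree on the example data (the helper names refs and s below are just for illustration; exact boundary cases could in principle differ by floating-point rounding):
eps  <- 0.5E-6
s    <- sample[1]                      # first sample value from the question
refs <- referencelist$value            # reference values from the question
all((abs((refs - s) / s) * 10^6 < 0.5) ==
    (refs > s * (1 - eps) & refs < s * (1 + eps)))   # should be TRUE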
Using this, we can use a non-equi-join to find all matches (not only the nearest!) in referencelist for each sample:
library(data.table)
options(digits = 10)
eps <- 0.5E-6   # the 0.5 ppm tolerance expressed as a relative difference (0.5 / 10^6)
setDT(referencelist)
setnames(referencelist, "value", "ref")   # rename so the reference column does not clash with the sample's value column
referencelist[.(value = sample,
                lower = sample * (1 - eps),
                upper = sample * (1 + eps)),
              on = .(ref > lower, ref < upper),
              .(name, value, reference = x.ref)]
which reproduces the expected result:
   name     value reference
1:    A 154.00315 154.00312
2:    G 159.02991 159.02992
3:    B 154.07688 154.07685
4:    E 156.77312 156.77310
In response to the OP's comment, let's say we have a modified referencelist2 with F = 154.00320; then this match is caught too:
setDT(referencelist2)
setnames(referencelist2, "value", "ref")   # same renaming as above
referencelist2[.(value = sample,
                 lower = sample * (1 - eps),
                 upper = sample * (1 + eps)),
               on = .(ref > lower, ref < upper),
               .(name, value, reference = x.ref)]
   name     value reference
1:    A 154.00315 154.00312
2:    F 154.00315 154.00320
3:    G 159.02991 159.02992
4:    B 154.07688 154.07685
5:    E 156.77312 156.77310
Answer 2:
Using data.table (and copy-pasting from @eddi's binary search, also called bisection, cf. @John Coleman's comment):
library(data.table)
dt <- as.data.table(referencelist)
setattr(dt, "sorted", "value")   # declare dt as sorted on "value" (it already is), so a keyed join is possible
tol <- 0.5
# for each sample value, pick the nearest reference value via a rolling join
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
# relative difference in ppm, then keep only matches within the tolerance
dt2[, diff := abs(ref - value) / value * 1e6]
dt2[diff <= tol]
# value I ref name diff
# 1: 154.0032 1 154.0031 A 0.19480121
# 2: 159.0299 7 159.0299 G 0.06288125
# 3: 154.0769 2 154.0769 B 0.19470799
# 4: 156.7731 5 156.7731 E 0.12757289
I haven't benchmarked memory usage or execution time, but data.table has a reputation for being very good at both. If it doesn't work for you, say so and maybe I'll try to benchmark things.
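If you want a rough feel for the timing on larger (synthetic) inputs first, a minimal sketch could look like the following; the sizes are chosen to fit comfortably in memory and are nowhere near the 10^12 entries from the question:
set.seed(1)
big_ref <- data.table(value = sort(runif(1e7, 100, 200)), name = seq_len(1e7))
setattr(big_ref, "sorted", "value")      # same trick as above: declare the table sorted on value
big_sample <- runif(1e5, 100, 200)
system.time({
  res <- big_ref[J(big_sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
  res[, diff := abs(ref - value) / value * 1e6]
  res <- res[diff <= 0.5]
})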
Note: my use of data.table is quite naive.
And there's a solution using findInterval just below (https://stackoverflow.com/a/29552922/6197649), but I'd expect it to perform worse (again: this would require benchmarks).
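For illustration, a minimal base-R sketch of what such a findInterval approach could look like on the question's referencelist and sample (nearest reference per sample value, then the 0.5 ppm filter; not benchmarked):
ord  <- order(referencelist$value)       # findInterval needs an ascending reference vector
ref  <- referencelist$value[ord]
nm   <- referencelist$name[ord]
idx  <- findInterval(sample, ref)        # index of the largest reference <= each sample (0 if none)
lo   <- pmax(idx, 1L)                    # guard against samples below the smallest reference
hi   <- pmin(idx + 1L, length(ref))      # candidate neighbour on the other side
near <- ifelse(abs(ref[lo] - sample) <= abs(ref[hi] - sample), lo, hi)
ppm  <- abs((ref[near] - sample) / sample) * 1e6
keep <- ppm < 0.5
data.frame(name = nm[near[keep]], value = sample[keep], reference = ref[near[keep]])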
Source: https://stackoverflow.com/questions/46957566/matching-two-very-very-large-vectors-with-tolerance-fast-but-working-space-spa