Matching two very very large vectors with tolerance (fast! but working space sparing)

*爱你&永不变心* submitted on 2019-12-01 09:12:40

Your match condition

abs(((referencelist - sample[i])/sample[i])*10^6) < 0.5

can be rewritten, for positive sample[i], as

sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)

with eps = 0.5E-6.
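
The equivalence of the two conditions can be checked numerically. This is only a minimal sketch in base R, assuming sample[i] is positive (for a negative value the two bounds would swap). The values in s and r are illustrative, borrowed from the example outputs in this answer, and the variable names are made up for the check.

```r
eps <- 0.5e-6                  # 0.5 ppm expressed as a plain fraction
s <- 154.00315                 # one sample value (assumed positive)
r <- c(154.00312, 154.00320, 154.07685)   # candidate reference values

# Original condition: relative deviation in ppm below 0.5
ppm_match <- abs((r - s) / s * 1e6) < 0.5

# Rewritten condition: reference lies strictly inside the tolerance interval
interval_match <- r > s * (1 - eps) & r < s * (1 + eps)

ppm_match
interval_match
identical(ppm_match, interval_match)
```

Both vectors come out as TRUE, TRUE, FALSE for these values. Values sitting exactly on a bound could in principle differ between the two forms because of floating-point rounding.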

With this, a non-equi join finds all matches (not only the nearest one) in referencelist for each sample:

library(data.table)
options(digits = 10)
eps <- 0.5E-6 # tol / 1E6, i.e. 0.5 ppm as a fraction
setDT(referencelist)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]

which reproduces the expected result:

   name     value reference
1:    A 154.00315 154.00312
2:    G 159.02991 159.02992
3:    B 154.07688 154.07685
4:    E 156.77312 156.77310

In response to OP's comment: suppose we have a modified referencelist2 that also contains F = 154.00320. That entry will be caught too:

setDT(referencelist2)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]
   name     value reference
1:    A 154.00315 154.00312
2:    F 154.00315 154.00320
3:    G 159.02991 159.02992
4:    B 154.07688 154.07685
5:    E 156.77312 156.77310

Using data.table, and copy-pasting from @eddi's binary search (also called bisection, cf. @John Coleman's comment):

library(data.table)

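dt <- as.data.table(referencelist)
# Only marks dt as sorted by value, it does not reorder anything; this assumes
# referencelist is already sorted by value (otherwise use setkey(dt, value))
setattr(dt, "sorted", "value")

tol <- 0.5  # tolerance in ppm
# Rolling join: for each sample, pick the single nearest reference value
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
# Relative deviation in ppm, then keep only matches within tolerance
dt2[, diff := abs(ref - value) / value * 1e6]
dt2[diff <= tol]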

#       value I      ref name       diff
# 1: 154.0032 1 154.0031    A 0.19480121
# 2: 159.0299 7 159.0299    G 0.06288125
# 3: 154.0769 2 154.0769    B 0.19470799
# 4: 156.7731 5 156.7731    E 0.12757289

I haven't benchmarked memory usage or execution time, but data.table has a reputation for being very good at both. If it doesn't work for you, say so and I may try to benchmark things.

Note: my use of data.table is quite naive.

There is also a solution using findInterval just below: https://stackoverflow.com/a/29552922/6197649, but I'd expect it to perform worse (again, that would require benchmarks).
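
For readers who want to see the idea without data.table, here is a rough base R sketch of tolerance matching with findInterval. It is not the code from the linked answer, just one possible way to apply the interval condition. The data are hypothetical (loosely based on the values printed above), the names lo, hi, n and idx are made up for illustration, and it assumes positive values and R >= 3.3.0 for the left.open argument.

```r
# Hypothetical data; ref must be sorted for findInterval
ref    <- sort(c(154.00312, 154.00320, 154.07685, 156.77310, 159.02992))
sample <- c(154.00315, 159.02991, 154.07688, 156.77312, 155.50000)
eps    <- 0.5e-6

# For each sample: number of reference values <= lower bound
lo <- findInterval(sample * (1 - eps), ref)
# For each sample: number of reference values < upper bound
hi <- findInterval(sample * (1 + eps), ref, left.open = TRUE)

n   <- hi - lo                    # number of matches per sample
idx <- rep(lo, n) + sequence(n)   # positions in ref of every match

# One row per (sample, reference) pair within tolerance
matches <- data.frame(value = rep(sample, n), reference = ref[idx])
matches
```

Apart from the inputs, this only allocates vectors of length(sample) plus one row per match, so it should stay light on memory. Whether it is faster or slower than the data.table joins would still have to be benchmarked.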
