Matching two very very large vectors with tolerance (fast! but working space sparing)

Consider that I have two vectors. One is a reference vector/list that includes all values of interest, and one is a sample vector that could contain any possible value. Now I want to find matches of my sample inside the reference list with a certain tolerance, which is not fixed but depends on the values being compared:

matches: abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5

Rounding both vectors is not an option!

For example, consider:

referencelist <- read.table(header=TRUE, text="value  name
154.00312  A
154.07685  B
154.21452  C
154.49545  D
156.77310  E
156.83991  F
159.02992  G
159.65553  H
159.93843  I")

sample <- c(154.00315, 159.02991, 154.07688, 156.77312)

The result I want is:

    name  value      reference
1   A     154.00315  154.00312
2   G     159.02991  159.02992
3   B     154.07688  154.07685
4   E     156.77312  156.77310

What I can do is use e.g. the outer function:

myDist <- outer(referencelist$value, sample, FUN = function(x, y) abs(((x - y) / y) * 10^6))
matches <- which(myDist < 0.5, arr.ind = TRUE)
data.frame(name = referencelist$name[matches[, 1]],
           value = sample[matches[, 2]],
           reference = referencelist$value[matches[, 1]])

Or I could use a for() loop.

But my particular problem is that the reference vector has around 1*10^12 entries and my sample vector around 1*10^7. So with outer() I easily exceed all memory limits, and a for() loop or nested for() loops would take days or weeks to finish.

Does anybody have an idea how to do this fast in R, still precisely, on a computer with at most 64 GB of RAM?

Thanks for any help!

Best wishes

Your match condition

abs(((referencelist - sample[i]) / sample[i]) * 10^6) < 0.5

can be re-written as

sample[i] * (1 - eps) < referencelist < sample[i] * (1 + eps)

with eps = 0.5E-6, provided sample[i] is positive.
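As a quick sanity check, the two forms of the condition can be compared on the example data from the question. This is only a sketch in base R: ref_values, cond_original and cond_interval are names made up for the illustration, and the equivalence relies on all sample values being positive.

```r
# Reference and sample values copied from the question
ref_values <- c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
                156.83991, 159.02992, 159.65553, 159.93843)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
eps <- 0.5e-6  # 0.5 ppm expressed as a fraction

# Original formulation: relative deviation in ppm below 0.5
cond_original <- outer(ref_values, sample,
                       function(x, y) abs((x - y) / y * 1e6) < 0.5)

# Rewritten formulation: reference inside an open interval around the sample
cond_interval <- outer(ref_values, sample,
                       function(x, y) x > y * (1 - eps) & x < y * (1 + eps))

# Compare the two logical matrices (rows = references, columns = samples)
identical(cond_original, cond_interval)
```

For this small data set both matrices should flag the same four (reference, sample) pairs. The outer() call is used here only because the example is tiny; it is exactly the step that does not scale to the real data.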

Using this, we can use a non-equi-join to find all matches (not only the nearest!) in referencelist for each sample:

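library(data.table)
options(digits = 10)
eps <- 0.5E-6 # tol / 1E6, i.e. 0.5 ppm as a fraction
# The join below expects the reference column to be called "ref";
# work on a copy so that the original referencelist keeps its "value" column
ref_dt <- setnames(as.data.table(referencelist), "value", "ref")
ref_dt[.(value = sample, 
         lower = sample * (1 - eps), 
         upper = sample * (1 + eps)), 
       on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]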

which reproduces the expected result:

   name     value reference
1:    A 154.00315 154.00312
2:    G 159.02991 159.02992
3:    B 154.07688 154.07685
4:    E 156.77312 156.77310

In response to OP's comment: let's say we have a modified referencelist2 (again with the reference column named ref) in which F = 154.00320; then this will be caught too:

setDT(referencelist2)[.(value = sample, 
                       lower = sample * (1 - eps), 
                       upper = sample * (1 + eps)), 
                     on = .(ref > lower, ref < upper), .(name, value, reference = x.ref)]

   name     value reference
1:    A 154.00315 154.00312
2:    F 154.00315 154.00320
3:    G 159.02991 159.02992
4:    B 154.07688 154.07685
5:    E 156.77312 156.77310

Using data.table (and copy-pasting from @eddi's binary search (also called bisection, cf @John Coleman's comment)):

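library(data.table)

dt <- as.data.table(referencelist)
# Mark dt as keyed by value without re-sorting; this is only valid because
# referencelist is already sorted by value (otherwise use setkey(dt, value))
setattr(dt, "sorted", "value")

tol <- 0.5 # tolerance in ppm
# Rolling join: for each sample, pick the single nearest reference value
dt2 <- dt[J(sample), .(.I, ref = value, name), roll = "nearest", by = .EACHI]
# Relative deviation in ppm; keep only the pairs within the tolerance
dt2[, diff := abs(ref - value) / value * 1e6]
dt2[diff <= tol]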

#       value I      ref name       diff
# 1: 154.0032 1 154.0031    A 0.19480121
# 2: 159.0299 7 159.0299    G 0.06288125
# 3: 154.0769 2 154.0769    B 0.19470799
# 4: 156.7731 5 156.7731    E 0.12757289

I haven't benchmarked memory usage or execution time, but data.table has a reputation for being very good at both. If it doesn't work for you, say so and maybe I'll try to benchmark things.

Note: my use of data.table is quite naive.

And there's a solution using findInterval just below: https://stackoverflow.com/a/29552922/6197649, but I'd expect it to perform worse (again: would require benchmarks).
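To make the findInterval idea concrete, here is a minimal base R sketch on the example data. It is not the code from the linked answer; the variable names (ref_sorted, name_sorted, first, last, n_hits and so on) are made up for the illustration, and it assumes positive sample values and a reference vector that fits in memory.

```r
# Example data copied from the question
referencelist <- data.frame(
  value = c(154.00312, 154.07685, 154.21452, 154.49545, 156.77310,
            156.83991, 159.02992, 159.65553, 159.93843),
  name = LETTERS[1:9],
  stringsAsFactors = FALSE
)
sample <- c(154.00315, 159.02991, 154.07688, 156.77312)
eps <- 0.5e-6  # 0.5 ppm expressed as a fraction

# findInterval() needs the reference values in increasing order
ord <- order(referencelist$value)
ref_sorted <- referencelist$value[ord]
name_sorted <- referencelist$name[ord]

# Index of the first reference value strictly above the lower bound
first <- findInterval(sample * (1 - eps), ref_sorted) + 1L
# Index of the last reference value strictly below the upper bound
last <- findInterval(sample * (1 + eps), ref_sorted, left.open = TRUE)

# Number of matching references per sample (0 if the window is empty)
n_hits <- pmax(last - first + 1L, 0L)

# Expand the index ranges into one row per (sample, reference) match
sample_idx <- rep(seq_along(sample), n_hits)
ref_idx <- unlist(Map(function(f, n) f + seq_len(n) - 1L, first, n_hits))

result <- data.frame(name = name_sorted[ref_idx],
                     value = sample[sample_idx],
                     reference = ref_sorted[ref_idx])
result
```

Like the non-equi join, this returns every reference inside the tolerance window, not only the nearest one. Since each sample is handled independently, the sample vector could be processed in chunks to limit the size of the intermediate index vectors; whether that is enough for the sizes in the question would need to be measured.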

Source: https://stackoverflow.com/questions/46957566/matching-two-very-very-large-vectors-with-tolerance-fast-but-working-space-spa
