Computing pairwise Hamming distance between all rows of two integer matrices/data frames


挽巷 2021-01-06 14:55

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second be

3 answers

無奈伤痛 (original poster)

2021-01-06 15:42

Don't be surprised that I am posting a separate answer. This part covers something related. It is not what the OP asks for, but it may help other readers.

General Hamming distance computation

In my previous answer, I started from a function hmd0 that computes the Hamming distance between two integer vectors of the same length, summed over all elements. So given two integer vectors:

set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)

we will end up with a scalar:

hmd0(x,y)
# 13
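
hmd0 itself is defined in the previous answer and is not reproduced here. A minimal sketch of what such a function might look like, assuming it simply XORs the 32-bit representations of the two vectors and counts the bits that differ (the body below is an assumption, not necessarily the original definition):

```r
## Hypothetical reconstruction of hmd0: total Hamming distance between
## two integer vectors of the same length.
hmd0 <- function(x, y) {
  stopifnot(length(x) == length(y))
  ## intToBits() yields 32 raw "bits" per integer;
  ## xor() flags the bit positions where x and y differ
  sum(as.logical(xor(intToBits(x), intToBits(y))))
}

hmd0(c(1L, 2L), c(3L, 0L))
# [1] 2
```

Here 1 and 3 differ in one bit, and 2 and 0 differ in one bit, so the total is 2.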

What if we want the Hamming distance for each pair of corresponding elements of the two vectors, rather than a single total?

In fact, a simple modification to our function hmd0 will do:

hamming.distance <- function(x, y, pairwise = TRUE) {
  nx <- length(x)
  ny <- length(y)
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  ## scalar case: count the differing bits directly
  if (nx == 1 && ny == 1) return(sum(as.logical(xor(rawx, rawy))))
  if (nx < ny) {
    ## pivoting: make x the longer vector
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
    }
  if (nx %% ny) stop("unconformable length!") else {
    bits <- length(intToBits(0L)) ## bits per integer (32 in R)
    ## rawy is recycled along rawx, so there are nx groups of `bits` bits each
    result <- unname(tapply(as.logical(xor(rawx, rawy)), rep(seq_len(nx), each = bits), sum))
    }
  ## pairwise = TRUE: one distance per element pair; FALSE: their total
  if (pairwise) result else sum(result)
  }

Now

hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13
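
For comparison, the same element-wise distances can be sketched without tapply, by XOR-ing the integers first and reshaping the bits into a matrix with one column per element. This is an alternative formulation, not the function above; the name hamming_elementwise is made up for illustration, and explicit recycling via rep_len is a design choice of this sketch:

```r
## Sketch: element-wise Hamming distance via bitwXor() and colSums().
## hamming_elementwise is a hypothetical helper name.
hamming_elementwise <- function(x, y) {
  x <- as.integer(x)
  y <- as.integer(y)
  ## recycle the shorter vector explicitly to the common length
  n <- max(length(x), length(y))
  x <- rep_len(x, n)
  y <- rep_len(y, n)
  ## bits that differ between x[i] and y[i] are set in d[i]
  d <- bitwXor(x, y)
  ## intToBits() returns 32 raw values per integer: one column per element
  bits <- matrix(as.integer(intToBits(d)), nrow = 32L)
  colSums(bits)
}

hamming_elementwise(c(1L, 2L, 7L), c(3L, 0L, 0L))
# [1] 1 1 3
hamming_elementwise(5L, c(1L, 2L, 3L))
# [1] 1 3 2
```

Summing the result gives the same total that pairwise = FALSE would return.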

Hamming distance matrix

Suppose we want the Hamming distance matrix between two vectors, possibly of different lengths, for example:

set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)

The distance matrix between x and y is:

outer(x, y, hamming.distance)  ## pairwise must stay at its default (TRUE) here

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,]    2    3    4    3    4    4    2
# [2,]    7    6    3    4    3    3    3
# [3,]    4    5    4    3    6    4    2
# [4,]    2    3    2    5    6    4    2
# [5,]    4    3    4    3    2    0    2

We can also do:

outer(x, x, hamming.distance)

#     [,1] [,2] [,3] [,4] [,5]
# [1,]    0    5    2    2    4
# [2,]    5    0    3    5    3
# [3,]    2    3    0    2    4
# [4,]    2    5    2    0    4
# [5,]    4    3    4    4    0

In the latter case, we get a symmetric matrix with zeros on the diagonal. Using outer is wasteful here, since it computes every off-diagonal entry twice, but it is still more efficient than writing R loops. Since our hamming.distance is written in R, I would stay with outer. In my answer to this question, I demonstrate the idea of using compiled code. That would of course require writing a C version of hamming.distance, which I do not show here.
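
If the redundancy in the symmetric case matters, one possible pure-R sketch is to compute only the upper triangle and mirror it. The function name hamming_dist_matrix is hypothetical, and it uses bitwXor with intToBits rather than hamming.distance so that it stands on its own:

```r
## Sketch: symmetric Hamming distance matrix of an integer vector,
## computing each unordered pair once. hamming_dist_matrix is a made-up name.
hamming_dist_matrix <- function(x) {
  x <- as.integer(x)
  n <- length(x)
  D <- matrix(0, n, n)
  if (n < 2L) return(D)
  ## (row, col) indices of the strict upper triangle
  idx <- which(upper.tri(D), arr.ind = TRUE)
  ## XOR each pair, then count set bits (32 bits per integer, one column per pair)
  d <- bitwXor(x[idx[, 1]], x[idx[, 2]])
  D[idx] <- colSums(matrix(as.integer(intToBits(d)), nrow = 32L))
  ## mirror the upper triangle; the diagonal stays 0
  D + t(D)
}

D <- hamming_dist_matrix(c(0L, 1L, 3L, 7L))
D[1, 4]
# [1] 3
```

This roughly halves the number of XOR operations compared with outer(x, x, ...), at the cost of building the index matrix.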