Computing pairwise Hamming distance between all rows of two integer matrices/data frames

挽巷 2021-01-06 14:55

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second be

3 answers
  •  無奈伤痛
    2021-01-06 15:42

    Please don't be surprised that I am posting a separate answer. This part is related to the question but is not what the OP asks for; it may still help other readers.


    General Hamming distance computation

    In the previous answer, I started from a function hmd0 that computes the Hamming distance between two integer vectors of the same length. That is, given two integer vectors:

    set.seed(0)
    x <- sample(1:100, 6)
    y <- sample(1:100, 6)
    

    we will end up with a scalar:

    hmd0(x,y)
    # 13
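
    hmd0 is defined in the previous answer and is not reproduced here. As a minimal sketch, assuming it simply XORs the bit patterns and counts the differing bits (the name hmd0_sketch and its body are my guess, not the original definition), it could look like this:

    ```r
    ## Hypothetical stand-in for hmd0: XOR the bit patterns of two
    ## equal-length integer vectors and count the bits that differ.
    hmd0_sketch <- function(x, y) {
      stopifnot(length(x) == length(y))
      sum(as.logical(xor(intToBits(x), intToBits(y))))
    }

    hmd0_sketch(c(1L, 2L, 3L), c(0L, 0L, 0L))
    # [1] 4
    ```

    Here 1 and 2 each differ from 0 in one bit and 3 differs in two bits, so the total is 4.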
    

    What if we want the elementwise ("pairwise") Hamming distance between the two vectors, that is, one value per position instead of a single total?

    A simple modification of hmd0 will do:

    hamming.distance <- function(x, y, pairwise = TRUE) {
      nx <- length(x)
      ny <- length(y)
      rawx <- intToBits(x)
      rawy <- intToBits(y)
      ## scalar case: count the differing bits directly
      if (nx == 1 && ny == 1) return(sum(as.logical(xor(rawx, rawy))))
      if (nx < ny) {
        ## pivoting: make x the longer vector
        tmp <- rawx; rawx <- rawy; rawy <- tmp
        tmp <- nx; nx <- ny; ny <- tmp
        }
      if (nx %% ny) stop("unconformable length!") else {
        bits <- length(intToBits(0L)) ## intToBits gives 32 bits per integer
        ## rawy is recycled to the length of rawx, so group by nx, not ny
        result <- unname(tapply(as.logical(xor(rawx, rawy)), rep(1:nx, each = bits), sum))
        }
      if (pairwise) result else sum(result)
      }
    

    Now

    hamming.distance(x, y, pairwise = TRUE)
    # [1] 0 3 3 2 5 0
    hamming.distance(x, y, pairwise = FALSE)
    # [1] 13
    

    Hamming distance matrix

    Suppose we want the Hamming distance matrix between two vectors, which may differ in length, for example:

    set.seed(1)
    x <- sample(1:100, 5)
    y <- sample(1:100, 7)
    

    The distance matrix between x and y is:

    outer(x, y, hamming.distance)  ## pairwise argument has no effect here
    
    #      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
    # [1,]    2    3    4    3    4    4    2
    # [2,]    7    6    3    4    3    3    3
    # [3,]    4    5    4    3    6    4    2
    # [4,]    2    3    2    5    6    4    2
    # [5,]    4    3    4    3    2    0    2
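
    The pairwise argument makes no difference here because outer does not call the function once per pair: it expands both inputs to the same length and makes a single vectorised call. A rough sketch of that expansion follows. To keep the snippet runnable on its own it uses a small hypothetical helper hd, built on bitwXor, rather than the function from this answer:

    ```r
    ## hd is an illustrative stand-in for an elementwise Hamming distance.
    hd <- function(a, b) {
      z <- bitwXor(a, b)                                     # differing bits per element
      colSums(matrix(as.integer(intToBits(z)), nrow = 32L))  # one column per element
    }

    x <- c(1L, 2L, 3L)
    y <- c(0L, 7L)

    ## What outer(x, y, hd) does, in essence:
    X <- rep(x, times = length(y))   # 1 2 3 1 2 3
    Y <- rep(y, each  = length(x))   # 0 0 0 7 7 7
    manual <- matrix(hd(X, Y), nrow = length(x), ncol = length(y))

    all(manual == outer(x, y, hd))
    # [1] TRUE
    ```

    Both expanded vectors have the same length, so the function only ever sees the equal-length case.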
    

    We can also do:

    outer(x, x, hamming.distance)
    
    #     [,1] [,2] [,3] [,4] [,5]
    # [1,]    0    5    2    2    4
    # [2,]    5    0    3    5    3
    # [3,]    2    3    0    2    4
    # [4,]    2    5    2    0    4
    # [5,]    4    3    4    4    0
    

    In the latter case we get a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, since it computes every off-diagonal entry twice, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stick with outer. In my answer to this question, I demonstrate the idea of using compiled code. That would of course require a C version of hamming.distance, which I will not show here.
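
    When both arguments are the same vector, one way to avoid computing each pair twice while staying in R is to evaluate only the strict upper triangle and mirror it. This is a minimal sketch of that idea, not something from the original answer. It again uses a hypothetical bitwXor-based helper hd as a stand-in for an elementwise Hamming distance:

    ```r
    ## hd is an illustrative stand-in for an elementwise Hamming distance.
    hd <- function(a, b) {
      z <- bitwXor(a, b)
      colSums(matrix(as.integer(intToBits(z)), nrow = 32L))
    }

    x <- c(1L, 2L, 3L, 4L)
    n <- length(x)

    ## Index pairs (i, j) with i < j, i.e. the strict upper triangle.
    idx <- which(upper.tri(diag(n)), arr.ind = TRUE)

    d <- matrix(0, n, n)
    d[idx] <- hd(x[idx[, 1]], x[idx[, 2]])  # compute each pair once
    d <- d + t(d)                           # mirror; the diagonal stays 0

    all(d == outer(x, x, hd))
    # [1] TRUE
    ```

    Whether this actually beats outer for a given input size would need to be measured.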
