Computing pairwise Hamming distance between all rows of two integer matrices/data frames

前端未结

关注

 3  1116

挽巷 2021-01-06 14:55

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second be

3条回答

夕颜 (楼主)

2021-01-06 15:17
Fast computation of hamming distance between two integers vectors of equal length

As I said in my comment, we can do:
```
hmd0 <- function(x,y) sum(as.logical(xor(intToBits(x),intToBits(y))))
```
to compute hamming distance between two integers vectors of equal length x and y. This only uses R base, yet is more efficient than e1071::hamming.distance, because it is vectorized!

For the example x and y in your post, this gives 25. (My other answer will show what we should do, if we want pairwise hamming distance.)

Fast hamming distance between a matrix and a vector

If we want to compute the hamming distance between a single y and multiple xs, i.e., the hamming distance between a vector and a matrix, we can use the following function.
```
hmd <- function(x,y) {
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  nx <- length(rawx)
  ny <- length(rawy)
  if (nx == ny) {
    ## quick return
    return (sum(as.logical(xor(rawx,rawy))))
    } else if (nx < ny) {
    ## pivoting
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
    }
  if (nx %% ny) stop("unconformable length!") else {
    nc <- nx / ny  ## number of cycles
    return(unname(tapply(as.logical(xor(rawx,rawy)), rep(1:nc, each=ny), sum)))
    }
  }
```
Note that:
1. hmd performs computation column-wise. It is designed to be CPU cache friendly. In this way, if we want to do some row-wise computation, we should transpose the matrix first;
2. there is no obvious loop here; instead, we use tapply().
Fast hamming distance computation between two matrices/data frames

This is what you want. The following function foo takes two data frames or matrices df1 and df2, computing the distance between df1 and each row of df2. argument p is an integer, showing how many results you want to retain. p = 3 will keep the smallest 3 distances with their row ids in df1.
```
foo <- function(df1, df2, p) {
  ## check p
  if (p > nrow(df2)) p <- nrow(df2)
  ## transpose for CPU cache friendly code
  xt <- t(as.matrix(df1))
  yt <- t(as.matrix(df2))
  ## after transpose, we compute hamming distance column by column
  ## a for loop is decent; no performance gain from apply family
  n <- ncol(yt)
  id <- integer(n * p)
  d <- numeric(n * p)
  k <- 1:p
  for (i in 1:n) {
    distance <- hmd(xt, yt[,i])
    minp <- order(distance)[1:p]
    id[k] <- minp
    d[k] <- distance[minp]
    k <- k + p
    }
  ## recode "id" and "d" into data frame and return
  id <- as.data.frame(matrix(id, ncol = p, byrow = TRUE))
  colnames(id) <- paste0("min.", 1:p)
  d <- as.data.frame(matrix(d, ncol = p, byrow = TRUE))
  colnames(d) <- paste0("mindist.", 1:p)
  list(id = id, d = d)
  }
```
Note that:
1. transposition is done at the beginning, according to reasons before;
2. a for loop is used here. But this is actually efficient because there is considerable computation done in each iteration. It is also more elegant than using *apply family, since we ask for multiple output (row id id and distance d).
Experiment

This part uses small dataset to test/demonstrate our functions.

Some toy data:
```
set.seed(0)
df1 <- as.data.frame(matrix(sample(1:10), ncol = 2))  ## 5 rows 2 cols
df2 <- as.data.frame(matrix(sample(1:6), ncol = 2))  ## 3 rows 2 cols
```
Test hmd first (needs transposition):
```
hmd(t(as.matrix(df1)), df2[1, ])  ## df1 & first row of df2
# [1] 2 4 6 2 4
```
Test foo:
```
foo(df1, df2, p = 2)

# $id
#   min1 min2
# 1    1    4
# 2    2    3
# 3    5    2

# $d
#   mindist.1 mindist.2
# 1         2         2
# 2         1         3
# 3         1         3
```
If you want to append some columns to df2, you know what to do, right?
0 讨论(0)

查看其它3个回答
发布评论:

提交评论
- 加载中...