Recalculating distance matrix

狂风中的少年 提交于 2019-12-05 07:54:57

问题


I’ve got a large input matrix (4000x10000). I use dist() to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?


回答1:


I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.

First of all, for euclidean distance of matrices, I'd recommend the rdist function from the fields package. It is written in Fortran and is a lot faster than the dist function. It returns a matrix instead of a dist object, but you can always go from one to the other using as.matrix and as.dist.

Here is (smaller than yours) sample data

num.points <- 400
num.vars   <- 1000
original.points <- matrix(runif(num.points * num.vars),
                          nrow = num.points, ncol = num.vars)

and the distance matrix you already computed:

d0 <- rdist(original.points)

For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:

extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist   <- rdist(extra.points)
outer.dist   <- rdist(extra.points, original.points)

so you can bind them to your bigger distance matrix:

d1 <- rbind(cbind(d0, t(outer.dist)),
            cbind(outer.dist, inner.dist))

Let's check that it matches what a full, long rerun would have produced:

d2 <- rdist(rbind(original.points, extra.points))

identical(d1, d2)
# [1] TRUE


来源:https://stackoverflow.com/questions/13668675/recalculating-distance-matrix

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!