I’ve got a large input matrix (4000x10000). I use dist()
to calculate the Euclidean distance matrix for it (it takes about 5 hours).
I need to calculate the distance matrix for the "same" matrix with an additional row (for a 4001x10000 matrix). What is the fastest way to determine the distance matrix without recalculating the whole matrix?
I'll assume your extra row means an extra point. If it means an extra variable/dimension, it will call for a different answer.
First of all, for euclidean distance of matrices, I'd recommend the rdist
function from the fields
package. It is written in Fortran and is a lot faster than the dist
function. It returns a matrix
instead of a dist
object, but you can always go from one to the other using as.matrix
and as.dist
.
Here is (smaller than yours) sample data
num.points <- 400
num.vars <- 1000
original.points <- matrix(runif(num.points * num.vars),
nrow = num.points, ncol = num.vars)
and the distance matrix you already computed:
d0 <- rdist(original.points)
For the extra point(s), you only need to compute the distances among the extra points and the distances between the extra points and the original points. I will use two extra points to show that the solution is general to any number of extra points:
extra.points <- matrix(runif(2 * num.vars), nrow = 2)
inner.dist <- rdist(extra.points)
outer.dist <- rdist(extra.points, original.points)
so you can bind them to your bigger distance matrix:
d1 <- rbind(cbind(d0, t(outer.dist)),
cbind(outer.dist, inner.dist))
Let's check that it matches what a full, long rerun would have produced:
d2 <- rdist(rbind(original.points, extra.points))
identical(d1, d2)
# [1] TRUE
来源:https://stackoverflow.com/questions/13668675/recalculating-distance-matrix