How to calculate Euclidean distance (and save only summaries) for large data frames

I've written a short 'for' loop to find the minimum euclidean distance between each row in a dataframe and all the other rows (and to record which row is closest). In theory this avoids the errors associated with trying to calculate distance measures for very large matrices. However, while not that much is being saved in memory, it is very very slow for large matrices (my use case of ~150K rows is still running).

I'm wondering whether anyone can advise or point me in the right direction in terms of vectorising my function, using apply or similar. Apologies for what may seem a simple question, but I'm still struggling to think in a vectorised way.

Thanks in advance (and for your patience).

require(proxy)

df<-data.frame(matrix(runif(10*10),nrow=10,ncol=10), row.names=paste("site",seq(1:10)))

min.dist<-function(df) {  
 #df for results
 all.min.dist<-data.frame()
 #set up for loop 
 for(k in 1:nrow(df)) {
     #calcuate dissimilarity between each row and all other rows
     df.dist<-dist(df[k,],df[-k,])
     # find minimum distance
     min.dist<-min(df.dist)
     # get rowname for minimum distance (id of nearest point)
     closest.row<-row.names(df)[-k][which.min(df.dist)]
     #combine outputs
     all.min.dist<-rbind(all.min.dist,data.frame(orig_row=row.names(df)[k],
     dist=min.dist, closest_row=closest.row))
    }
 #return results
 return(all.min.dist)
                        } 
 #example
 min.dist(df)

flodel

This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.

min.dist <- function(df) {

  which.closest <- function(k, df) {
    d <- colSums((df[, -k] - df[, k]) ^ 2)
    m <- which.min(d)
    data.frame(orig_row    = row.names(df)[k],
               dist        = sqrt(d[m]),
               closest_row = row.names(df)[-k][m])
  }

  do.call(rbind, lapply(1:nrow(df), which.closest, t(as.matrix(df))))
}

If this is still too slow, as a suggested improvement, you could compute the distances for k points at a time instead of a single one. The size of k will need to be a compromise between speed and memory usage.

Edit: Also read https://stackoverflow.com/a/16670220/1201032

Usually, built in functions are faster that coding it yourself (because coded in Fortran or C/C++ and optimized).

It seems that the function dist {stats} answers your question spot on:

Description This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.

来源：https://stackoverflow.com/questions/16474179/how-to-calculate-euclidean-distance-and-save-only-summaries-for-large-data-fra

标签

for-loop

distance

vectorization