How to calculate Euclidean distance (and save only summaries) for large data frames

*爱你&永不变心* submitted on 2019-12-01 20:47:38
flodel

This should be a good start. It uses fast matrix operations and avoids the growing object construct, both suggested in the comments.

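min.dist <- function(df) {

  # Transpose once so that each observation becomes a column.
  # Labels come from the data frame, not from the transposed matrix.
  m   <- t(as.matrix(df))
  ids <- row.names(df)

  which.closest <- function(k) {
    # Squared distances from observation k to every other observation
    d <- colSums((m[, -k, drop = FALSE] - m[, k]) ^ 2)
    j <- which.min(d)
    data.frame(orig_row    = ids[k],
               dist        = sqrt(d[[j]]),
               closest_row = ids[-k][j])
  }

  # Build one small data frame per row, then bind them all at once
  do.call(rbind, lapply(seq_len(nrow(df)), which.closest))
}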

If this is still too slow, one possible improvement is to compute the distances for k points at a time instead of one at a time (here k is a block size, not the row index used in the code above). The size of k will need to be a compromise between speed and memory usage.
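A rough sketch of that block-wise idea follows. It is not from the original answer: the name min.dist.chunked and the chunk.size argument are made up for illustration. It also swaps in the expansion |a - b|^2 = |a|^2 + |b|^2 - 2 a.b so that a whole block can be handled with one matrix product, which can lose a little precision for points that are very close together.

```r
# Hypothetical helper: nearest neighbour of every row, computed in blocks.
# chunk.size is an assumed tuning knob; each block allocates a
# chunk.size x nrow(df) matrix of squared distances.
min.dist.chunked <- function(df, chunk.size = 1000) {
  x   <- as.matrix(df)
  n   <- nrow(x)
  ids <- row.names(df)
  sq  <- rowSums(x ^ 2)                      # squared norm of every row

  pieces <- lapply(seq(1, n, by = chunk.size), function(s) {
    idx <- s:min(s + chunk.size - 1, n)
    # Squared distances between the rows of this block and all rows
    d2 <- outer(sq[idx], sq, "+") -
      2 * tcrossprod(x[idx, , drop = FALSE], x)
    self <- cbind(seq_along(idx), idx)
    d2[self] <- Inf                          # a point is not its own neighbour
    j <- unname(apply(d2, 1, which.min))     # closest column in each row
    best <- d2[cbind(seq_along(idx), j)]
    data.frame(orig_row    = ids[idx],
               dist        = sqrt(pmax(best, 0)),  # clamp rounding noise
               closest_row = ids[j])
  })

  do.call(rbind, pieces)
}
```

Calling min.dist.chunked(df, chunk.size = 500) should return the same three columns as min.dist(df). A larger chunk.size means fewer iterations but a bigger temporary matrix.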

Edit: Also read https://stackoverflow.com/a/16670220/1201032

Usually, built-in functions are faster than coding it yourself (because they are written in Fortran or C/C++ and optimized).

It seems that the function dist {stats} answers your question spot on:

Description: This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix.
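As a minimal sketch of how dist could be used to keep only the nearest-neighbour summary, assuming a numeric data frame small enough that the full distance matrix fits in memory (the helper name min.dist.full is made up for illustration). Note that dist stores all n(n-1)/2 pairwise distances, so for something like 100,000 rows that is about 5e9 doubles, roughly 40 GB, before the conversion to a square matrix. For data frames that large, a row-wise or block-wise approach is probably the safer choice.

```r
# Hypothetical helper: nearest-neighbour summary from a full dist() result.
# Only suitable when an n x n matrix of doubles fits in memory.
min.dist.full <- function(df) {
  d <- as.matrix(dist(df, method = "euclidean"))  # square n x n matrix
  diag(d) <- Inf                                  # ignore self-distances
  j <- unname(apply(d, 1, which.min))             # closest row for each row
  data.frame(orig_row    = row.names(df),
             dist        = d[cbind(seq_len(nrow(d)), j)],
             closest_row = row.names(df)[j])
}
```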
