Calculating all distances between one point and a group of points efficiently in R

太阳男子 2020-12-15 09:55

First of all, I am new to R (I started yesterday).

I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10). For each point in data I currently loop over all K centers, compute the Euclidean distance, and keep the index of the closest center, but my nested loops are very slow. How can I compute all these distances efficiently in R?

5 Answers
  • 2020-12-15 10:26

    dist is fast because it is vectorized and calls internal C functions.
    Your loop-based code can be vectorized in several ways.

    For example, to compute the distances between data and centers you could use outer:

    # distance between row i of data and row j of centers (i and j may be vectors)
    diff_ij <- function(i, j) sqrt(rowSums((data[i, ] - centers[j, ])^2))
    X <- outer(seq_len(n), seq_len(K), diff_ij)
    

    This gives you an n x K matrix of distances, and it should be much faster than the loop.

    Then you can use max.col to find the position of the maximum in each row (see its help page; there are some nuances when there are several maxima). X must be negated because we are searching for the minimum.

    CL <- max.col(-X)   # column index of the nearest center for each row of X
    

    To be efficient in R you should vectorize as much as possible. In many cases loops can be replaced by a vectorized substitute. Check the help for rowSums (which also describes rowMeans, colSums and colMeans), pmax and cumsum. You can also search SO, e.g. https://stackoverflow.com/search?q=[r]+avoid+loop (copy & paste this link), for more examples.
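
    Putting the pieces together, here is a minimal self-contained sketch (with made-up sizes n, K and d, since the original data is not shown):

    set.seed(1)
    n <- 100; K <- 5; d <- 3
    data    <- matrix(rnorm(n * d), nrow = n)   # one point per row
    centers <- matrix(rnorm(K * d), nrow = K)   # one center per row

    diff_ij <- function(i, j) sqrt(rowSums((data[i, ] - centers[j, ])^2))
    X  <- outer(seq_len(n), seq_len(K), diff_ij)   # n x K distance matrix
    CL <- max.col(-X)                              # nearest center for each point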

  • 2020-12-15 10:29

    rdist() is an R function from the {fields} package that can quickly calculate the distances between two sets of points given as matrices.

    https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html

    Usage:

    library(fields)
    # generate some fake data: 10 points in x, 5 points in y, 3 dimensions each
    n <- 10
    m <- 5
    d <- 3
    
    x <- matrix(rnorm(n * d), ncol = d)
    y <- matrix(rnorm(m * d), ncol = d)
    
    rdist(x, y)   # n x m matrix: distance from each row of x to each row of y
              [,1]     [,2]      [,3]     [,4]     [,5]
     [1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
     [2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
     [3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
     [4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
     [5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
     [6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
     [7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
     [8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
     [9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
    [10,] 2.492993 2.994359 3.3573190 6.498129 3.337441
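
    Applied to the question's setting (assuming, as in the other answers, that data is an n x d matrix of points and centers a K x d matrix of centers), the same idea might look like:

    D  <- rdist(data, centers)      # n x K matrix of Euclidean distances
    CL <- apply(D, 1, which.min)    # index of the nearest center for each point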
    
  • 2020-12-15 10:31

    My solution:

    # data is a matrix where each row is a point
    # point is a vector of values
    euc.dist <- function(data, point) {
      apply(data, 1, function (row) sqrt(sum((point - row) ^ 2)))
    }
    

    You can try it like this:

    x <- matrix(rnorm(25), ncol=5)
    euc.dist(x, x[1,])
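
    To handle all centers at once, you could wrap euc.dist in sapply to build the full distance matrix (a sketch, assuming centers is a K x d matrix as in the question):

    # entry [i, j] is the distance from data[i, ] to centers[j, ]
    dist.matrix <- sapply(seq_len(nrow(centers)), function(j) euc.dist(data, centers[j, ]))
    CL <- apply(dist.matrix, 1, which.min)   # nearest center for each point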
    
  • 2020-12-15 10:41

    You may want to have a look into the apply functions.

    For instance, this code

    for (j in 1:K)
        {
        d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
        }
    

    can easily be substituted by something like:

    dt <- data[i, ]
    d  <- apply(centers, 1, function(x) sqrt(sum((x - dt)^2)))
    

    You can definitely optimise it further, but I hope you get the point.
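
    Going one step further, the outer loop over i can be removed in the same way. A sketch, assuming data (n x d) and centers (K x d) as in the question:

    dist.matrix <- apply(data, 1, function(p) {
      apply(centers, 1, function(cen) sqrt(sum((cen - p)^2)))
    })                                        # K x n matrix of distances
    CL <- apply(dist.matrix, 2, which.min)    # nearest center for each data point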

  • 2020-12-15 10:48

    Rather than iterating across data points, you can just condense that to a matrix operation, meaning you only have to iterate across K.

    # Generate some fake data. Note that points are stored as columns here:
    # x is d x n and centers is d x K.
    n <- 3823
    K <- 10
    d <- 64
    x <- matrix(rnorm(n * d), ncol = n)
    centers <- matrix(rnorm(K * d), ncol = K)
    
    system.time(
      dists <- apply(centers, 2, function(center) {
        colSums((x - center)^2)   # squared distance from every point to this center
      })
    )
    

    Runs in:

           user  system elapsed 
          0.100   0.008   0.108 
    

    on my laptop.
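
    Note that these are squared distances, which is all you need to find the nearest center; a possible follow-up using the matrices defined above:

    CL <- apply(dists, 1, which.min)   # nearest center index for each of the n points
    # take sqrt(dists) only if the actual Euclidean distances are needed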
