Large distance matrix in clustering

一曲冷凌霜 submitted on 2019-11-30 10:08:55
Patric

To simplify, assume you have one row (A) to cluster against a matrix (B) of 3e8 rows by minimum distance.

The original approach is:

1. load A and B
2. compute the distance of A to each row of B
3. select the smallest one from the results (reduction)
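The three steps above can be sketched in a few lines of Python (a toy illustration, not the R code the answer assumes; the data here is random and small, since the whole point is that the real B is too large for this version):

```python
import random

# Toy data: A is one row (vector), B is a small in-memory matrix (list of rows).
# In the real problem B has ~3e8 rows, which is exactly why this naive version fails.
random.seed(0)
n_cols = 4
A = [random.random() for _ in range(n_cols)]
B = [[random.random() for _ in range(n_cols)] for _ in range(1000)]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Step 1: A and B are loaded.  Step 2: distance of A to each row of B.
distances = [euclidean(A, row) for row in B]
# Step 3: reduction -- pick the row with the smallest distance.
best_index = min(range(len(B)), key=lambda i: distances[i])
print(best_index, distances[best_index])
```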

But because B is really large, you can't load it into memory, or you get an error during execution.

The batched approach will look like this:

1. load A (suppose it is small)
2. load B.partial with the first 1e5 rows of B
3. compute the distance of A to each row of B.partial
4. select the minimum from the partial results and save it as res[i]
5. go back to 2) and load the next 1e5 rows of B
6. finally you get 3000 partial results saved in res[1:3000]
7. reduction: select the minimum from res[1:3000]
   note: if you need all distances, as the `dist` function returns, you don't need the reduction; just keep the whole array.
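The batched loop can be sketched as follows (again a Python illustration with small stand-in sizes; `load_batch` is a hypothetical loader, since in practice each chunk would be streamed from disk rather than sliced from an in-memory B):

```python
import random

random.seed(42)
n_cols = 4
n_rows = 30_000      # stand-in for the 3e8 rows of B
batch_size = 1_000   # stand-in for the 1e5-row chunks

A = [random.random() for _ in range(n_cols)]
# Full B only exists here for demonstration; in the real setting you would
# read each chunk from a file instead of holding all rows in memory.
B = [[random.random() for _ in range(n_cols)] for _ in range(n_rows)]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def load_batch(start):
    """Hypothetical loader: returns rows [start, start + batch_size) of B."""
    return B[start:start + batch_size]

res = []  # res[i] = (min distance in batch i, global row index of that minimum)
for start in range(0, n_rows, batch_size):
    part = load_batch(start)                        # step 2: load B.partial
    dists = [euclidean(A, row) for row in part]     # step 3: distances for this chunk
    j = min(range(len(dists)), key=dists.__getitem__)
    res.append((dists[j], start + j))               # step 4: save the partial minimum

best_dist, best_row = min(res)                      # step 7: final reduction
print(best_row, best_dist)
```

Only one batch plus the small `res` list is alive at any time, so peak memory is bounded by the batch size, not by the size of B.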

The code will be a little more complicated than the original one, but this is a very common trick when dealing with big-data problems. For the compute part, you can refer to one of my previous answers here.

I would very much appreciate it if you could paste your final batch-mode code here, so that others can study it as well.


Another interesting thing about dist is that it is one of the few functions in base R that supports OpenMP. See the source code here and how to compile R with OpenMP here.

So, try setting OMP_NUM_THREADS to 4 or 8 depending on your machine and run again; you should see a big performance improvement!
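Setting the variable is just a shell export before starting R (the `Rscript` call and the script name `my_clustering.R` are placeholders for your own workflow):

```shell
# Set the OpenMP thread count before launching R (use 4 or 8, per your core count).
export OMP_NUM_THREADS=4
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Any R session started from this shell inherits the variable, e.g.:
# Rscript my_clustering.R   # hypothetical script name
```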

 void R_distance(double *x, int *nr, int *nc, double *d, int *diag,
                 int *method, double *p)
 {
     int dc, i, j;
     size_t ij;  /* can exceed 2^31 - 1 */
     double (*distfun)(double *, int, int, int, int) = NULL;
 #ifdef _OPENMP
     int nthreads;
 #endif
     .....
 }

Furthermore, if you want to accelerate dist on a GPU, you can refer to the talk section of ParallelR.
