Large distance matrix in clustering

一曲冷凌霜 submitted on 2019-11-30 10:08:55
Patric

To simplify, assume you have one row (A) to cluster against a matrix (B) of 3e8 rows by minimum distance.

The original approach is:

1. load A and B
2. compute the distance of A to each row of B
3. select the smallest one from the results (reduction)
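The three steps above can be sketched in a few lines of Python (a toy illustration, not the R code the answer assumes; the data here is random and small, since the whole point is that the real B is too large for this version):

```python
import random

# Toy data: A is one row (vector), B is a small in-memory matrix (list of rows).
# In the real problem B has ~3e8 rows, which is exactly why this naive version fails.
random.seed(0)
n_cols = 4
A = [random.random() for _ in range(n_cols)]
B = [[random.random() for _ in range(n_cols)] for _ in range(1000)]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Step 1: A and B are loaded.  Step 2: distance of A to each row of B.
distances = [euclidean(A, row) for row in B]
# Step 3: reduction -- pick the row with the smallest distance.
best_index = min(range(len(B)), key=lambda i: distances[i])
print(best_index, distances[best_index])
```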

But because B is really large, you can't load it into memory, or you get an error during execution.

The batched approach will look like this:

1. load A (suppose it is small)
2. load B.partial with the first 1e5 rows of B
3. compute the distance of A to each row of B.partial
4. select the minimum from the partial results and save it as res[i]
5. go back to 2) and load the next 1e5 rows of B
6. finally you get 3000 partial results saved in res[1:3000]
7. reduction: select the minimum from res[1:3000]
   note: if you need all distances, as the `dist` function returns, you don't need the reduction; just keep the whole array.
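The batched loop can be sketched as follows (again a Python illustration with small stand-in sizes; `load_batch` is a hypothetical loader, since in practice each chunk would be streamed from disk rather than sliced from an in-memory B):

```python
import random

random.seed(42)
n_cols = 4
n_rows = 30_000      # stand-in for the 3e8 rows of B
batch_size = 1_000   # stand-in for the 1e5-row chunks

A = [random.random() for _ in range(n_cols)]
# Full B only exists here for demonstration; in the real setting you would
# read each chunk from a file instead of holding all rows in memory.
B = [[random.random() for _ in range(n_cols)] for _ in range(n_rows)]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def load_batch(start):
    """Hypothetical loader: returns rows [start, start + batch_size) of B."""
    return B[start:start + batch_size]

res = []  # res[i] = (min distance in batch i, global row index of that minimum)
for start in range(0, n_rows, batch_size):
    part = load_batch(start)                        # step 2: load B.partial
    dists = [euclidean(A, row) for row in part]     # step 3: distances for this chunk
    j = min(range(len(dists)), key=dists.__getitem__)
    res.append((dists[j], start + j))               # step 4: save the partial minimum

best_dist, best_row = min(res)                      # step 7: final reduction
print(best_row, best_dist)
```

Only one batch plus the small `res` list is alive at any time, so peak memory is bounded by the batch size, not by the size of B.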

The code will be a little more complicated than the original one, but this is a very common trick when dealing with big-data problems. For the compute part, you can refer to one of my previous answers here.

I would very much appreciate it if you could paste your final batch-mode code here, so that others can study it as well.


Another interesting thing about dist is that it is one of the few functions in base R that supports OpenMP. See the source code here and how to compile R with OpenMP here.

So, try setting OMP_NUM_THREADS to 4 or 8 depending on your machine and run again; you should see a big performance improvement!
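Setting the variable is just a shell export before starting R (the `Rscript` call and the script name `my_clustering.R` are placeholders for your own workflow):

```shell
# Set the OpenMP thread count before launching R (use 4 or 8, per your core count).
export OMP_NUM_THREADS=4
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# Any R session started from this shell inherits the variable, e.g.:
# Rscript my_clustering.R   # hypothetical script name
```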

 void R_distance(double *x, int *nr, int *nc, double *d, int *diag,
                 int *method, double *p)
 {
     int dc, i, j;
     size_t ij;  /* can exceed 2^31 - 1 */
     double (*distfun)(double *, int, int, int, int) = NULL;
 #ifdef _OPENMP
     int nthreads;
 #endif
     .....
 }

Furthermore, if you want to accelerate dist on a GPU, you can refer to the talk section of ParallelR.
