hclust size limit?

执念已碎 2020-12-18 15:35

I'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I tried assigning the distance matrix, I get: "Cannot

2 Answers
  • 2020-12-18 15:45

    Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in memory complexity, so they scale incredibly badly to large data sets. Obviously, anything that requires materializing the distance matrix is in O(n^2) or worse.
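
    As a rough back-of-the-envelope check (assuming 8-byte doubles, which is what R's dist() stores), the lower triangle of the distance matrix for 50K rows alone is about 9.3 GiB before any clustering even starts:

    n <- 50000
    # dist() keeps only the lower triangle: n*(n-1)/2 double-precision entries
    bytes <- n * (n - 1) / 2 * 8
    bytes / 1024^3   # ~9.31 GiB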

    Note that there are some specializations of hierarchical clustering, such as SLINK (single linkage) and CLINK (complete linkage), that run in O(n^2) time and, depending on the implementation, may need only O(n) memory.
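
    If you do want hierarchical clustering on the full data, one option (my suggestion, not something built into base R) is the fastcluster package, whose hclust.vector() does single-linkage clustering directly on the data matrix without materializing the distance matrix:

    # install.packages("fastcluster")   # drop-in replacement for hclust()
    library(fastcluster)
    # 'full' is assumed to be the 50K x 10 numeric matrix from the question
    hc <- hclust.vector(full, method = "single", metric = "euclidean")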

    You might want to look into more modern clustering algorithms. Anything that runs in O(n log n) or better should work for you. There are also plenty of good reasons not to use hierarchical clustering: it is usually rather sensitive to noise (i.e. it doesn't really know what to do with outliers), and the results are hard to interpret for large data sets (dendrograms are nice, but only for small ones).
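
    For instance, base R's kmeans() needs only linear memory and roughly O(n*k*i) time; a minimal sketch (the data object full and the choice of 10 clusters are illustrative):

    set.seed(42)                  # k-means starts from random centers
    km <- kmeans(full, centers = 10, nstart = 5)
    table(km$cluster)             # how many rows landed in each cluster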

  • 2020-12-18 16:04

    The size limit is being set by your hardware and software, and you have not given enough specifics to say much more. On a machine with adequate resources you would not be getting this error. Why not try a 10% sample before diving into the deep end of the pool? Perhaps starting with:

    # floor() guards against a non-integer sample size
    reduced <- full[ sample(seq_len(nrow(full)), floor(nrow(full) / 10)), ]
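
    A 10% sample of 50K rows gives a dist object of only ~12.5M entries (about 100 MB), so the usual pipeline then runs comfortably; a sketch (cutting the tree at k = 10 is illustrative):

    d      <- dist(reduced)
    hc     <- hclust(d)
    groups <- cutree(hc, k = 10)   # assign each sampled row to one of 10 groups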
    