How to generate a 'clusterable' dataset in MATLAB

Posted by 你。 on 2019-12-30 10:28:06

Question


I need to test my Gap Statistics algorithm (which should tell me the optimum k for a dataset). To do so, I need to generate a big, easily clusterable dataset, so that I know the optimum number of clusters a priori. Do you know a fast way to do this?


Answer 1:


It very much depends on what kind of dataset you expect: 1D, 2D, 3D, normally distributed, sparse, etc. And how big is "big"? Thousands, millions, billions of observations?

Anyway, my general approach to creating easy-to-identify clusters is to concatenate vectors of random numbers with different offsets and spreads, together with a matching vector of group labels:

DataSet = [5*randn(1000,1);20+3*randn(1000,1);120+25*randn(1000,1)];
Groups = [1*ones(1000,1);2*ones(1000,1);3*ones(1000,1)];
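The same idea can be sketched as a loop, which makes the known number of clusters explicit and lets you check that each group ended up where you intended. This is only an illustration: the offsets and spreads are the ones used above, while the seed and the variable names n, mu, sigma and groupMeans are assumptions of this sketch.

```matlab
rng(42);                    % arbitrary seed, only so the dataset is reproducible
n     = 1000;               % observations per cluster
mu    = [0 20 120];         % cluster offsets (same values as above)
sigma = [5 3 25];           % cluster spreads (same values as above)
k     = numel(mu);          % the "true" number of clusters, known a priori

DataSet = zeros(n*k, 1);
Groups  = zeros(n*k, 1);
for c = 1:k
    rows = (c-1)*n + (1:n);                      % rows belonging to cluster c
    DataSet(rows) = mu(c) + sigma(c)*randn(n, 1);
    Groups(rows)  = c;                           % ground-truth label
end

% Sanity check: the sample mean of each group should be close to mu
groupMeans = accumarray(Groups, DataSet, [], @mean);
```

The value of k is then the answer the algorithm is expected to recover, and Groups can be compared against whatever labels the clustering step produces.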

This can be extended to N features by generating a matrix with one column per feature, e.g.

randn(1000,5)

or concatenating horizontally

DataSet1 = [5*randn(1000,1);20+3*randn(1000,1);120+25*randn(1000,1)];
DataSet2 = [-100+7*randn(1000,1);1+0.1*randn(1000,1);20+3*randn(1000,1)];
DataSet = [DataSet1 DataSet2];
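For more than a couple of features it may be easier to describe each cluster by a row of centre coordinates and fill the matrix in a loop. The following is a minimal sketch under that assumption; the centres, the common spread, the seed and the final shuffle are illustrative choices, not part of the original answer.

```matlab
rng(1);                                   % arbitrary seed for reproducibility
n       = 500;                            % observations per cluster
centres = [0 0; 10 10; -10 10; 10 -10];   % hypothetical centres, one row per cluster
spread  = 1;                              % common standard deviation (assumed)
[k, d]  = size(centres);                  % k clusters, d features

DataSet = zeros(n*k, d);
Groups  = zeros(n*k, 1);
for c = 1:k
    rows = (c-1)*n + (1:n);
    % shift standard-normal noise to the cluster centre
    DataSet(rows, :) = repmat(centres(c, :), n, 1) + spread*randn(n, d);
    Groups(rows)     = c;
end

% Shuffle the rows so the ordering does not give the clusters away
perm    = randperm(n*k);
DataSet = DataSet(perm, :);
Groups  = Groups(perm);
```

With centres about ten or more spreads apart, the clusters barely overlap, so the expected optimum is k = 4. Moving the centres closer together gives a harder test case.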

and so on.

randn also accepts more than two size arguments, e.g.

randn(1000,10,3);

which returns a 1000-by-10-by-3 array. This is useful for looking at higher-dimensional clusters.
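One possible way to use such a three-dimensional array, sketched here as an assumption rather than as the answerer's method, is to treat each page along the third dimension as one cluster, shift it by its own offset, and then stack the pages into an ordinary observations-by-features matrix. The offsets and variable names are hypothetical.

```matlab
rng(7);                              % arbitrary seed for reproducibility
n = 1000; d = 10; k = 3;             % observations per cluster, features, clusters
offsets = [0 20 40];                 % hypothetical per-cluster offsets

X = randn(n, d, k);                  % page c holds the noise for cluster c
for c = 1:k
    X(:, :, c) = X(:, :, c) + offsets(c);
end

% Stack the pages vertically: rows 1..n are cluster 1, n+1..2n cluster 2, ...
DataSet = reshape(permute(X, [1 3 2]), n*k, d);
Groups  = kron((1:k)', ones(n, 1));  % matching ground-truth labels
```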

If you don't yet know what kind of datasets the algorithm is going to be applied to, you should find that out first, so that the test data resembles them.



Source: https://stackoverflow.com/questions/17049081/how-to-generate-a-clusterable-dataset-in-matlab
