How to vectorize JSON data for KMeans?

Submitted by [亡魂溺海] on 2019-12-06 05:40:01
user8371915

Don't use K-Means with categorical data. Let me quote "How to understand the drawbacks of K-means" by KevinKim:

  • k-means assume the variance of the distribution of each attribute (variable) is spherical;

  • all variables have the same variance;

  • the prior probability for all k clusters are the same, i.e. each cluster has roughly equal number of observations;

If any one of these 3 assumptions is violated, then k-means will fail.

With encoded categorical data, the first two assumptions are almost sure to be violated.
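To see why, here is a minimal sketch in plain Python. The `answers` list and the helper name `one_hot_variances` are made up for illustration and do not come from the original question. A 0/1 indicator column whose category occurs with frequency p has variance p(1 - p), so the one-hot columns of a single variable have different variances whenever the categories are not equally frequent.

```python
from collections import Counter


def one_hot_variances(values):
    """Return {category: population variance of its 0/1 indicator column}."""
    n = len(values)
    counts = Counter(values)
    # An indicator column with mean p has variance p * (1 - p).
    return {cat: (c / n) * (1 - c / n) for cat, c in counts.items()}


# Hypothetical answers of ten users to one multiple-choice question.
answers = ["a"] * 7 + ["b"] * 2 + ["c"] * 1

variances = one_hot_variances(answers)
for cat in sorted(variances):
    print(cat, round(variances[cat], 2))
# a 0.21
# b 0.16
# c 0.09
```

The columns are also negatively correlated, because exactly one of them is 1 in every row, so the encoded data is not spherical either.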

For further discussion see "K-means clustering is not a free lunch" by David Robinson.

"I'm trying to use K-Means clustering and streaming to find most similar users based on their choices of questions"

For similarity searches use MinHashLSH with approximate joins:

You'll have to StringIndex and OneHotEncode all variables for that, as shown in the following answers:

See also the comment by henrikstroem.
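As a rough illustration of the idea behind the MinHash suggestion above, the following plain-Python sketch treats each user's one-hot encoded choices as a set of `question=choice` tokens, computes the exact Jaccard similarity, and estimates it from MinHash signatures. This is not Spark code: the `users` data, the helper names, and the SHA-256 based hash family are all assumptions chosen for the example. Spark's MinHashLSH applies the same principle at scale with its own hash functions.

```python
import hashlib


def tokens(user_choices):
    """Turn {question: choice} into a set of 'question=choice' tokens,
    i.e. the non-zero positions of a one-hot encoding."""
    return {f"{q}={c}" for q, c in user_choices.items()}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def _hash(seed, token):
    # Illustrative hash family: one SHA-256 based hash per seed.
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(token_set, num_hashes=64):
    """For each hash function keep the minimum hash over the set.
    The set must be non-empty."""
    return [min(_hash(seed, t) for t in token_set) for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, which estimates Jaccard."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical users and their answers.
users = {
    "u1": {"q1": "a", "q2": "b", "q3": "c"},
    "u2": {"q1": "a", "q2": "b", "q3": "d"},
    "u3": {"q1": "x", "q2": "y", "q3": "z"},
}

sets = {name: tokens(choices) for name, choices in users.items()}
sigs = {name: minhash_signature(s) for name, s in sets.items()}

print(jaccard(sets["u1"], sets["u2"]))               # exact: 0.5
print(estimated_similarity(sigs["u1"], sigs["u2"]))  # estimate, close to 0.5
print(jaccard(sets["u1"], sets["u3"]))               # exact: 0.0
```

Two users who share two of three answers have four distinct tokens in total, two of them shared, hence the Jaccard similarity of 0.5. Increasing `num_hashes` tightens the estimate at the cost of longer signatures.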
