Spark MLlib / K-Means intuition

Posted by 寵の児 on 2019-12-05 17:50:26

First of all, KMeans is a clustering algorithm and therefore unsupervised, so there is no built-in "checking of the training set against itself" (although you can of course do that manually ;).

Your understanding is actually quite good. The point you are missing is that model.predict(Utils.featurize(t)) returns the cluster that KMeans assigns t to. I think you want to check

model.predict(Utils.featurize(t)) == i

in your code, since i iterates over all cluster labels.
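To make that check concrete, here is a minimal sketch in plain Scala without Spark. The names ClusterLoopSketch, predict and membersOfCluster are hypothetical stand-ins, not the MLlib API; predict only mimics what a fitted k-means model does when it assigns a vector to its nearest centroid.

```scala
object ClusterLoopSketch {
  type Vec = Array[Double]

  // Squared Euclidean distance between two vectors of equal length.
  def squaredDistance(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Hypothetical stand-in for model.predict: index of the nearest centroid.
  def predict(centroids: Seq[Vec], v: Vec): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), v))

  // Keep only the items whose predicted cluster equals the label i.
  // `featurize` plays the role of Utils.featurize and is supplied by the caller.
  def membersOfCluster[T](i: Int, items: Seq[T], centroids: Seq[Vec])(featurize: T => Vec): Seq[T] =
    items.filter(t => predict(centroids, featurize(t)) == i)
}
```

Looping i over centroids.indices and calling membersOfCluster for each i then lists every item under exactly one cluster label, which is the same pattern as the comparison against i above.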

Also a small remark: the feature vector is built from a 2-gram model of the characters in each tweet. This intermediate step is important ;)
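As a rough illustration of what such a featurization step could look like, the sketch below counts character 2-grams into a fixed-size vector using the hashing trick. This is an assumption about how a featurizer might work, not the actual Utils.featurize; the name FeaturizeSketch and the default of 1000 buckets are chosen only for the example.

```scala
object FeaturizeSketch {
  // Hypothetical featurizer: slide a window of 2 characters over the text,
  // hash each 2-gram into one of `numFeatures` buckets, and count the hits.
  def featurize(text: String, numFeatures: Int = 1000): Array[Double] = {
    val counts = Array.fill(numFeatures)(0.0)
    text.sliding(2).foreach { gram =>
      // floorMod keeps the bucket index non-negative even for negative hash codes.
      val bucket = java.lang.Math.floorMod(gram.hashCode, numFeatures)
      counts(bucket) += 1.0
    }
    counts
  }
}
```

Two tweets that share many character pairs end up with similar count vectors, which is what lets KMeans group them by distance.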

2-gram (for words) means: "A bear shouts at a bear" => {(A, bear), (bear, shouts), (shouts, at), (at, a), (a, bear)}, i.e. "a bear" is counted twice (ignoring case). For characters the 2-grams would be (A, [space]), ([space], b), (b, e) and so on.
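The same example can be reproduced with Scala's sliding(2). This is only a sketch: NGramSketch, wordBigrams and charBigrams are names made up for the illustration.

```scala
object NGramSketch {
  // Word 2-grams: split on whitespace, then pair each word with its successor.
  def wordBigrams(text: String): List[(String, String)] =
    text.split("\\s+").toList.filter(_.nonEmpty).sliding(2).collect {
      case List(a, b) => (a, b)
    }.toList

  // Character 2-grams: every pair of adjacent characters, spaces included.
  def charBigrams(text: String): List[String] =
    if (text.length < 2) Nil else text.sliding(2).toList
}
```

For instance, NGramSketch.wordBigrams("A bear shouts at a bear") yields five pairs, and after lowercasing the pair (a, bear) appears twice.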
