How to vectorize json data for KMeans?

Submitted on 2020-01-02 19:47:25

Question


I have a number of questions and choices which users are going to answer. They have the format like this:

question_id, text, choices

For each user I store the answered questions and the selected choices as a JSON document in MongoDB:

{user_id: "",  "question_answers" : [{"question_id": "choice_id", ..}] }

Now I'm trying to use K-Means clustering and streaming to find the most similar users based on their question choices, but I need to convert my user data into numeric vectors like the example in Spark's docs here.

A K-Means data sample, and my desired output format:

0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2

I've already tried scikit-learn's DictVectorizer, but it doesn't seem to work well.

I created a key for each question_choice combination like this:

from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = [{'question_1_choice_1': 1, 'question_1_choice_2': 1}, ..]
X = v.fit_transform(D)

And I try to transform each of my user's question/choice pairs into this:

v.transform({'question_1_choice_2': 1, ...})

And I get a result like this:

[[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.]]
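For reference, the snippets above can be assembled into a self-contained example (the question/choice keys below are hypothetical placeholders, not real data):

```python
from sklearn.feature_extraction import DictVectorizer

# Each user is a dict mapping "question_choice" keys to 1 (selected).
D = [
    {'question_1_choice_1': 1, 'question_2_choice_2': 1},
    {'question_1_choice_2': 1, 'question_2_choice_2': 1},
]

v = DictVectorizer(sparse=False)
X = v.fit_transform(D)  # learns one column per question_choice key seen in D
print(X)                # [[1. 0. 1.]
                        #  [0. 1. 1.]]

# Transforming a new user reuses the learned columns;
# keys not seen during fit are silently dropped.
new_user = v.transform([{'question_1_choice_2': 1}])
print(new_user)         # [[0. 1. 0.]]
```

Columns are ordered alphabetically by feature name, which is why `question_1_choice_2` lands in the second position here.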

Is this the right approach? It requires me to build a dict of all my choices and answers every time. Is there a way to do this in Spark?

Thanks in advance. Sorry I'm new to data science.


Answer 1:


Don't use K-Means with categorical data. Let me quote How to understand the drawbacks of K-means by KevinKim:

  • k-means assumes the variance of the distribution of each attribute (variable) is spherical;

  • all variables have the same variance;

  • the prior probability for all k clusters is the same, i.e. each cluster has a roughly equal number of observations.

If any one of these 3 assumptions is violated, then k-means will fail.

With encoded categorical data, the first two assumptions are almost surely violated.
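To see why, note that a one-hot column that equals 1 with frequency p has variance p(1 - p), so columns for choices of different popularity necessarily have different variances. A quick check on made-up data:

```python
import numpy as np

# One-hot columns for two choices of different popularity:
# choice A picked by 9 of 10 users, choice B by 2 of 10.
col_a = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
col_b = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

# Variance of a Bernoulli(p) column is p * (1 - p).
print(col_a.var())  # 0.9 * 0.1 = 0.09
print(col_b.var())  # 0.2 * 0.8 = 0.16
```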

For further discussion see K-means clustering is not a free lunch by David Robinson.

I'm trying to use K-Means clustering and streaming to find most similar users based on their choices of questions

For similarity searches use MinHashLSH with approximate joins:

  • https://spark.apache.org/docs/latest/ml-features.html#minhash-for-jaccard-distance

You'll have to StringIndex and OneHotEncode all variables for that, as shown in the following answers:

  • How to handle categorical features with spark-ml?

  • Fit a dataframe into randomForest pyspark

See also the comment by henrikstroem.
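MinHashLSH approximates Jaccard similarity at scale; on small data the underlying measure can be computed directly, which makes the idea concrete. A plain-Python sketch (the user answer sets are made up for illustration):

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two sets of (question, choice) pairs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical users: each answers three questions with one choice each.
user1 = {('q1', 'c1'), ('q2', 'c2'), ('q3', 'c1')}
user2 = {('q1', 'c1'), ('q2', 'c2'), ('q3', 'c3')}
user3 = {('q1', 'c2'), ('q2', 'c1'), ('q3', 'c3')}

print(jaccard_similarity(user1, user2))  # 2 shared of 4 total -> 0.5
print(jaccard_similarity(user1, user3))  # 0 shared of 6 total -> 0.0
```

Spark's MinHashLSH lets you run the same kind of nearest-neighbour search via `approxSimilarityJoin` without computing every pairwise similarity.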



Source: https://stackoverflow.com/questions/45835524/how-to-vectorize-json-data-for-kmeans
