KMeans dealing with categorical variable

Question

I am writing a MapReduce program that runs the KMeans clustering algorithm on a large data file. Each observation has columns that include both categorical and numerical variables. Categorical variables are not suitable for the distance calculation in KMeans, so the columns with categorical entries need to be filtered out.

My question is: filtering out entries that contain characters is easy, but what if a column contains only numbers yet should be treated as categorical (such as a zip code or an ID)?

Thank you!

Answer 1:

Removing all categorical variables is probably not the way to go. Have you tried transforming your data set into a numerical one? There are different methods; for instance:

Given a categorical variable a (let's say colours) containing, say, 3 categories (black, white and blue), you can replace a in your data set with three new binary variables (a_1, a_2, a_3). For a given object, exactly one of these new binary variables should equal one, and all others should be zero. So, if an object had a=black, then a_1=1, a_2=0, a_3=0.
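
As a rough illustration of the encoding described above, here is a minimal sketch in plain Python. The function name one_hot and the list COLOURS are made up for this example, and the category order is an assumption; any fixed order works as long as it is the same for every record.

```python
def one_hot(value, categories):
    """Return one 0/1 indicator per category, in the order given."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


# Hypothetical category list for the variable a (colours).
COLOURS = ["black", "white", "blue"]

# An object with a=black becomes a_1=1, a_2=0, a_3=0.
a_1, a_2, a_3 = one_hot("black", COLOURS)
print(a_1, a_2, a_3)
```

The same idea could apply to a numeric-looking column such as an ID, provided the set of categories is known up front and is small enough to give each one its own binary column.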

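You still need to standardise these new variables. There are different ways to do this; for example, you could centre each one by subtracting its mean, a_1 = a_1 - mean(a_1), where the mean of a binary variable is simply the frequency of that category.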
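
The centring step could look something like the sketch below, again in plain Python. The helper name centre_indicator and the sample column are hypothetical, and the sketch assumes the whole column fits in memory; in a MapReduce job the mean would more likely come from a separate counting pass.

```python
def centre_indicator(column):
    """Subtract the column mean from every value of a 0/1 indicator column.

    For a binary column the mean is the fraction of ones, i.e. the
    frequency of the category it encodes.
    """
    mean = sum(column) / len(column)
    return [x - mean for x in column], mean


# Hypothetical indicator column a_1 for four objects, two of them black.
a_1 = [1, 0, 0, 1]
centred, frequency = centre_indicator(a_1)
print(frequency)
print(centred)
```

Centring is only one of the possible choices mentioned in the answer; dividing by the standard deviation as well is another common option.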

Source: https://stackoverflow.com/questions/23328409/kmeans-dealing-with-categorical-variable

Tags

Hadoop

MapReduce

k-means
