Question
I am writing a MapReduce program that runs the K-means clustering algorithm on a large data file. Each observation has columns that include both categorical and numerical variables. Categorical variables are not suitable for the distance calculation in K-means, so we need to filter out the columns with categorical entries.
My question is: filtering out entries that contain characters is easy, but what if a column contains only numbers yet should be treated as categorical (such as Zipcode or ID)?
Thank you!
Answer 1:
Removing all categorical variables is probably not the way to go. Did you try transforming your data set into a numerical data set? There are different methods; for instance:
Given a categorical variable a (let's say colours) containing (say) 3 categories (black, white and blue), you can replace a in your data set with three new binary variables (a_1, a_2, a_3). For a given object, exactly one of these new binary variables should be equal to one and all others should be zero. So, if an object had a=black, then a_1=1, a_2=0, a_3=0.
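A minimal sketch of this encoding in plain Python may make it concrete. The function name `one_hot` and the sample `colours` list are made up for illustration; they are not part of any library or of the original program.

```python
def one_hot(values, categories=None):
    """Encode a list of category labels as rows of 0/1 indicators.

    `one_hot` is a hypothetical helper written for this example.
    If `categories` is not given, the distinct values are used in sorted order.
    """
    if categories is None:
        categories = sorted(set(values))
    # Map each category label to its column position.
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one indicator is set per object
        rows.append(row)
    return categories, rows


# Hypothetical sample data: a = colour, with a_1=black, a_2=white, a_3=blue.
colours = ["black", "white", "blue", "black"]
cats, encoded = one_hot(colours, categories=["black", "white", "blue"])
for colour, row in zip(colours, encoded):
    print(colour, row)
```

Because the function treats every value as a label rather than a quantity, it should work the same way on a column of numeric codes, although a column with very many distinct codes would produce very many binary variables.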
You still need to standardise these new variables. There are different ways; for example, you could centre each one with a_1 = a_1 - mean(a_1), where mean(a_1) is simply the frequency of that category in the data.
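As a rough sketch of that centring step, assuming the binary variables are already available as rows of 0/1 values (the helper `center_columns` and the sample rows are invented for this example):

```python
def center_columns(rows):
    """Subtract each column's mean from every entry in that column.

    `center_columns` is a hypothetical helper; for a 0/1 indicator column
    the mean equals the fraction of objects in that category.
    """
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    centered = [[r[j] - means[j] for j in range(n_cols)] for r in rows]
    return means, centered


# Hypothetical indicator rows for four objects: black, white, blue, black.
indicators = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
means, centered = center_columns(indicators)
print(means)     # per-category frequencies
print(centered)  # centred indicator values
```

This only centres the variables. Whether to also rescale them, for example so that they are comparable to the numerical columns, is a separate choice that the approach above leaves open.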
Source: https://stackoverflow.com/questions/23328409/kmeans-dealing-with-categorical-variable