Question
One-hot encoding doesn't sound like a great idea when a column has hundreds of categories, e.g. a data set where one of the columns is "first name". What's the best approach to encoding this sort of data?
Answer 1:
I recommend the hashing trick:
https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick
It's cheap to compute, easy to use, allows you to specify the dimensionality, and often serves as a very good basis for classification.
For your specific application, I would hash feature-value pairs, like ('FirstName', 'John'), then increment the bucket that the hashed value maps to.
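As a rough illustration of the idea above, here is a minimal sketch using only the Python standard library. The function name `hash_features`, the `column=value` key format, the choice of SHA-256, and the default of 16 buckets are all assumptions made for this example, not something the answer prescribes.

```python
import hashlib


def hash_features(row, n_buckets=16):
    """Map a dict of column -> categorical value to a fixed-length count vector.

    Hypothetical helper for illustration; n_buckets sets the output dimensionality.
    """
    vec = [0] * n_buckets
    for column, value in row.items():
        # Hash the (feature, value) pair rather than the value alone, so the
        # same string in two different columns is treated as two features.
        key = f"{column}={value}".encode("utf-8")
        # Use a stable hash: Python's built-in hash() for strings is
        # randomized per process unless PYTHONHASHSEED is fixed.
        digest = hashlib.sha256(key).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # Increment the bucket for the hashed value; collisions simply add up.
        vec[index] += 1
    return vec


row = {"FirstName": "John", "City": "Boston"}
vec = hash_features(row, n_buckets=16)
```

The vector length stays at `n_buckets` no matter how many distinct first names appear, at the cost of occasional collisions between unrelated values. If scikit-learn is available, its `FeatureHasher` class provides a ready-made version of the same technique.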
Answer 2:
If you have a large number of categories, classification algorithms do not work well. A better approach is to apply a regression algorithm to the data and then train an offset on its output, which should give you better results.
Sample code can be found here.
Source: https://stackoverflow.com/questions/35406985/encoding-large-numbers-of-categorical-variables-as-input-data