Encoding large numbers of categorical variables as input data

Submitted by 魔方 西西 on 2019-12-24 12:44:21

Question


One-hot encoding doesn't sound like a great idea when you're dealing with hundreds of categories, e.g. a data set where one of the columns is "first name". What's the best approach for encoding this sort of data?


Answer 1:


I recommend the hashing trick:

https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick

It's cheap to compute, easy to use, allows you to specify the dimensionality, and often serves as a very good basis for classification.

For your specific application, I would hash feature-value pairs, like ('FirstName','John'), then increment the bucket of the output vector that the hashed value maps to.
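
The pair-hashing idea above can be sketched in a few lines of Python. This is only an illustration, not the answerer's code: the function name `hash_features`, the `n_buckets` parameter, the choice of MD5 as the hash, and the example row are all assumptions made for the sketch.

```python
import hashlib


def hash_features(row, n_buckets=16):
    """Map a dict of column -> categorical value to a fixed-length count vector.

    Hypothetical helper for illustration; n_buckets sets the dimensionality.
    """
    vec = [0] * n_buckets
    for column, value in row.items():
        # Hash the (feature, value) pair rather than the value alone, so the
        # same string in two different columns usually lands in different buckets.
        key = f"{column}={value}".encode("utf-8")
        digest = hashlib.md5(key).digest()
        # Use a stable hash (not Python's built-in hash(), which is salted
        # per process) and reduce it to a bucket index.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vec[index] += 1
    return vec


row = {"FirstName": "John", "City": "Boston"}  # made-up example row
vec = hash_features(row, n_buckets=16)
print(len(vec), sum(vec))  # prints "16 2"
```

The vector length stays fixed no matter how many distinct first names appear. The price is that unrelated pairs can collide in the same bucket, which becomes more likely as `n_buckets` shrinks. If scikit-learn is available, its `FeatureHasher` class provides a library implementation of the same idea.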




Answer 2:


If you have a large number of categories, classification algorithms do not work well. A better approach is to apply a regression algorithm to the data and then train an offset on its output. That should give you better results.

Sample code can be found here.



Source: https://stackoverflow.com/questions/35406985/encoding-large-numbers-of-categorical-variables-as-input-data
