Choosing a classification algorithm for a mix of nominal and numeric data?

北城以北 submitted on 2019-12-03 03:51:42

The issue is that you're representing nominal variables on a continuous scale, which imposes a spurious ordinal relationship between categories when you use machine learning methods. For example, if you code address as one of six possible integers, then address 1 is treated as closer to address 2 than to addresses 3, 4, 5, or 6. Any learner that relies on distances or numeric thresholds will pick up on that artificial ordering, which is going to cause problems.

Instead, translate your 6-value categorical variable into six binary variables, one for each categorical value. Your original feature will then give rise to six features, exactly one of which is on for any given record. Also, keep age as an integer value, since you lose information by making it categorical.
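To make the encoding concrete, here is a minimal sketch in plain Python. The names `one_hot`, `encode_row`, and `ADDRESS_CODES` are made up for illustration, and the address codes 1 to 6 are an assumption about how the original data is labelled.

```python
import math

# Assumed labelling of the six address categories (hypothetical).
ADDRESS_CODES = [1, 2, 3, 4, 5, 6]

def one_hot(value, categories):
    """Return one 0/1 indicator per category, with exactly one set to 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def encode_row(address, age):
    """Six binary address features followed by age kept as a number."""
    return one_hot(address, ADDRESS_CODES) + [age]

row = encode_row(address=3, age=42)
print(row)  # [0, 0, 1, 0, 0, 0, 42]

# With integer codes, address 1 is 1 unit from address 2 but 5 units from
# address 6. After one-hot encoding, every pair of distinct addresses is
# the same distance apart, so no ordering is implied.
d_12 = math.dist(one_hot(1, ADDRESS_CODES), one_hot(2, ADDRESS_CODES))
d_16 = math.dist(one_hot(1, ADDRESS_CODES), one_hot(6, ADDRESS_CODES))
print(math.isclose(d_12, d_16))  # True
```

In practice a library encoder would usually do this for you; the hand-rolled version is only meant to show what the transformation produces.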

As for the choice of algorithm, it's unlikely to make much of a difference (at least initially). Go with whichever is easiest for you to implement. However, make sure you run some sort of cross-validated parameter selection on a dev set before running on your test set, as all algorithms have parameters that can dramatically affect learning accuracy.
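The parameter-selection step can be sketched as follows. This is only an outline under stated assumptions: `k_fold_indices`, `select_parameter`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in for "fit your chosen algorithm on the training indices with this parameter and return its accuracy on the dev indices". The held-out test set never appears in this loop.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, dev_indices); each sample is in dev exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, dev in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, dev

def select_parameter(candidates, score_fn, n_samples, n_folds=5):
    """Return (best_param, mean_scores) using the mean dev score per candidate."""
    mean_scores = {}
    for param in candidates:
        scores = [score_fn(param, train, dev)
                  for train, dev in k_fold_indices(n_samples, n_folds)]
        mean_scores[param] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

def toy_score(param, train_idx, dev_idx):
    # Placeholder only: a real version would train on train_idx and
    # measure accuracy on dev_idx. Here the score simply peaks at param == 5.
    return 1.0 - abs(param - 5) / 10

best_param, scores = select_parameter([1, 3, 5, 7], toy_score, n_samples=100)
print(best_param)  # 5
```

Once a parameter value is chosen this way, you would retrain on all of the non-test data and evaluate on the test set a single time.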

You really need to look at the data and determine whether the features you currently have actually vary with your labels, that is, whether they carry enough signal to tell the classes apart. Because there are so few features but a lot of data, something such as kNN could work well.
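For reference, a bare-bones kNN classifier over this kind of mixed feature vector might look like the sketch below. The data is invented purely for illustration, `knn_predict` is a hypothetical helper, and rescaling age to roughly the 0 to 1 range (here by dividing by 100) is my own assumption so that raw age differences do not swamp the 0/1 address indicators in the distance.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k nearest training points."""
    neighbours = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy rows: six one-hot address indicators, then age / 100 (assumed scaling).
train_X = [
    [1, 0, 0, 0, 0, 0, 0.25],
    [1, 0, 0, 0, 0, 0, 0.30],
    [0, 1, 0, 0, 0, 0, 0.60],
    [0, 1, 0, 0, 0, 0, 0.65],
    [0, 0, 1, 0, 0, 0, 0.62],
]
train_y = ["A", "A", "B", "B", "B"]

query = [1, 0, 0, 0, 0, 0, 0.28]  # address 1, age 28
print(knn_predict(train_X, train_y, query, k=3))  # A
```

The value of `k` is exactly the sort of parameter that should be chosen on a dev set rather than fixed by hand.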

You could also adapt collaborative filtering to your problem, as that would work off similar features.
