Is there an advantage to ordering a categorical variable?

僤鯓⒐⒋嵵緔 submitted on 2019-12-02 01:20:29

Among other things, it allows you to compare values from those factors:

> ord.fac <- ordered(c("small", "medium", "large"), levels=c("small", "medium", "large"))
> fac <- factor(c("small", "medium", "large"), levels=c("small", "medium", "large"))
> ord.fac[[1]] < ord.fac[[2]]
[1] TRUE
> fac[[1]] < fac[[2]]
[1] NA
Warning message:
  In Ops.factor(fac[[1]], fac[[2]]) : '<' not meaningful for factors

Documentation suggests there is quite an impact from a modeling perspective:

"Ordered factors differ from factors only in their class, but methods and the model-fitting functions treat the two classes quite differently"

but I'll have to let someone familiar with those use cases provide the details on that.

You should use ordinal data only when it makes sense from the data's point of view (i.e., the data is naturally ordered, as in the case of small, medium and large).

In modeling terms, a categorical variable has a dummy variable created for each but one of the possible values it can take. The effect of a dummy variable essentially gives you the effect of that level compared to the reference level (the level without a dummy variable). In general, dealing with a categorical variable is easier than dealing with ordinal data.
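As a minimal sketch of that dummy coding, the snippet below builds the design matrix for an unordered factor. The `size` vector is made-up illustration data, and the result assumes R's default treatment contrasts for unordered factors.

```r
# Hypothetical data: an unordered factor with three levels.
size <- factor(c("small", "medium", "large", "medium"),
               levels = c("small", "medium", "large"))

# With the default treatment contrasts, model.matrix() creates one
# indicator column per level except the first (the reference level).
X <- model.matrix(~ size)
print(colnames(X))
print(X)
```

The first level, "small", has no column of its own, so the coefficients on the other columns would be read as differences from "small".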

Ordinal data is not modeled in the same way as continuous or categorical data (unless you treat the values as continuous, which is often done). In R, the ordinal package has several functions for this kind of modeling, based on a cumulative link function (the link function maps the cumulative probabilities of the ordered categories onto a scale where a linear predictor can be used, much as in linear regression).
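A rough sketch of such a model, assuming the ordinal package is installed. It uses `clm()` and the `wine` dataset that ships with that package; the choice of predictors here is only for illustration.

```r
# Only runs if the 'ordinal' package is available.
if (requireNamespace("ordinal", quietly = TRUE)) {
  data(wine, package = "ordinal")

  # 'rating' is an ordered factor, so it can be the response of a
  # cumulative link model; link = "logit" gives a proportional odds model.
  fit <- ordinal::clm(rating ~ temp + contact, data = wine, link = "logit")
  print(summary(fit))
}
```

The summary reports one threshold per boundary between adjacent rating categories, plus a single coefficient for each predictor.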

The advantage of recoding categorical data as ordinal is that the inferences made from the data better represent the data and have a more intuitive interpretation.

The most useful difference is in displaying results. If we have levels low, med, and high and create an appropriate ordered factor, then boxplots, barplots, tables, etc. will display the results in the order low, med, high. But if we create an unordered factor and go with the default ordering, then the plots/tables will put things in alphabetical order (high, low, med), which makes less sense.
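A small sketch of the display difference, using a made-up vector `x`:

```r
# Hypothetical observations.
x <- c("low", "high", "med", "low", "high", "low")

# Default factor: levels are sorted alphabetically.
default_fac <- factor(x)

# Ordered factor with the levels given in their natural order.
ordered_fac <- factor(x, levels = c("low", "med", "high"), ordered = TRUE)

print(levels(default_fac))
print(table(default_fac))
print(table(ordered_fac))
```

The display order comes from the order of the levels, so an unordered factor created with an explicit `levels` argument would be tabulated and plotted in the same order as the ordered one.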

The default contrasts/dummy variable encoding is different for ordered and non-ordered factors: by default R uses treatment contrasts for unordered factors and polynomial contrasts for ordered ones. You can change the encoding, so this only affects things if you use the defaults. The difference can change the interpretation of individual coefficients, but in general it will not affect the overall fit (for the linear model and its extensions; other tools, like trees, could behave differently).
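To illustrate, the sketch below prints the default contrast matrices for both kinds of factor and then fits the same linear model with each. The response `y` is simulated noise and all variable names are made up for the example; R's default `contrasts` option is assumed.

```r
size <- c("small", "medium", "large")
fac     <- factor(size, levels = size)
ord.fac <- factor(size, levels = size, ordered = TRUE)

print(contrasts(fac))      # treatment (dummy) coding
print(contrasts(ord.fac))  # polynomial coding: linear (.L) and quadratic (.Q)

# Simulated data: five observations per level, pure noise response.
set.seed(1)
d <- data.frame(g = rep(size, each = 5), y = rnorm(15))
d$g_fac <- factor(d$g, levels = size)
d$g_ord <- factor(d$g, levels = size, ordered = TRUE)

fit_fac <- lm(y ~ g_fac, data = d)
fit_ord <- lm(y ~ g_ord, data = d)

# The coefficients differ, but the fitted values are the same.
print(coef(fit_fac))
print(coef(fit_ord))
print(all.equal(fitted(fit_fac), fitted(fit_ord)))
```

Both encodings describe the same three group means, so the fitted values agree even though the individual coefficients are read differently.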
