Is there any good reason for columns to be characters instead of factors?

Submitted by 做~自己de王妃 on 2019-12-23 15:12:50

Question


This might seem like a silly question, but after working with R for a couple of months I realised that I often find myself converting strings to factors because, for example, the tabulate function does not work on strings.

At this point I am contemplating simply always converting any string to a factor. But that raises the question: is there any reason not to (apart from carrying out operations on the string itself)?


Answer 1:


Factors have a dual representation: the 'label' and the underlying integer encoding of the level. Which of these representations R uses in a given context can be subtle and confusing.

One illustration of where this can be confusing is subsetting. Here's a named vector, a character vector, and a factor with default (alphabetically ordered) levels:

x = c(foo = 1, bar = 2)
y = c("bar", "foo")
z = factor(y)        # default levels are "bar", "foo", i.e., alphabetical

Subsetting x by y matches each character value to a name, but subsetting x by z uses the underlying level encoding: the integer codes of z are 1 and 2, so x[z] selects by position, like x[c(1, 2)].

> x[y]
bar foo 
  2   1 
> x[z]
foo bar 
  1   2 
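
A minimal sketch of how to inspect the two representations and subset by label instead. It redefines x and z so that it runs on its own, and it sets the levels explicitly (an assumption added here so the result does not depend on the locale).

```r
x <- c(foo = 1, bar = 2)
z <- factor(c("bar", "foo"), levels = c("bar", "foo"))

# The underlying encoding: integer codes indexing into levels(z)
codes <- as.integer(z)            # 1 2

# Subsetting by the factor itself uses those codes, i.e. positions
by_code <- x[z]                   # same as x[c(1, 2)]

# Converting to character first subsets by label, i.e. by name
by_label <- x[as.character(z)]    # same as x[c("bar", "foo")]
```

Wrapping a factor in as.character() before using it as an index is one way to make sure the labels are what gets matched.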

This can be even more confusing because R can run under different locales (e.g., I am using the en_US locale, i.e., US English), and the collation (sort) order differs between locales, so the default levels of the same data might be different in different locales.
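
One way to avoid the locale dependence is to pass the levels explicitly rather than relying on the default ordering. The sketch below is an illustration, not part of the original answer: the vector v is made up, and it relies on sort(method = "radix"), which is documented to collate character vectors as in the C locale.

```r
v <- c("b", "A", "a", "B")

# Default levels come from sort(unique(v)), which follows the session
# locale, so their order may differ from one machine to another
z_default <- factor(v)

# Explicit levels in C-locale order, independent of the session locale
lv <- sort(unique(v), method = "radix")
z_fixed <- factor(v, levels = lv)
levels(z_fixed)                   # "A" "B" "a" "b"
```

With fixed levels, the integer codes behind z_fixed are the same in every session, so code that depends on them behaves consistently.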



Source: https://stackoverflow.com/questions/52351773/is-there-any-good-reason-for-columns-to-be-characters-instead-of-factors
