I have some data:
transaction <- c(1,2,3);
date <- c("2010-01-31","2010-02-28","2010-03-31");
type <- c("debit", "debit", "credit");
amo
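Factors vs character vectors when doing stats: In terms of doing statistics, there's no practical difference in how R treats factors and character vectors in modelling functions such as lm(): a character vector used as a categorical predictor is converted to a factor internally. In fact, it's often easier to leave categorical variables as character vectors.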
If you fit a regression or ANOVA with lm() using a character vector as a categorical variable, you'll get the normal model output, but with the message:
Warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'character_x' converted to a factor
Factors vs character vectors when manipulating dataframes: When manipulating dataframes, however, character vectors and factors are treated very differently. Some information on the annoyances of R & factors can be found on the Quantum Forest blog, R pitfall #3: friggin’ factors.
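As a rough illustration of the kind of difference meant here, the sketch below shows two common surprises when the same data is held as a factor instead of a character vector. The values and variable names are made up for the example and are not taken from the question.

```r
# Hypothetical values, for illustration only
x_chr <- c("10", "20", "30")
x_fct <- factor(x_chr)

# A character vector converts to the numbers you expect
as.numeric(x_chr)

# A factor converts to its internal level codes, not the values
codes <- as.numeric(x_fct)

# Going through as.character() first recovers the values
values <- as.numeric(as.character(x_fct))

# Assigning a value that is not an existing level yields NA
# (R warns: invalid factor level, NA generated)
x_fct[1] <- "40"
x_fct
```

With a character vector, the same assignment would simply store "40".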
It's useful to set stringsAsFactors = FALSE when reading data in from a .csv or .txt file using read.table or read.csv. As noted in another reply, you have to make sure that everything in your character vector is consistent, or else every typo will be treated as a separate factor level. You can use the function gsub() to fix typos.
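A minimal sketch of that workflow follows. The CSV content, the column names, and the typo "debti" are all invented for the example; the data is passed inline through the text argument of read.csv so the snippet runs without a file. stringsAsFactors = FALSE is given explicitly because it only became the default in R 4.0.0.

```r
# Hypothetical CSV content, supplied inline so no file is needed
csv_text <- "transaction,type\n1,debit\n2,debit\n3,credit\n4,debti"

# Keep text columns as character vectors rather than factors
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)
is.character(df$type)

# The typo shows up as an extra distinct value
before <- unique(df$type)

# Fix the typo; the anchors ^ and $ make sure only the whole value matches
df$type <- gsub("^debti$", "debit", df$type)
after <- unique(df$type)
```

Once the values are consistent, the column can be converted with as.factor() if a factor is needed later.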
Here is a worked example showing how lm() gives you the same results with a character vector and a factor.
A random independent variable:
continuous_x <- rnorm(10,10,3)
A random categorical variable as a character vector:
character_x <- rep(c("dog","cat"),5)
Convert the character vector to a factor variable:
factor_x <- as.factor(character_x)
Give the two categories random values:
character_x_value <- ifelse(character_x == "dog", 5*rnorm(1,0,1), rnorm(1,0,2))
Create a random relationship between the independent variables and a dependent variable:
continuous_y <- continuous_x*10*rnorm(1,0) + character_x_value
Compare the output of a linear model with the factor variable and the character vector. Note the warning that is given with the character vector.
summary(lm(continuous_y ~ continuous_x + factor_x))
summary(lm(continuous_y ~ continuous_x + character_x))
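Rather than comparing the two summaries by eye, the fits can also be compared programmatically. This is only a sketch: the seed and the simplified way of generating continuous_y are assumptions added to keep it short and reproducible, and the names fit_factor and fit_character are introduced here for illustration. The coefficient names differ between the two fits (they carry the variable name as a prefix), so the names are dropped before comparing.

```r
set.seed(1)  # assumption: seed added only for reproducibility

continuous_x <- rnorm(10, 10, 3)
character_x <- rep(c("dog", "cat"), 5)
factor_x <- as.factor(character_x)

# Simplified dependent variable (not the exact formula used above)
continuous_y <- 2 * continuous_x + ifelse(character_x == "dog", 5, 0) + rnorm(10)

fit_factor <- lm(continuous_y ~ continuous_x + factor_x)
fit_character <- lm(continuous_y ~ continuous_x + character_x)

# Compare the estimates, ignoring the differing coefficient names
same_coefs <- all.equal(unname(coef(fit_factor)), unname(coef(fit_character)))
same_fitted <- all.equal(unname(fitted(fit_factor)), unname(fitted(fit_character)))
```

Both comparisons should come back TRUE, since lm() converts the character vector to a factor with the same levels before fitting.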