Invalid Characters causing error in rlm()

末鹿安然 posted on 2021-01-28 12:37:22

Question


A data frame which has invalid characters in the column names is causing an error in rlm().

Taking a deeper look, it appears that within rlm() the variable xvars contains the names of the formula's explanatory variables, but it puts backticks around the offending names. Then, when xvars is used as an index into a data frame, namely mf[xvars], it causes the following error:

Error in `[.data.frame`(mf, xvars) : undefined columns selected

Is this the expected behavior? (I realize the key phrase here is "invalid characters".) Curiously, calling lm() on the same model and data frame causes no problems.

# SAMPLE DATA
library(MASS)  # provides rlm()
mydf <- data.frame(matrix(rnorm(36),ncol=6))
colnames(mydf) <- c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2")

rlm(y~., data=mydf)  # Error

lm(y~., data=mydf)   # No Problem

# Clean up column names
colnames(mydf) <- make.names(colnames(mydf))
rlm(y~., data=mydf) # No Problem 

Taking a look at MASS:::rlm.formula, it appears the error is caused by mf[xvars] in the following lines:

xlev <- if (length(xvars) > 0L) {
    xlev <- lapply(mf[xvars], levels)
    xlev[!sapply(xlev, is.null)]
}

Any thoughts why the backticks are being added but then causing an error?


Additional Info

I copied the rlm() function, added dput(mf) and dput(xvars), and got the following values. Note that the value of xvars differs from the names assigned above (i.e., backticks are added), while the names of mf are the same as the names given above.

# dput yielded
mf <- structure(list(y = c(-0.242914027018629, 0.724255425682537, -0.0578467214604185, -0.274193999595702, -0.38985000750839, 0.406046200943395), x1 = c(1.53071709960635, -1.87493297716611, 1.0936519723035, -0.977011182431237, -0.510890461021046, 1.20136627562427), x2 = c(-0.801995963036553, 1.30590232081605, 0.635922235436178, -1.86824341731708, -2.76797814532917, -0.497992681627495), `x1^2` = c(0.914146279518207, 0.103458073891876, -1.29818230391818, -0.629048606358592, 1.71534374557621, 0.922690967521984), `x2^2` = c(-0.0879726513660469, 1.05299413769867, 1.01955640371072, 0.546413685721721, 0.947757793667223, -0.0998700630220064), `x1:x2` = c(-0.757490494166813, 1.31307393014016, 1.90233916482184, 0.68844011701049, -1.28717997826724, -0.581800325341162)), .Names = c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2"), terms = y ~     x1 + x2 + `x1^2` + `x2^2` + `x1:x2`, row.names = c(NA, 6L), class = "data.frame")
xvars <- c("x1", "x2", "`x1^2`", "`x2^2`", "`x1:x2`")

mf[xvars]  
# Error in `[.data.frame`(mf, xvars) : undefined columns selected


# Removing the backticks from xvars eliminates the error.
xvars2 <- gsub("`", "", xvars, fixed = TRUE)
mf[xvars2]  # No Error

Answer 1:


Your issue boils down to the fact that you are using non-syntactic variable names.

These should be used with caution, and without expectation that package authors will be able to anticipate any issues that may arise.

To quote from the help for formula:

Variable names can be quoted by backticks like this in formulae, although there is no guarantee that all code using formulae will accept such non-syntactic names.
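
As a rough sketch of how one might spot such names before fitting, the snippet below compares each name with what make.names() would turn it into. The vector nms simply repeats the column names from the question; the variable names syntactic and cleaned are illustrative only.

```r
# Column names from the question; three of them are non-syntactic.
nms <- c("y", "x1", "x2", "x1^2", "x2^2", "x1:x2")

# A name is syntactic if make.names() leaves it unchanged.
syntactic <- make.names(nms) == nms
offending <- nms[!syntactic]

# make.names() replaces each invalid character with a dot.
cleaned <- make.names(nms, unique = TRUE)
```

Here offending holds the three names containing ^ or :, and cleaned is what the question's "clean up" step assigns back to colnames(mydf).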

The issue is in how xvars is created in rlm.formula:

xvars <- as.character(attr(mt, "variables"))[-1L]

and then in how it is used later on:

xlev <- if (length(xvars) > 0L) {
        xlev <- lapply(mf[xvars], levels)
        xlev[!sapply(xlev, is.null)]
    }

which, as you show, does not work.

The as.character() call will create backtick-quoted strings for non-syntactic names. If the names are already backticked, it will create double-backticked strings,

i.e. if the column name was "x1^2", the element in xvars becomes "`x1^2`".

Indexing with such a string fails in [.data.frame, for example:

x <- data.frame(`a` = 1)
x[, "`a`"]
# Error in `[.data.frame`(x, , "`a`") : undefined columns selected

Because the column name is "a", not "`a`".

If you backtick the column name itself,

i.e. if the column name was "`x1^2`", the element in xvars becomes "``x1^2``",

which again is not a column in your data.frame.
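
The mismatch can be reproduced without rlm() at all. The following is a minimal sketch, assuming a small made-up data frame df; it shows that the backticks are literally part of the string being looked up, and two ways to get plain names instead. Neither option is claimed to be what MASS does or should do.

```r
# A data frame with a non-syntactic column name (check.names = FALSE keeps it as-is).
df <- data.frame(y = 1:3, `x1^2` = c(1, 4, 9), check.names = FALSE)

quoted <- "`x1^2`"                    # a name carrying literal backticks
in_quoted <- quoted %in% names(df)    # FALSE: the backticks are part of the string

# Option 1: strip the backticks before indexing
# (this would also strip backticks that genuinely belong to a name).
plain <- gsub("`", "", quoted, fixed = TRUE)
col_by_plain <- df[plain]

# Option 2: all.vars() returns the variable names as plain strings.
vars <- all.vars(y ~ `x1^2`)          # c("y", "x1^2")
col_by_vars <- df[vars[-1]]
```

Both col_by_plain and col_by_vars select the same single column, because the lookup string now matches the stored name exactly.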

The reason lm works is that it does not attempt this definition and use of xvars; instead it uses model.matrix to define the design matrix x directly and passes it to lm.fit.

If you want to fit the model y ~ x1 + x2 + x1:x2 + x1^2 + x2^2, then you can use

rlm(y ~ x1*x2 + I(x1^2) + I(x2^2), data = mydf)

In this case you only need three columns in your data.frame (or objects in your evaluation environment): y, x1 and x2. The I() function lets you perform arithmetic operations on a variable inside a formula, because terms.formula treats the I() call as a single term instead of interpreting ^ as a formula operator.
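
To illustrate the effect of I(), here is a small sketch on simulated data. The data frame d, its coefficients, and the seed are invented for illustration and are not from the question. It compares the design-matrix columns with and without I(), then fits the suggested model with rlm().

```r
library(MASS)  # rlm(); MASS is a recommended package shipped with R

set.seed(42)   # hypothetical simulated data, for illustration only
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x1^2 + rnorm(50, sd = 0.1)

# Without I(), ^ is a formula operator: x1^2 collapses to just x1.
cols_plain <- colnames(model.matrix(~ x1^2, data = d))

# With I(), each square is computed arithmetically as its own term.
cols_I <- colnames(model.matrix(~ x1 * x2 + I(x1^2) + I(x2^2), data = d))

# The same formula fits with rlm() using only the columns y, x1 and x2.
fit <- rlm(y ~ x1 * x2 + I(x1^2) + I(x2^2), data = d)
coef_names <- names(coef(fit))
```

The fitted coefficients are labelled with the I(...) expressions themselves, so no column with a non-syntactic name ever has to exist in the data frame.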



Source: https://stackoverflow.com/questions/13327287/invalid-characters-causing-error-in-rlm
