I want to persist an lm object to a file and reload it into another program. I know I can do this by writing/reading a binary file via saveRDS /
This is an important update!
As mentioned in the previous answer, the most challenging part is to recover $terms as best we can. The suggested method using terms.formula works for the OP's example, but not for the following one with bs() and poly():
dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
library(splines)
fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)
rm(dat)
If we follow the previous answer:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
fit1 <- source("model.R")$value
fit1$terms <- terms.formula(fit1$terms)
We will see that summary.lm and anova.lm work correctly, but predict.lm does not:
predict(fit1, newdata = data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5))
Error in bs(x1, df = 3) : could not find function "bs"
This is because the ".Environment" attribute of $terms is missing. We need to restore it:
environment(fit1$terms) <- .GlobalEnv
Running the predict call above again now gives a different error:
Error in poly(x2, degree = 3) :
'degree' must be less than number of unique points
This is because we are missing the "predvars" attribute, which is needed for safe / correct prediction with bs() and poly().
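To see what "predvars" carries, here is a minimal sketch that fits a small poly() model and compares the "variables" and "predvars" attributes of its terms. The data, the seed and the names fit_nopv, ok and res are made up for illustration; only the attribute names come from the discussion above.

```r
# Minimal sketch with made-up data: what "predvars" adds on top of "variables"
set.seed(1)
dat <- data.frame(x = runif(20), y = rnorm(20))
fit <- lm(y ~ poly(x, degree = 3), data = dat)

# "variables" holds the calls exactly as written in the formula
attr(terms(fit), "variables")
# "predvars" holds the same calls plus the basis details learned from the
# training data (for poly(), an extra `coefs` argument)
attr(terms(fit), "predvars")

# With "predvars" intact, predicting at a single new point works
ok <- predict(fit, newdata = data.frame(x = 0.5))

# Without it, poly() is re-evaluated on the single new point and fails
fit_nopv <- fit
attr(fit_nopv$terms, "predvars") <- NULL
res <- try(predict(fit_nopv, newdata = data.frame(x = 0.5)), silent = TRUE)
inherits(res, "try-error")
```

The failing call reproduces the 'degree' error quoted above, because a cubic basis cannot be rebuilt from one unique point.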
A remedy is to dput this special attribute separately:
dput(attr(fit$terms, "predvars"), control = "quoteExpressions", file = "predvars.R")
Then read it back and attach it:
attr(fit1$terms, "predvars") <- source("predvars.R")$value
Now running predict works correctly.
Note that the "dataClasses" attribute of $terms is also missing, but this does not seem to cause any problem for the generic functions; predict.lm, for example, only checks the classes of newdata when that attribute is present.
Step 1:
You need to control de-parsing options:
dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R")
You can read more on all possible options in ?.deparseOpts.
"quoteExpressions" wraps all calls / expressions / language objects in quote(), so that they are not evaluated when you later re-parse the file. Note that source parses and then evaluates the file, and that the call field of your fitted "lm" object is a call:
fit$call
# lm(formula = z ~ x, data = dat_train)
So, without "quoteExpressions", R will try to evaluate the lm call when the file is sourced. Evaluating it means fitting a linear model, so R will look for dat_train, which will not exist in your new R session.
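The effect is easy to see on a tiny call. In this minimal sketch, quote(sum(1, 2)) is only a stand-in for fit$call, and the names txt_quoted, txt_plain, back_quoted and back_plain are made up for illustration.

```r
# Minimal sketch: a call written out with and without "quoteExpressions"
cl <- quote(sum(1, 2))                            # stand-in for fit$call

txt_quoted <- deparse(cl, control = "quoteExpressions")
txt_plain  <- deparse(cl, control = NULL)

# Re-parse and evaluate both texts, as source() would do with a file
back_quoted <- eval(parse(text = txt_quoted))     # still an unevaluated call
back_plain  <- eval(parse(text = txt_plain))      # the call has been run

is.call(back_quoted)
is.numeric(back_plain)
```

With an lm call in place of sum(1, 2), the second route is the one that would try to refit the model.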
"showAttributes" is another mandatory option, as an "lm" object has a class attribute. You certainly don't want to discard the class and export only a plain "list" object, right? Moreover, many elements of an "lm" object, like model (the model frame), qr (the compact QR matrix) and terms (terms info), all have attributes. You want to keep them all.
If you don't set control, the default setting
control = c("keepNA", "keepInteger", "showAttributes")
will be used. As you can see, there is no "quoteExpressions", so you will get into trouble.
You can also specify "keepInteger" and "keepNA", but I don't see the need for an "lm" object.
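The default is worth checking in your own session, since it can differ between R versions (recent versions, as far as I know, also list "niceNames"). A minimal sketch, where defaults is a made-up name:

```r
# Minimal sketch: inspect the default `control` of dput() in the running R
defaults <- eval(formals(dput)$control)
defaults

# Is each option discussed above part of the default?
"quoteExpressions" %in% defaults
"showAttributes" %in% defaults
```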
Step 2:
The above step will get source working correctly. You can recover your model:
fit1 <- source("model.R")$value
However, it is not yet ready for generic functions like summary and predict to work. Why?
The critical issue is that the terms object in fit1 is not really a "terms" object, but only a formula (in fact it is not even a formula, just a "language" object without the "formula" class!). Just compare fit$terms and fit1$terms, and you will see the difference. Don't be surprised; we set "quoteExpressions" earlier. While that is definitely helpful to prevent evaluation of call, it has a side effect on terms. So we need to reconstruct terms as best we can.
Fortunately, it is sufficient to do:
fit1$terms <- terms.formula(fit1$terms)
Though this still does not recover all information in fit$terms (for example, the variable classes are missing), it is a valid "terms" object.
Why is a "terms" object critical? Because all generic functions rely on it. You may not need to know more on this, as it is really technical, so I will stop here.
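For a rough idea of what the reconstruction adds, here is a minimal sketch on a bare quoted formula. quote(z ~ x) is only a stand-in for the restored fit1$terms as described above, and f and tt are made-up names.

```r
# Minimal sketch: a quoted formula versus a reconstructed "terms" object
f <- quote(z ~ x)          # stand-in for the restored fit1$terms
class(f)                   # a plain call, no "formula" or "terms" class

tt <- terms.formula(f)     # rebuild the bookkeeping attributes
inherits(tt, "terms")
attr(tt, "term.labels")    # names of the model terms
attr(tt, "response")       # position of the response among the variables
names(attributes(tt))      # shows which attributes were rebuilt
```

Attributes that R normally adds while building the model frame, such as "predvars", are not recreated by this step.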
Once this is done, we can successfully use predict (and summary, too):
predict(fit1) ## no `newdata` given, using model frame `fit1$model`
# 1 2 3 4
#1.03 2.01 2.99 3.97
predict(fit1, dat_score) ## with `newdata`
# 1 2
#1.52 3.48
Concluding remark:
Although I have shown you how to get things to work, I don't really recommend doing this in general. An "lm" object will be pretty large when you fit a model to a large dataset; for example, residuals and fitted.values are long vectors, and qr and model are huge matrices / data frames. So think about this.
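If you want to check this on your own model, a minimal sketch along these lines lists the size of each component. The data and the name sizes are made up for illustration.

```r
# Minimal sketch with made-up data: which parts of an "lm" object are large
set.seed(1)
n <- 1e4
dat <- data.frame(x = runif(n), y = rnorm(n))
fit <- lm(y ~ x, data = dat)

# Size in bytes of each component, largest first
sizes <- sapply(fit, function(comp) as.numeric(object.size(comp)))
sort(sizes, decreasing = TRUE)
```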