How to correctly `dput` a fitted linear model (by `lm`) to an ASCII file and recreate it later?

借酒劲吻你 2020-12-18 01:09

I want to persist an lm object to a file and reload it into another program. I know I can do this by writing/reading a binary file via saveRDS/

2 Answers
  • 2020-12-18 01:54

    This is an important update!

    As mentioned in the previous answer, the most challenging part is recovering $terms as well as we can. The suggested method using terms.formula works for the OP's example, but not for the following model with bs() and poly():

    dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
    library(splines)
    fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)
    rm(dat)
    

    If we follow the previous answer:

    dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R") 
    fit1 <- source("model.R")$value
    fit1$terms <- terms.formula(fit1$terms)
    

    We will see that summary.lm and anova.lm work correctly, but not predict.lm:

    predict(fit1, newdata = data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5))
    

    Error in bs(x1, df = 3) : could not find function "bs"

    This is because the ".Environment" attribute of $terms is missing. We need to set it:

    environment(fit1$terms) <- .GlobalEnv
    

    Running the above predict call again now gives a different error:

    Error in poly(x2, degree = 3) : 'degree' must be less than number of unique points

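    This is because the "predvars" attribute of $terms is missing. bs() and poly() need it for safe prediction: it records the knots and polynomial coefficients estimated from the training data. Without it, both functions rebuild their basis from newdata alone (here a single row, hence the error).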
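
    To see what is lost, it can help to inspect "predvars" on the original fit. The following is a minimal sketch on simulated data of the same shape as above (the seed and the data values are assumptions for illustration only):

    ```r
    library(splines)

    set.seed(1)  # hypothetical data, same shape as the example above
    dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
    fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)

    # "variables" holds the calls as written in the formula ...
    print(attr(terms(fit), "variables"))

    # ... while "predvars" holds the same calls with the fitted knots and
    # polynomial coefficients filled in, so the basis can be rebuilt on new data.
    pv <- attr(terms(fit), "predvars")
    print(pv)

    # Without those stored coefficients, poly() cannot build a cubic basis
    # from a single new observation; res ends up holding the error message.
    res <- tryCatch(poly(0.5, degree = 3), error = function(e) conditionMessage(e))
    print(res)
    ```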

    The remedy is to dput this attribute separately:

    dput(attr(fit$terms, "predvars"), control = "quoteExpressions", file = "predvars.R")
    

    Then read it back and attach it:

    attr(fit1$terms, "predvars") <- source("predvars.R")$value
    

    Now predict works correctly.

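    Note that the "dataClasses" attribute of $terms is also missing, but this does not seem to cause problems for any of the generic functions. As far as I can tell, predict.lm only uses it to check the variable classes in newdata, and skips the check when it is absent.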
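
    The steps in this answer can be collected into a pair of helpers. This is only a sketch: dump_lm and load_lm are hypothetical names (not part of base R), the data are simulated, and the files go to temporary paths rather than "model.R" and "predvars.R".

    ```r
    library(splines)

    # Hypothetical helper: write the fit and its "predvars" to two ASCII files.
    dump_lm <- function(fit, model_file, predvars_file) {
      dput(fit, control = c("quoteExpressions", "showAttributes"), file = model_file)
      dput(attr(fit$terms, "predvars"), control = "quoteExpressions",
           file = predvars_file)
      invisible(fit)
    }

    # Hypothetical helper: read both files back and repair $terms.
    load_lm <- function(model_file, predvars_file, env = globalenv()) {
      fit <- source(model_file)$value
      fit$terms <- terms.formula(fit$terms)   # rebuild a "terms" object
      environment(fit$terms) <- env           # where bs() and poly() are looked up
      attr(fit$terms, "predvars") <- source(predvars_file)$value
      fit
    }

    set.seed(1)  # simulated data, for illustration only
    dat <- data.frame(x1 = runif(20), x2 = runif(20), x3 = runif(20), y = rnorm(20))
    fit <- lm(y ~ bs(x1, df = 3) + poly(x2, degree = 3) + x3, data = dat)

    model_file    <- tempfile(fileext = ".R")
    predvars_file <- tempfile(fileext = ".R")
    dump_lm(fit, model_file, predvars_file)
    fit1 <- load_lm(model_file, predvars_file)

    # Predictions from the original and the reloaded fit should agree up to
    # the number of significant digits that dput writes.
    newdat <- data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5)
    print(all.equal(predict(fit, newdat), predict(fit1, newdat)))
    ```

    In a fresh session, splines must be attached before calling predict on the reloaded fit, because the formula environment is only used to find bs(), not to load the package.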

  • 2020-12-18 02:05

    Step 1:

    You need to control the deparsing options:

    dput(fit, control = c("quoteExpressions", "showAttributes"), file = "model.R") 
    

    You can read more on all possible options in ?.deparseOpts.


    "quoteExpressions" wraps all calls, expressions and other language objects in quote(), so that they are not evaluated when the file is later parsed and run again. Note:

    • source parses the file and then evaluates it;
    • the call field in your fitted "lm" object is a call:

      fit$call
      # lm(formula = z ~ x, data = dat_train)
      

    So, without "quoteExpressions", source would evaluate the lm call when it reads the file back. Evaluating it means refitting the linear model, so R will look for dat_train, which does not exist in your new R session.
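
    The effect can be seen on the call alone. In this small sketch, cl is a stand-in for fit$call, and dat_train is deliberately left undefined:

    ```r
    cl <- quote(lm(formula = z ~ x, data = dat_train))  # stand-in for fit$call

    plain  <- paste(deparse(cl, control = NULL), collapse = "")
    quoted <- paste(deparse(cl, control = "quoteExpressions"), collapse = "")
    print(plain)
    print(quoted)

    # Re-parsing and evaluating the quoted text gives the call back unevaluated.
    back <- eval(parse(text = quoted))
    print(is.call(back))

    # Evaluating the plain text re-runs lm(), which fails because dat_train
    # does not exist; res then holds the error condition.
    res <- tryCatch(eval(parse(text = plain)), error = function(e) e)
    print(inherits(res, "error"))
    ```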


    "showAttributes" is the other mandatory option, because an "lm" object carries a class attribute. You certainly don't want to drop it and export a plain "list" object, right? Moreover, many elements of an "lm" object, such as model (the model frame), qr (the compact QR matrix) and terms (the terms info), have attributes of their own. You want to keep them all.
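
    A toy object is enough to illustrate this. In the sketch below, obj and its class "demo" are made up for the example:

    ```r
    obj <- structure(list(a = 1, b = "x"), class = "demo")  # hypothetical S3 object

    # Round-trip through text with and without "showAttributes".
    kept    <- eval(parse(text = deparse(obj, control = "showAttributes")))
    dropped <- eval(parse(text = deparse(obj, control = NULL)))

    print(class(kept))     # the class survives
    print(class(dropped))  # only a plain list is left
    ```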


    If you don't set control, the default setting

    control = c("keepNA", "keepInteger", "showAttributes")
    

    will be used. As you can see, it lacks "quoteExpressions", so you will run into trouble.

    You can also add "keepInteger" and "keepNA", but I don't see the need for an "lm" object.

    ------

    Step 2:

    The above step gets source working correctly. You can now recover your model:

    fit1 <- source("model.R")$value
    

    However, it is not yet ready for generic functions like summary and predict. Why?

    The critical issue is that the terms element of fit1 is not really a "terms" object but only a formula (in fact not even a formula, just a "language" object without the "formula" class). Compare fit$terms and fit1$terms and you will see the difference. This should not be a surprise: we set "quoteExpressions" earlier. While that prevents evaluation of call, it has a side effect on terms, which is also written out as a quoted expression and loses its attributes. So we need to reconstruct terms as well as we can.

    Fortunately, it is sufficient to do:

    fit1$terms <- terms.formula(fit1$terms)
    

    Though this still does not recover all the information in fit$terms (the variable classes, for example, are missing), the result is a valid "terms" object.
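
    A quick way to see what terms.formula recovers and what it does not is to compare attribute names. This is a sketch on made-up data; bare stands in for what fit1$terms looks like right after source:

    ```r
    dat_train <- data.frame(x = 1:4, z = c(1.1, 1.9, 3.1, 3.9))  # hypothetical data
    fit <- lm(z ~ x, data = dat_train)

    bare    <- quote(z ~ x)          # a bare call, as returned by dput + source
    rebuilt <- terms.formula(bare)

    print(class(bare))
    print(inherits(rebuilt, "terms"))

    # Attributes present in the original terms but absent after rebuilding.
    lost <- setdiff(names(attributes(fit$terms)), names(attributes(rebuilt)))
    print(lost)
    ```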

    Why is a "terms" object critical? Because the generic functions rely on it. The details are quite technical and you may not need them, so I will stop here.

    Once this is done, we can successfully use predict (and summary, too):

    predict(fit1)  ## no `newdata` given, using model frame `fit1$model`
    #   1    2    3    4 
    #1.03 2.01 2.99 3.97 
    
    predict(fit1, dat_score)  ## with `newdata`
    #   1    2 
    #1.52 3.48 
    

    -------

    Concluding remark:

    Although I have shown how to get this to work, I don't recommend doing it in general. An "lm" object becomes very large when the model is fitted to a large dataset: residuals and fitted.values are long vectors, and qr and model are huge matrices / data frames. So think about whether you really need this.
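
    To get a feel for where the bulk sits, the components of a fit can be measured with object.size. This is a rough sketch on simulated data; the sample size n is an arbitrary assumption:

    ```r
    set.seed(1)
    n <- 1e4  # hypothetical sample size
    dat <- data.frame(x = runif(n), y = rnorm(n))
    fit <- lm(y ~ x, data = dat)

    # Size in bytes of each component of the fitted object, largest first.
    sizes <- vapply(unclass(fit), function(e) as.numeric(object.size(e)), numeric(1))
    print(sort(sizes, decreasing = TRUE))

    # lm() can be told not to store the model frame, which removes one of
    # the large components before anything is written to disk.
    fit_small <- lm(y ~ x, data = dat, model = FALSE)
    print(is.null(fit_small$model))
    ```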
