Why is bam from mgcv slow for some data?

Submitted by 删除回忆录丶 on 2020-05-25 18:39:18

Question


I am fitting the same Generalized Additive Model on multiple data sets using the bam function from mgcv. For most of my data sets the fit completes within a reasonable time of 10 to 20 minutes, but for a few data sets the run takes more than 10 hours to complete. I cannot find any similarities between the slow cases: the final fit is neither exceptionally good nor bad, nor do the data contain any noticeable outliers.

How can I figure out why the fit is so slow for these instances? And how might I be able to speed these up?

My model contains two smooth terms (using a cyclic cubic spline basis) and some additional numeric and factor variables. In total, 300 coefficients (including those for the smooth terms) are estimated. I intentionally keep the number of knots below the information-theoretically optimal number to speed up the fitting process. My data sets contain around 850k rows each.

This is the function call:

bam(
    value
    ~ 0
    + weekday_x
    + weekday
    + time
    + "a couple of factor variables encoding special events"
    + delta:weekday
    + s(share_of_year, k=length(knotsYear), bs="cc")
    + s(share_of_year_x, k=length(knotsYear), bs="cc")
    , knots=list(
      share_of_year=knotsYear
      , share_of_year_x=knotsYear
    )
    , family=quasipoisson()
    , data=data
)

knotsYear contains 26 knots.

This model converges reasonably fast in most cases, but incredibly slowly for a few.


Answer 1:


The most likely cause: "fREML" failure

In a typical model like the one above, without tensor product smooths (te or ti), my experience is that the REML iteration fails for some cases.

The canonical bam implementation uses fast.REML.fit. The convergence test in this routine needs a fix, but since Simon has moved on and now focuses on the discrete method, he is not keen on fixing it. The fixed version is (at the moment) only available in a testing add-on, the "Zheyuan add-on", supplementary to my PhD thesis. fast.REML.fit is not that fragile, though, and such convergence failures are not frequent; otherwise a pile of bug reports would have got this issue fixed back in 2012.

The following only helps you run a check, not a fix.

Let fit be your fitted model that takes 10 hours, and check fit$outer.info. This gives the number of REML iterations it took, together with convergence information such as the gradient and Hessian. If you see iter = 200, or a message reporting a failure such as "step failed", then you know why it took so long. But if you look at the gradient, most likely you will see that it is almost zero. In other words, the REML iteration has actually converged, but fast.REML.fit fails to detect it.
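As a concrete sketch of this check (assuming `fit` holds the slow bam() fit; the exact component names of the iteration record may vary with your mgcv version):

```r
# Inspect the REML iteration record of a fitted bam() model `fit`
fit$outer.info        # full record: iteration count, gradient, Hessian, messages
fit$outer.info$iter   # 200 here means the loop hit its cap without declaring convergence
fit$outer.info$grad   # a near-zero gradient means it had in fact converged
```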


Another check: "performance iteration"

Since you are fitting a GAM, not an AM (an additive model with a Gaussian response), there is another P-IRLS (penalized iteratively re-weighted least squares) loop outside the REML iteration. Yes, the whole (canonical) bam estimation is a double loop nest, called "performance iteration". This may fail, too, but such failure is more intrinsic and may not be possible to overcome, as the "performance iteration" is not guaranteed to converge. So check fit$iter to see whether it is also very big (in the worst case it can be 200). The mgcv manual has a dedicated section, ?gam.convergence, discussing this type of convergence failure, and it is the driving reason that "outer iteration" is desirable. However, for a large dataset, "outer iteration" is too expensive to implement, so bear with "performance iteration".
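A quick look at the P-IRLS side of the loop nest (again assuming `fit` is the fitted model):

```r
# Number of "performance iteration" (P-IRLS) steps taken by bam()
fit$iter   # a value at or near the cap of 200 indicates this outer loop struggled
```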

mgcv has a "tracing" option. If you set control = gam.control(trace = TRUE) when calling bam, the deviance and an iteration counter are printed to the screen as the "performance iteration" proceeds. This gives you a clear trace of the penalized deviance, so you can inspect whether it is cycling or trapped at some point. This is more informative than the single iteration number stored in fit$iter.
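For example (a sketch; the formula is abbreviated relative to the full call above, and `knotsYear` and `data` are the objects from the question):

```r
library(mgcv)

fit <- bam(
    value ~ 0 + weekday + s(share_of_year, bs = "cc"),  # abbreviated formula
    knots   = list(share_of_year = knotsYear),
    family  = quasipoisson(),
    data    = data,
    control = gam.control(trace = TRUE)  # print deviance per performance iteration
)
```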


Perhaps try the new method?

Maybe you want to try the new discrete = TRUE (added in 2015; the paper was formally published in 2017). It uses a new fitting iteration. We are (MUCH) happier to test its practical convergence capability than that of the old method. When using it, turn on "trace", too. If it fails to converge, think about reporting it, but we need a reproducible example.
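A sketch of that call (abbreviated formula as before; whether every family is supported with discrete = TRUE may depend on your mgcv version):

```r
library(mgcv)

fit <- bam(
    value ~ 0 + weekday + s(share_of_year, bs = "cc"),  # abbreviated formula
    family   = quasipoisson(),
    data     = data,
    discrete = TRUE,                      # new discretized fitting method
    control  = gam.control(trace = TRUE)  # watch the new iteration as it runs
)
```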



Source: https://stackoverflow.com/questions/40788972/why-is-bam-from-mgcv-slow-for-some-data
