Using statsmodel estimations with scikit-learn cross validation, is it possible?


Question


I posted this question to the Cross Validated forum and later realized that it might find a more appropriate audience on Stack Overflow instead.

I am looking for a way to feed the fit object (result) obtained from Python statsmodels into cross_val_score of scikit-learn's cross_validation method. The attached link suggests that it may be possible, but I have not succeeded.

I am getting the following error:

estimator should be an estimator implementing 'fit' method, <statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x7fa6e801c590> was passed

Refer to this link.
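
For context, the question does not show the exact call, but a hypothetical reconstruction of the kind of code that triggers this error (the data and model below are made up for illustration) would look like this: .fit() returns a results wrapper, and that wrapper, rather than an estimator, is handed to cross_val_score.

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = sm.add_constant(rng.normal(size=(100, 3)))
y = rng.randint(0, 2, size=100)

result = sm.Logit(y, X).fit()        # returns a BinaryResultsWrapper, not an estimator
cross_val_score(result, X, y, cv=3)  # rejected: the results wrapper has no sklearn-style fit(X, y)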


Answer 1


Indeed, you cannot use cross_val_score directly on statsmodels objects, because the interfaces differ (contrasted in the short sketch after this list): in statsmodels

  • training data is passed directly into the constructor
  • a separate object contains the result of model estimation
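
To make the interface difference concrete, here is a minimal, hypothetical side-by-side sketch (the data and models are made up for illustration):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

X = np.random.RandomState(0).normal(size=(50, 2))
y = X @ np.array([1.0, 2.0]) + 0.1

# statsmodels: the training data goes into the constructor; fit() returns a separate results object
sm_results = sm.OLS(y, sm.add_constant(X)).fit()
print(sm_results.params)

# scikit-learn: the estimator is constructed without data; fit(X, y) returns the fitted estimator itself
skl_est = LinearRegression().fit(X, y)
print(skl_est.intercept_, skl_est.coef_)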

However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin

class SMWrapper(BaseEstimator, RegressorMixin):
    """ A universal sklearn-style wrapper for statsmodels regressors """
    def __init__(self, model_class, fit_intercept=True):
        self.model_class = model_class
        self.fit_intercept = fit_intercept
    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        # statsmodels takes the training data in the constructor (endog, exog)
        self.model_ = self.model_class(y, X)
        self.results_ = self.model_.fit()
        # returning self is part of the sklearn estimator contract
        return self
    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)

This class has the required fit and predict methods, and can be used with sklearn, e.g. cross-validated or included in a pipeline. Like here:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X, y = make_regression(random_state=1, n_samples=300, noise=100)

print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))
print(cross_val_score(LinearRegression(), X, y, scoring='r2'))

You can see that the outputs of the two models are identical, because they are both OLS models, cross-validated in the same way.

[0.28592315 0.37367557 0.47972639]
[0.28592315 0.37367557 0.47972639]
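
Since the wrapper follows the sklearn estimator API, it can also be dropped into a pipeline, as mentioned above. A minimal sketch reusing X, y and SMWrapper from the example (the scaler is just an illustrative preprocessing step):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The wrapped statsmodels OLS behaves like any other sklearn regressor inside the pipeline
pipe = make_pipeline(StandardScaler(), SMWrapper(sm.OLS))
print(cross_val_score(pipe, X, y, scoring='r2'))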



Answer 2


For reference, if you use the statsmodels formula API and/or the fit_regularized method, you can modify @David Dale's wrapper class in this way.

import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from statsmodels.formula.api import glm as glm_sm

# This is an example wrapper for statsmodels GLM
class SMWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, family, formula, alpha, L1_wt):
        self.family = family
        self.formula = formula
        self.alpha = alpha
        self.L1_wt = L1_wt
        self.model = None
        self.result = None
    def fit(self, X, y):
        # Combine X and y into one DataFrame so the formula can refer to a 'y' column
        data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
        data.columns = X.columns.tolist() + ['y']
        self.model = glm_sm(self.formula, data, family=self.family)
        self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True)
        # return self (not the results object) to follow the sklearn estimator contract
        return self
    def predict(self, X):
        return self.result.predict(X)
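
A hypothetical usage sketch for this wrapper (the DataFrame columns, formula, family, and penalty values below are assumptions for illustration, not part of the original answer), reusing pd and SMWrapper from the block above:

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(100, 2)), columns=['x0', 'x1'])
y = X['x0'] * 1.5 - X['x1'] * 0.5 + rng.normal(scale=0.1, size=100)

# The formula refers to the DataFrame columns and the 'y' column created inside fit()
wrapper = SMWrapper(family=sm.families.Gaussian(), formula='y ~ x0 + x1',
                    alpha=0.1, L1_wt=0.5)
print(cross_val_score(wrapper, X, y, scoring='r2'))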


Source: https://stackoverflow.com/questions/41045752/using-statsmodel-estimations-with-scikit-learn-cross-validation-is-it-possible
