Using statsmodels estimations with scikit-learn cross validation, is it possible?

無奈伤痛 2021-01-31 19:05

I posted this question to the Cross Validated forum and later realized that it might find a more appropriate audience on Stack Overflow instead.

I am looking for a way I can use statsmodels estimations (e.g. OLS) with scikit-learn's cross-validation utilities such as cross_val_score.
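For instance, a naive attempt like the one below fails, because a statsmodels model binds the data in its constructor and exposes no sklearn-style fit(X, y)/get_params() interface (a minimal sketch, with X and y as placeholder arrays):

import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 2)   # placeholder features
y = np.random.rand(100)      # placeholder target

# statsmodels takes (y, X) at construction time, so cross_val_score
# cannot clone and refit it per fold; this call raises a TypeError
cross_val_score(sm.OLS(y, sm.add_constant(X)), X, y, cv=10)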

4 Answers
  •  無奈伤痛 2021-01-31 19:28

    Following the suggestion of David (which gave me an error, complaining about the missing function get_parameters) and the scikit-learn documentation, I created the following wrapper for a linear regression. It has the same interface as sklearn.linear_model.LinearRegression, but in addition it also has a summary() function, which gives information about p-values, R² and other statistics, as in statsmodels.OLS.

    import statsmodels.api as sm
    import pandas as pd
    from sklearn.base import BaseEstimator, RegressorMixin
    from sklearn.utils.validation import check_X_y, check_is_fitted, check_array


    class MyLinearRegression(BaseEstimator, RegressorMixin):
        def __init__(self, fit_intercept=True):
            self.fit_intercept = fit_intercept

        def fit(self, X, y, column_names=()):
            """
            Parameters
            ----------
            column_names: list
                Optional. The feature name to associate with each
                column of X. This is useful with the method summary(),
                so that it can show the feature name for each
                coefficient.
            """
            if self.fit_intercept:
                # has_constant='add' keeps the constant column even if X
                # already contains a constant-valued column
                X = sm.add_constant(X, has_constant='add')

            # Check that X and y have correct shape
            X, y = check_X_y(X, y)

            self.X_ = X
            self.y_ = y

            if len(column_names) != 0:
                cols = list(column_names)
                if self.fit_intercept:
                    cols.insert(0, 'intercept')
                X = pd.DataFrame(X, columns=cols)

            self.model_ = sm.OLS(y, X)
            self.results_ = self.model_.fit()
            return self

        def predict(self, X):
            # Check that fit has been called
            check_is_fitted(self, 'model_')

            # Input validation
            X = check_array(X)

            if self.fit_intercept:
                # has_constant='add' matters here: for a single-row X every
                # column looks constant, and add_constant would otherwise
                # skip adding the intercept column
                X = sm.add_constant(X, has_constant='add')
            return self.results_.predict(X)

        def get_params(self, deep=False):
            # BaseEstimator already provides get_params; kept explicit
            # for clarity
            return {'fit_intercept': self.fit_intercept}

        def summary(self):
            print(self.results_.summary())
    

    Example of use:

    cols = ['feature1','feature2']
    X_train = df_train[cols].values
    X_test = df_test[cols].values
    y_train = df_train['label']
    y_test = df_test['label']
    model = MyLinearRegression()
    model.fit(X_train, y_train)
    model.summary()
    model.predict(X_test)
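
    As a quick sanity check, the wrapper's coefficients should match those of sklearn's own LinearRegression; a short sketch, reusing model, X_train and y_train from above (results_.params holds the intercept first, then one coefficient per feature):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    sk_model = LinearRegression().fit(X_train, y_train)

    # intercept first, then the feature coefficients
    np.testing.assert_allclose(model.results_.params[0], sk_model.intercept_, rtol=1e-6)
    np.testing.assert_allclose(model.results_.params[1:], sk_model.coef_, rtol=1e-6)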
    

    If you want to show the names of the columns, you can call

    model.fit(X_train, y_train, column_names=cols)
    

    To use it in cross-validation:

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    scores
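
    If you also need the per-fold statistics and not only the scores, cross_validate can return the fitted estimator of each fold via return_estimator=True, so that summary() can be called per fold; a short sketch under the same setup:

    from sklearn.model_selection import cross_validate

    cv_results = cross_validate(MyLinearRegression(), X_train, y_train, cv=10,
                                scoring='neg_mean_squared_error',
                                return_estimator=True)

    # one fitted wrapper per fold, each with its own statsmodels results
    for fold_model in cv_results['estimator']:
        fold_model.summary()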
    
