How to Retrieve Original Variables After Scikit Model Run w/OneHotEncoding

悲哀的现实 2020-12-18 03:26

I have successfully run a logistic regression model with scikit-learn's SGDClassifier but cannot easily interpret the model's coefficients (accessed via SGD

1 Answer
  • 2020-12-18 03:56

    After reviewing this user's detailed explanation of OneHotEncoder here, I was able to create a (somewhat hacky) approach for relating model coefficients back to the original data set.

    Assuming you've correctly set up your OneHotEncoder:

    from sklearn.preprocessing import OneHotEncoder
    from scipy import sparse
    
    enc = OneHotEncoder()
    X_OHE = enc.fit_transform(X)   # X and X_OHE as described in question
    

    And you have successfully fit a GLM, say:

    from sklearn import linear_model
    
    clf = linear_model.SGDClassifier()
    clf.fit(X_train, y_train)
    

    Which has coefficients clf.coef_:

    print(clf.coef_)
    # np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])
    

    You can use the approach below to trace the encoded 1s and 0s in X_OHE back to the original values in X. I'd recommend reading the detailed explanation of OneHotEncoder mentioned earlier (link at top); otherwise the code below will seem like gibberish. In a nutshell, it iterates over each active feature in X_OHE and uses the encoder's feature_indices_ attribute to make the translation.

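    import pandas as pd
    import numpy as np

    results = []

    # Walk over the encoded columns that actually occur in the training data.
    for i in range(enc.active_features_.shape[0]):
        f = enc.active_features_[i]

        # Offsets of all original columns whose block starts at or before f.
        index_range = np.extract(enc.feature_indices_ <= f, enc.feature_indices_)
        s = len(index_range) - 1          # index of the original column in X (unused below)
        f_index = index_range[-1]         # start offset of that column's block
        f_label_decoded = f - f_index     # integer value that was one-hot encoded

        results.append({
            'label_decoded_value': f_label_decoded,
            'coefficient': clf.coef_[0][i]
        })

    R = pd.DataFrame.from_records(results)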
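
    To see what the offset arithmetic in the loop does without a fitted encoder, here is a minimal sketch that uses hand-made stand-ins for enc.feature_indices_ and enc.active_features_. The array values and the helper name decode_active_feature are hypothetical, chosen only for illustration: two original columns, the first with integer codes 0 to 2 (code 1 never seen in training) and the second with codes 0 to 1.

```python
import numpy as np

# Hypothetical stand-ins for the legacy encoder attributes (not from a real fit).
feature_indices = np.array([0, 3, 5])      # column 0 -> slots 0..2, column 1 -> slots 3..4
active_features = np.array([0, 2, 3, 4])   # slot 1 never occurred in the training data


def decode_active_feature(f, feature_indices):
    """Map an active slot f to (original column index, integer value)."""
    starts = np.extract(feature_indices <= f, feature_indices)
    column = len(starts) - 1   # number of column offsets at or below f, minus one
    value = f - starts[-1]     # position of f inside that column's block
    return int(column), int(value)


decoded = [decode_active_feature(f, feature_indices) for f in active_features]
print(decoded)  # [(0, 0), (0, 2), (1, 0), (1, 1)]
```

    Each tuple pairs the original column index with the integer code that was encoded, in the same order as the entries of clf.coef_[0].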
    

    Where R looks like this (I originally encoded the names of company departments):

    coefficient label_decoded_value
    3.929413    DepartmentFoo1
    3.718078    DepartmentFoo2
    3.101869    DepartmentFoo3
    2.892845    DepartmentFoo4
    ...
    

    So now you can say, "The model's decision score (the log-odds of the positive class, if a logistic loss was used) increases by 3.929413 when an employee is in department 'Foo1'."
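
    One caveat: active_features_ and feature_indices_ were deprecated in scikit-learn 0.20 and removed in 0.22, so the loop above only runs on older releases. On newer versions, a rough equivalent (a sketch, not the approach used in this answer) is to read the fitted encoder's categories_ attribute, which lists the categories of each original column in the same order as the encoded columns. The toy X and y below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column and a binary target.
X = np.array([["Foo1"], ["Foo2"], ["Foo3"], ["Foo1"], ["Foo2"], ["Foo3"]])
y = np.array([1, 0, 0, 1, 0, 1])

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)

clf = SGDClassifier(random_state=0)
clf.fit(X_OHE, y)

# categories_ holds one array per original column, ordered like the encoded
# columns, so flattening it lines the labels up with clf.coef_[0].
labels = [
    (col, cat)
    for col, cats in enumerate(enc.categories_)
    for cat in cats
]

R = pd.DataFrame({
    "column": [col for col, _ in labels],
    "label_decoded_value": [cat for _, cat in labels],
    "coefficient": clf.coef_[0],
})
print(R)
```

    Because the encoder here works on the string categories directly, label_decoded_value already contains the department names and no integer-to-name mapping step is needed.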
