How to Retrieve Original Variables After Scikit Model Run w/OneHotEncoding


I have successfully run a logistic regression model from the scikit-learn SGDClassifier package but cannot easily interpret the model's coefficients (accessed via SGD


                      
1 Answer

感情败类, 2020-12-18 03:56
              
            
            
                                                                       
After reviewing this user's detailed explanation of OneHotEncoder here, I was able to create a (somewhat hack-y) approach to relating model coefficients back to the original data set.

Assuming you've correctly set up your OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
from scipy import sparse

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)   # X and X_OHE as described in question


And you have successfully run a linear model, say:

from sklearn import linear_model

clf = linear_model.SGDClassifier()
clf.fit(X_train, y_train)   # X_train: training rows taken from X_OHE


Which has coefficients clf.coef_:

print(clf.coef_)
# np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])


You can use the approach below to trace the encoded 1s and 0s in X_OHE back to the original values in X. I'd recommend reading the detailed explanation of OneHotEncoder mentioned earlier (link at top); otherwise the code below will seem like gibberish. In a nutshell, it iterates over each feature in X_OHE and uses the encoder's feature_indices_ attribute to make the translation.
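The decoding step that the loop below performs can be sketched in isolation. The two arrays here are hypothetical stand-ins for enc.feature_indices_ and enc.active_features_ (two integer-coded columns, the first with 3 possible values and the second with 4, of which only values 0 and 3 were seen). The lookup uses np.searchsorted, which should be equivalent to the np.extract approach for sorted boundaries; treat it as an illustration, not the encoder's own code.

```python
import numpy as np

# Hypothetical stand-ins for enc.feature_indices_ and enc.active_features_.
# feature_indices marks where each original column starts in the encoded space.
feature_indices = np.array([0, 3, 7])
active_features = np.array([0, 1, 2, 3, 6])


def decode(f, feature_indices):
    """Map an encoded feature index f to (original column, original value)."""
    # Position of the last boundary that is <= f.
    col = int(np.searchsorted(feature_indices, f, side="right")) - 1
    value = int(f - feature_indices[col])
    return col, value


decoded = [decode(f, feature_indices) for f in active_features]
print(decoded)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 3)]
```

Each tuple is (original column index, original integer value), so encoded column 4 here corresponds to value 3 in the second input column.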

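import pandas as pd
import numpy as np

results = []

# Encoded column i corresponds to entry i of enc.active_features_
for i in range(enc.active_features_.shape[0]):
    f = enc.active_features_[i]

    # Column boundaries at or below f; the last one is where f's original column starts
    index_range = np.extract(enc.feature_indices_ <= f, enc.feature_indices_)
    s = len(index_range) - 1        # index of the original column in X (not used below)
    f_index = index_range[-1]       # offset of that column in the encoded space
    f_label_decoded = f - f_index   # original integer value within that column

    results.append({
        'label_decoded_value': f_label_decoded,
        'coefficient': clf.coef_[0][i]
    })

R = pd.DataFrame.from_records(results)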


Where R looks like this (I originally encoded the names of company departments):

coefficient label_decoded_value
3.929413    DepartmentFoo1
3.718078    DepartmentFoo2
3.101869    DepartmentFoo3
2.892845    DepartmentFoo4
...
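The active_features_ and feature_indices_ attributes were deprecated in scikit-learn 0.20 and removed in 0.22, so the loop above only runs on older releases. On newer versions a similar table can probably be built from the encoder's categories_ attribute, which lists the category values per input column in the same order as the encoded output columns. The sketch below uses a made-up single department column and target, and keeps SGDClassifier's default loss (hinge), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column and a binary target.
X = np.array(
    [["Foo1"], ["Foo2"], ["Foo3"], ["Foo1"], ["Foo2"], ["Foo3"]],
    dtype=object,
)
y = np.array([1, 0, 0, 1, 0, 1])

enc = OneHotEncoder()
X_OHE = enc.fit_transform(X)  # sparse matrix, one column per category

clf = SGDClassifier(random_state=0)
clf.fit(X_OHE, y)

# One record per encoded column: input column index and category value.
records = [
    {"column": col, "category": cat}
    for col, cats in enumerate(enc.categories_)
    for cat in cats
]
R = pd.DataFrame.from_records(records)
R["coefficient"] = clf.coef_[0]  # binary target: coef_ has shape (1, n_columns)
print(R)
```

With several input columns, the "column" field tells you which original variable each coefficient belongs to, which the legacy loop tracked through feature_indices_.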


So now you can say, "The model's decision score (the log-odds, if the classifier was fitted with a logistic loss) increases by 3.929413 when an employee is in department 'Foo1'."
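If the model was fitted with a logistic loss, a coefficient is a change in log-odds rather than in the target itself, so exponentiating it gives a multiplicative change in the odds. A small sketch of that arithmetic, using the DepartmentFoo1 coefficient from the table and a hypothetical baseline log-odds of 0 (probability 0.5):

```python
import math

coefficient = 3.929413  # DepartmentFoo1 coefficient from the table above


def probability(log_odds):
    """Logistic function: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Multiplicative change in the odds when the department indicator is 1.
odds_ratio = math.exp(coefficient)

baseline_log_odds = 0.0  # hypothetical baseline, not taken from the model
p_before = probability(baseline_log_odds)
p_after = probability(baseline_log_odds + coefficient)

print(f"odds ratio: {odds_ratio:.2f}")
print(f"probability: {p_before:.3f} -> {p_after:.3f}")
```

The real baseline depends on clf.intercept_ and on the other active features, so the probabilities printed here only show the direction and rough size of the effect.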
                                                                        
                                                        
            
            
              
                