I have successfully run a logistic regression model from the scikit-learn SGDClassifier package but cannot easily interpret the model's coefficients (accessed via SGD
After reviewing this user's detailed explanation of OneHotEncoder here, I was able to create a (somewhat hack-y) approach to relating model coefficients back to the original data set.
Assuming you've correctly set up your OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
enc = OneHotEncoder()
X_OHE = enc.fit_transform(X) # X and X_OHE as described in question
And you have successfully run a GLM, say:
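from sklearn import linear_model
# Note: the default loss is 'hinge' (a linear SVM); request a logistic loss for logistic regression
clf = linear_model.SGDClassifier()
clf.fit(X_train, y_train) # X_train is assumed to hold the training rows of X_OHE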
Which has coefficients clf.coef_:
print(clf.coef_)
# np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])
You can use the approach below to trace the encoded 1's and 0's in X_OHE back to the original values in X. I'd recommend reading the detailed explanation of OneHotEncoding mentioned above (link at top); otherwise the code below will seem like gibberish. In a nutshell, it iterates over each feature in X_OHE and uses the feature_indices_ attribute of enc to make the translation. Note that active_features_ and feature_indices_ exist only in older scikit-learn releases (they were removed in version 0.22).
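To see the offset arithmetic in isolation, here is a minimal pure-Python sketch that does not depend on scikit-learn. The offsets list and the helper name decode_column are made up for illustration; the list stands in for enc.feature_indices_, on the assumption that it holds the cumulative starting column of each original feature.

```python
from bisect import bisect_right


def decode_column(column, feature_indices):
    """Map a one-hot column id to (original feature index, original integer value).

    feature_indices holds cumulative offsets: feature k owns the column ids
    in the half-open range [feature_indices[k], feature_indices[k + 1]).
    """
    # Position of the last offset that is <= column
    feature = bisect_right(feature_indices, column) - 1
    # Distance from the start of that feature's block is the original value
    value = column - feature_indices[feature]
    return feature, value


# Hypothetical offsets for three features with 3, 2 and 4 possible values
offsets = [0, 3, 5, 9]

for column in (0, 2, 3, 4, 8):
    print(column, decode_column(column, offsets))
```

The loop over the fitted encoder applies the same two steps (find the last offset not exceeding the column id, then subtract it) to every active column.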
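import pandas as pd
import numpy as np
results = []
for i in range(enc.active_features_.shape[0]):
    # Column id of the i-th active feature in the full one-hot space
    f = enc.active_features_[i]
    # Offsets of all original features that start at or before this column
    index_range = np.extract(enc.feature_indices_ <= f, enc.feature_indices_)
    s = len(index_range) - 1  # index of the original feature (column of X); not used below
    f_index = index_range[-1]  # offset at which that feature's block starts
    f_label_decoded = f - f_index  # original integer value within that feature
    results.append({
        'label_decoded_value': f_label_decoded,
        'coefficient': clf.coef_[0][i]
    })
R = pd.DataFrame.from_records(results)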
Where R looks like this (I originally encoded the names of company departments):
coefficient label_decoded_value
3.929413 DepartmentFoo1
3.718078 DepartmentFoo2
3.101869 DepartmentFoo3
2.892845 DepartmentFoo4
...
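So now you can say, "The log-odds of the target variable increase by 3.929413 when an employee is in department 'Foo1'" (for a logistic model, the coefficients act on the log-odds, not on the target itself).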
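For a model fit with a logistic loss, exponentiating a coefficient gives an odds ratio, which is often easier to communicate than a change in log-odds. A minimal sketch, assuming the coefficients come from a logistic fit (SGDClassifier uses hinge loss unless told otherwise); the helper name odds_ratio is made up for illustration:

```python
import math


def odds_ratio(coefficient):
    """Multiplicative change in the odds when a one-hot indicator goes from 0 to 1."""
    return math.exp(coefficient)


# Coefficients copied from the table above
for department, coef in [("DepartmentFoo1", 3.929413), ("DepartmentFoo4", 2.892845)]:
    print(department, round(odds_ratio(coef), 2))
```

An odds ratio above 1 means membership in that department raises the odds of the positive class, with the other features held fixed.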