Using scikit to determine contributions of each feature to a specific class prediction

粉色の甜心 2020-12-02 13:57

I am using a scikit extra trees classifier:

model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)

Once the model is fitted, is there a way to determine the contribution of each feature to a specific class prediction?

5 Answers
  •  无人及你
    2020-12-02 14:33

    This is modified from the docs:

    from sklearn import datasets
    from sklearn.ensemble import ExtraTreesClassifier
    
    iris = datasets.load_iris()  #sample data
    X, y = iris.data, iris.target
    
    model = ExtraTreesClassifier(n_estimators=10000, n_jobs=-1, random_state=0)
    model.fit(X, y)  # fit the model to the data
    

    I think feature_importances_ is what you're looking for:

    In [13]: model.feature_importances_
    Out[13]: array([ 0.09523045,  0.05767901,  0.40150422,  0.44558631])
    
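    If it helps, here is a minimal sketch (reusing the model and iris objects fitted above) that labels each importance with its feature name and ranks them, most important first:

    # Pair each importance with its feature name and sort descending.
    ranked = sorted(zip(iris.feature_names, model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for name, importance in ranked:
        print(f"{name}: {importance:.4f}")
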

    EDIT

    Maybe I misunderstood the first time (pre-bounty), sorry; this may be more along the lines of what you are looking for. There is a Python library called treeinterpreter that produces the information I think you want. You'll have to use the basic DecisionTreeClassifier (or Regressor). Following along from this blog post, you can directly access the feature contributions in the prediction of each instance:

    from sklearn import datasets
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
    from sklearn.tree import DecisionTreeClassifier
    
    from treeinterpreter import treeinterpreter as ti
    
    iris = datasets.load_iris()  #sample data
    X, y = iris.data, iris.target
    #split into training and test 
    X_train, X_test, y_train, y_test = train_test_split( 
        X, y, test_size=0.33, random_state=0)
    
    # fit the model on the training set
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train,y_train)
    

    I'll just iterate through each sample in X_test for illustrative purposes; this almost exactly mimics the blog post above:

    for test_sample in range(len(X_test)):
        prediction, bias, contributions = ti.predict(
            model, X_test[test_sample].reshape(1, -1))
        print("Class Prediction", prediction)
        print("Bias (trainset prior)", bias)
    
        # now extract contributions for each instance
        for c, feature in zip(contributions[0], iris.feature_names):
            print(feature, c)
    
        print('\n')
    

    The first iteration of the loop yields:

    Class Prediction [[ 0.  0.  1.]]
    Bias (trainset prior) [[ 0.34  0.31  0.35]]
    sepal length (cm) [ 0.  0.  0.]
    sepal width (cm) [ 0.  0.  0.]
    petal length (cm) [ 0.         -0.43939394  0.43939394]
    petal width (cm) [-0.34        0.12939394  0.21060606]
    

    Interpreting this output, it seems as though petal length and petal width were the most important contributors to the prediction of the third class (for the first sample). Hope this helps.
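
    As a sanity check, treeinterpreter's decomposition means each prediction equals the bias plus the summed feature contributions. A short sketch (reusing model and X_test from above; ti.predict also accepts the whole 2-D test set, so the per-sample loop is optional):

    import numpy as np
    
    # Decompose every test-set prediction in one call; contributions
    # has shape (n_samples, n_features, n_classes).
    prediction, bias, contributions = ti.predict(model, X_test)
    
    # Each prediction should equal bias + contributions summed over
    # the feature axis.
    print(np.allclose(prediction, bias + contributions.sum(axis=1)))
    # expected output: True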
