How to get the most informative features for a scikit-learn classifier for different classes?

伪装坚强ぢ 2020-12-05 12:50

The NLTK package provides a method, show_most_informative_features(), that lists the most important features for both classes, with output like:

   contai         


        
3 answers
  •  不知归路
    2020-12-05 13:20

    Basically you need:

    def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
        # Row of coef_ that belongs to the requested class label.
        labelid = list(classifier.classes_).index(classlabel)
        feature_names = vectorizer.get_feature_names()
        # Sort (coefficient, feature) pairs ascending and keep the n largest.
        topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

        for coef, feat in topn:
            print(classlabel, feat, coef)
    
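    • classifier.classes_ holds the class labels in the order the classifier uses internally, so the position of classlabel in it is the row index into coef_

    • vectorizer.get_feature_names() returns the vocabulary terms, one per column of the feature matrix

    • sorted(zip(classifier.coef_[labelid], feature_names))[-n:] pairs each coefficient for the given class label with its feature name, sorts the pairs in ascending order of coefficient, and keeps the last n, i.e. the n features with the largest coefficients.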


    I'm going to use a simple example from https://github.com/alvations/bayesline

    Input file train.txt:

    $ echo """Pošto je EULEX obećao da će obaviti istragu o prošlosedmičnom izbijanju nasilja na sjeveru Kosova, taj incident predstavlja još jedan ispit kapaciteta misije da doprinese jačanju vladavine prava.
    > De todas as provações que teve de suplantar ao longo da vida, qual foi a mais difícil? O início. Qualquer começo apresenta dificuldades que parecem intransponíveis. Mas tive sempre a minha mãe do meu lado. Foi ela quem me ajudou a encontrar forças para enfrentar as situações mais decepcionantes, negativas, as que me punham mesmo furiosa.
    > Al parecer, Andrea Guasch pone que una relación a distancia es muy difícil de llevar como excusa. Algo con lo que, por lo visto, Alex Lequio no está nada de acuerdo. ¿O es que más bien ya ha conseguido la fama que andaba buscando?
    > Vo väčšine golfových rezortov ide o veľký komplex niekoľkých ihrísk blízko pri sebe spojených s hotelmi a ďalšími možnosťami trávenia voľného času – nie vždy sú manželky či deti nadšenými golfistami, a tak potrebujú iný druh vyžitia. Zaujímavé kombinácie ponúkajú aj rakúske, švajčiarske či talianske Alpy, kde sa dá v zime lyžovať a v lete hrať golf pod vysokými alpskými končiarmi.""" > train.txt
    

    Code:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    trainfile = 'train.txt'

    # Vectorizing data: each line of train.txt is one document.
    word_vectorizer = CountVectorizer(analyzer='word')
    with open(trainfile, encoding='utf8') as fin:
        trainset = word_vectorizer.fit_transform(fin)
    # One language tag per line of train.txt, in the same order.
    tags = ['bs','pt','es','sr']

    # Training NB
    mnb = MultinomialNB()
    mnb.fit(trainset, tags)

    def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
        labelid = list(classifier.classes_).index(classlabel)
        feature_names = vectorizer.get_feature_names()
        topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

        for coef, feat in topn:
            print(classlabel, feat, coef)

    most_informative_feature_for_class(word_vectorizer, mnb, 'bs')
    print()
    most_informative_feature_for_class(word_vectorizer, mnb, 'pt')


    [out]:

    bs obećao -4.50534985071
    bs pošto -4.50534985071
    bs prava -4.50534985071
    bs predstavlja -4.50534985071
    bs prošlosedmičnom -4.50534985071
    bs sjeveru -4.50534985071
    bs taj -4.50534985071
    bs vladavine -4.50534985071
    bs će -4.50534985071
    bs da -4.0998847426
    
    pt teve -4.63472898823
    pt tive -4.63472898823
    pt todas -4.63472898823
    pt vida -4.63472898823
    pt de -4.22926388012
    pt foi -4.22926388012
    pt mais -4.22926388012
    pt me -4.22926388012
    pt as -3.94158180767
    pt que -3.94158180767
    
