NLP in Python: Obtain word names from SelectKBest after vectorizing

野的像风 2021-01-14 11:16

I can't seem to find an answer to my exact problem. Can anyone help?

A simplified description of my dataframe ("df"): It has 2 columns: one is a bunch of text ("

2 answers
  •  春和景丽
    2021-01-14 12:09

    After figuring out what I really wanted to do (thanks, Daniel) and doing more research, I found a couple of other ways to meet my objective.

    Way 1 - https://glowingpython.blogspot.com/2014/02/terms-selection-with-chi-square.html

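    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import chi2

    # Build the bag-of-words matrix from the text column
    vectorizer = CountVectorizer(lowercase=True, stop_words='english')
    X = vectorizer.fit_transform(df["Notes"])

    # chi2 returns (scores, p-values); keep only the scores
    chi2score = chi2(X, df['AboveAverage'])[0]

    # Pair each word with its score. get_feature_names_out() replaces
    # get_feature_names(), which was removed in scikit-learn 1.2.
    wscores = zip(vectorizer.get_feature_names_out(), chi2score)
    wchi2 = sorted(wscores, key=lambda x: x[1])
    # The 20 highest-scoring words: one tuple of words, one tuple of scores
    topchi2 = zip(*wchi2[-20:])
    show = list(topchi2)
    show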
    

    Way 2 - This is the way I used, because it was the easiest for me to understand and it produces a nice output listing each word with its chi2 score and p-value. It comes from another thread on here: Sklearn Chi2 For Feature Selection

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    vectorizer = CountVectorizer(lowercase=True, stop_words='english')
    X = vectorizer.fit_transform(df["Notes"])

    y = df['AboveAverage']

    # Select 10 features with highest chi-squared statistics
    chi2_selector = SelectKBest(chi2, k=10)
    chi2_selector.fit(X, y)

    # Look at scores returned from the selector for each feature.
    # get_feature_names_out() replaces get_feature_names(), which was
    # removed in scikit-learn 1.2.
    chi2_scores = pd.DataFrame(
        list(zip(vectorizer.get_feature_names_out(),
                 chi2_selector.scores_,
                 chi2_selector.pvalues_)),
        columns=['ftr', 'score', 'pval'])
    chi2_scores
    
