How do I get word frequency in a corpus using Scikit Learn CountVectorizer?

佛祖请我去吃肉 2020-12-23 17:14

I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


        
4 Answers
  •  忘掉有多难
    2020-12-23 18:10

    Combining everyone else's views and some of my own :) here is what I have for you:

    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer, word_tokenize

    # requires the NLTK 'punkt' and 'stopwords' data:
    # import nltk; nltk.download('punkt'); nltk.download('stopwords')

    text = '''Note that if you use RegexpTokenizer option, you lose
    natural language features special to word_tokenize
    like splitting apart contractions. You can naively
    split on the regex \\w+ without any need for the NLTK.
    '''

    # tokenize with word_tokenize, then keep only alphabetic tokens of 2+ letters
    raw = ' '.join(word_tokenize(text.lower()))

    tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
    words = tokenizer.tokenize(raw)

    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # count word frequency, sort and return just the top 20
    counter = Counter(words)
    most_common = counter.most_common(20)
    most_common


    # Output (all counts are 1, because no word repeats in the sample text)

    [('note', 1),
     ('use', 1),
     ('regexptokenizer', 1),
     ('option', 1),
     ('lose', 1),
     ('natural', 1),
     ('language', 1),
     ('features', 1),
     ('special', 1),
     ('word', 1),
     ('tokenize', 1),
     ('like', 1),
     ('splitting', 1),
     ('apart', 1),
     ('contractions', 1),
     ('naively', 1),
     ('split', 1),
     ('regex', 1),
     ('without', 1),
     ('need', 1)]
    

    One can do better than this in terms of efficiency, but if you are not too worried about that, this code works well.
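
    Since the question asks specifically about CountVectorizer, here is a minimal sketch of the same count done with scikit-learn instead of NLTK. This is my own addition rather than part of the original answer; it assumes the same text variable as above and a recent scikit-learn.

    from sklearn.feature_extraction.text import CountVectorizer

    # CountVectorizer expects an iterable of documents; here the corpus is one string
    vectorizer = CountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform([text])

    # sum the counts over all documents and pair each total with its vocabulary term
    totals = counts.sum(axis=0).A1
    vocab = vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
    freq = sorted(zip(vocab, totals), key=lambda x: x[1], reverse=True)
    freq[:20]

    Unlike the NLTK version, CountVectorizer handles lowercasing, tokenization, and stop-word removal itself, so the whole pipeline reduces to fit_transform plus a column sum.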
