How do I get word frequency in a corpus using Scikit Learn CountVectorizer?


I'm trying to compute a simple word frequency using scikit-learn's CountVectorizer.

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

Printing cv.vocabulary_ maps each word to 0, 1, 2 or 3, which is not what I expected. How do I get the actual frequency of each word in the corpus?

4 Answers
  • 2020-12-23 18:00

    cv_fit.toarray().sum(axis=0) definitely gives the correct result, but it will be much faster to perform the sum on the sparse matrix and then transform it to an array:

    np.asarray(cv_fit.sum(axis=0))
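
    For context, here is a rough end-to-end sketch of that approach, assuming the same example texts used in the other answers (the counts name is mine, not from the original answer):

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np

    texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
    cv = CountVectorizer()
    cv_fit = cv.fit_transform(texts)  # sparse document-term matrix

    # sum the sparse matrix directly, then flatten the resulting 1 x n_features matrix
    counts = np.asarray(cv_fit.sum(axis=0)).ravel()

    # get_feature_names() is get_feature_names_out() in newer scikit-learn versions
    print(dict(zip(cv.get_feature_names(), counts.tolist())))
    # {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}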
    
  • 2020-12-23 18:10

    Combining everyone else's views with some of my own :) here is what I have for you:

    from collections import Counter
    from nltk.tokenize import RegexpTokenizer
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    
    text='''Note that if you use RegexpTokenizer option, you lose 
    natural language features special to word_tokenize 
    like splitting apart contractions. You can naively 
    split on the regex \w+ without any need for the NLTK.
    '''
    
    # tokenize
    raw = ' '.join(word_tokenize(text.lower()))
    
    tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
    words = tokenizer.tokenize(raw)
    
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    
    # count word frequency, sort and return just 20
    counter = Counter()
    counter.update(words)
    most_common = counter.most_common(20)
    most_common
    

    #Output (All ones)

    [('note', 1),
     ('use', 1),
     ('regexptokenizer', 1),
     ('option', 1),
     ('lose', 1),
     ('natural', 1),
     ('language', 1),
     ('features', 1),
     ('special', 1),
     ('word', 1),
     ('tokenize', 1),
     ('like', 1),
     ('splitting', 1),
     ('apart', 1),
     ('contractions', 1),
     ('naively', 1),
     ('split', 1),
     ('regex', 1),
     ('without', 1),
     ('need', 1)]
    

    One can do better than this in terms of efficiency, but if you are not too worried about that, this code does the job. A CountVectorizer-only version is sketched below for comparison.
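
    A rough scikit-learn-only sketch of the same kind of frequency count, assuming CountVectorizer's built-in stop_words='english' list is acceptable (it differs slightly from NLTK's stop list, so the exact counts may differ):

    from sklearn.feature_extraction.text import CountVectorizer
    import numpy as np

    text = '''Note that if you use RegexpTokenizer option, you lose
    natural language features special to word_tokenize
    like splitting apart contractions.'''

    # token_pattern mirrors the RegexpTokenizer above: words of 2+ letters
    cv = CountVectorizer(stop_words='english', token_pattern=r'[A-Za-z]{2,}')
    X = cv.fit_transform([text])

    # sum over documents, then pair each feature name with its total count
    counts = np.asarray(X.sum(axis=0)).ravel()
    freq = sorted(zip(cv.get_feature_names(), counts.tolist()),
                  key=lambda pair: pair[1], reverse=True)
    print(freq[:20])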

  • 2020-12-23 18:12

    cv.vocabulary_ in this instance is a dict, where the keys are the words (features) that were found and the values are column indices into the count matrix, not counts, which is why they're 0, 1, 2, 3. It's just bad luck that it looked similar to your counts :)

    You need to work with the cv_fit object to get the counts:

    from sklearn.feature_extraction.text import CountVectorizer
    
    texts=["dog cat fish","dog cat cat","fish bird", 'bird']
    cv = CountVectorizer()
    cv_fit=cv.fit_transform(texts)
    
    print(cv.get_feature_names())
    print(cv_fit.toarray())
    #['bird', 'cat', 'dog', 'fish']
    #[[0 1 1 1]
    # [0 2 1 0]
    # [1 0 0 1]
    # [1 0 0 0]]
    

    Each row in the array is one of your original documents (strings), each column is a feature (word), and each element is the count of that word in that document. You can see that if you sum each column you get the correct totals:

    print(cv_fit.toarray().sum(axis=0))
    #[2 3 2 2]
    

    Honestly though, I'd suggest using collections.Counter or something from NLTK, unless you have some specific reason to use scikit-learn, as it'll be simpler.
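
    A minimal Counter-based sketch of that suggestion, assuming plain whitespace tokenization of the same example texts:

    from collections import Counter

    texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

    # naive whitespace tokenization; swap in nltk.word_tokenize for real text
    counter = Counter(word for doc in texts for word in doc.split())
    print(counter.most_common())
    # [('cat', 3), ('dog', 2), ('fish', 2), ('bird', 2)]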

  • 2020-12-23 18:14

    We can use zip to build a dict from the list of words and the list of their counts:

    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

    cv = CountVectorizer()
    cv_fit = cv.fit_transform(texts)
    word_list = cv.get_feature_names()
    count_list = cv_fit.toarray().sum(axis=0)
    

    print(word_list)
    ['bird', 'cat', 'dog', 'fish']
    print(count_list)
    [2 3 2 2]
    print(dict(zip(word_list, count_list)))
    {'bird': 2, 'cat': 3, 'dog': 2, 'fish': 2}
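
    Since pandas is already imported, an optional variant (my addition, not part of the original answer) puts the same counts into a pandas Series so they can be sorted easily:

    import pandas as pd

    # index the counts by word, then sort with the highest count first
    word_freq = pd.Series(count_list, index=word_list).sort_values(ascending=False)
    print(word_freq)
    # cat comes out on top with 3; bird, dog and fish follow with 2 each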
