Efficiently count word frequencies in Python

走了就别回头了 2020-11-29 04:33

I'd like to count the frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa':1, 'bbb':

8 Answers
  •  失恋的感觉
    2020-11-29 04:57

    Combining everyone else's views and some of my own :) Here is what I have for you:

    from collections import Counter
    from nltk.tokenize import RegexpTokenizer, word_tokenize
    from nltk.corpus import stopwords
    
    # raw string, so the literal \w+ below is not treated as an escape
    text = r'''Note that if you use RegexpTokenizer option, you lose 
    natural language features special to word_tokenize 
    like splitting apart contractions. You can naively 
    split on the regex \w+ without any need for the NLTK.
    '''
    
    # tokenize
    raw = ' '.join(word_tokenize(text.lower()))
    
    tokenizer = RegexpTokenizer(r'[A-Za-z]{2,}')
    words = tokenizer.tokenize(raw)
    
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    
    # count word frequency, sort and return just 20
    counter = Counter(words)
    most_common = counter.most_common(20)
    most_common
    

    Output

    (All ones)

    [('note', 1),
     ('use', 1),
     ('regexptokenizer', 1),
     ('option', 1),
     ('lose', 1),
     ('natural', 1),
     ('language', 1),
     ('features', 1),
     ('special', 1),
     ('word', 1),
     ('tokenize', 1),
     ('like', 1),
     ('splitting', 1),
     ('apart', 1),
     ('contractions', 1),
     ('naively', 1),
     ('split', 1),
     ('regex', 1),
     ('without', 1),
     ('need', 1)]
    

    One can do better than this in terms of efficiency, but if you are not too worried about that, this code works well.
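    If efficiency matters and the NLTK features are not needed, the tokenize-then-filter steps above can be collapsed into a single regex pass plus one `Counter` call. A sketch (the name `top_words` and the letters-only token pattern are illustrative assumptions):

    ```python
    import re
    from collections import Counter

    def top_words(text, n=20):
        # Tokenize with a plain regex: runs of two or more
        # letters, lowercased. No NLTK downloads required.
        words = re.findall(r'[a-z]{2,}', text.lower())
        # Count everything in one pass and keep the n most common.
        return Counter(words).most_common(n)
    ```

    This skips stopword removal; a `set` membership test can be added back inside a comprehension if needed.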
