Efficiently count word frequencies in python

走了就别回头了 2020-11-29 04:33

I'd like to count the frequencies of all words in a text file.

>>> countInFile('test.txt')

should return {'aaa': 1, 'bbb': ...}
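
A minimal sketch of such a countInFile, counting whitespace-separated tokens with collections.Counter (the function name comes from the question; streaming the file line by line is just one reasonable choice, and the answers below benchmark several variants):

from collections import Counter

def countInFile(file_path):
    # Count whitespace-separated words, streaming the file line by line.
    counts = Counter()
    with open(file_path, encoding='utf8') as fin:
        for line in fin:
            counts.update(line.split())
    return dict(counts)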

8 Answers
  •  伪装坚强ぢ
    2020-11-29 05:13

    Here's a benchmark. It may look surprising, but the crudest code wins.

    [code]:

    from collections import Counter, defaultdict
    import io, time

    from sklearn.feature_extraction.text import CountVectorizer

    infile = '/path/to/file'

    def extract_dictionary_sklearn(file_path):
        # Let CountVectorizer tokenize and count, then collapse the
        # document-term matrix into a single Counter.
        with io.open(file_path, 'r', encoding='utf8') as fin:
            ngram_vectorizer = CountVectorizer(analyzer='word')
            X = ngram_vectorizer.fit_transform(fin)
            # get_feature_names() was current at the time; newer sklearn
            # renames it to get_feature_names_out().
            vocab = ngram_vectorizer.get_feature_names()
            counts = X.sum(axis=0).A1
        return Counter(dict(zip(vocab, counts)))

    def extract_dictionary_native(file_path):
        # Plain Counter, updated once per line with whitespace tokens.
        dictionary = Counter()
        with io.open(file_path, 'r', encoding='utf8') as fin:
            for line in fin:
                dictionary.update(line.split())
        return dictionary

    def extract_dictionary_paddle(file_path):
        # defaultdict(int), incremented one word at a time.
        dictionary = defaultdict(int)
        with io.open(file_path, 'r', encoding='utf8') as fin:
            for line in fin:
                for word in line.split():
                    dictionary[word] += 1
        return dictionary

    start = time.time()
    extract_dictionary_sklearn(infile)
    print(time.time() - start)

    start = time.time()
    extract_dictionary_native(infile)
    print(time.time() - start)

    start = time.time()
    extract_dictionary_paddle(infile)
    print(time.time() - start)

    

    [out]:

    38.306814909
    24.8241138458
    12.1182529926
    

    Data size (154MB) used in the benchmark above:

    $ wc -c /path/to/file
    161680851
    
    $ wc -l /path/to/file
    2176141
    

    Some things to note:

    • With the sklearn version, there is the overhead of creating the vectorizer, plus the numpy manipulation and the conversion into a Counter object.
    • With the native Counter version, Counter.update() appears to be an expensive operation when called once per line; a single-pass variant is sketched after this list.
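
    If the per-line Counter.update() cost is the bottleneck, one variation worth trying (a sketch, not part of the benchmark above) is to read the whole file and split it once, so the Counter is built in a single call at the price of holding the text in memory:

    from collections import Counter
    import io

    def extract_dictionary_single_update(file_path):
        # Hypothetical variant: one read, one split, one Counter call,
        # trading memory for fewer per-line update() calls.
        with io.open(file_path, 'r', encoding='utf8') as fin:
            return Counter(fin.read().split())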
