Efficiently count word frequencies in Python

走了就别回头了 2020-11-29 04:33

I'd like to count frequencies of all words in a text file.

>>> countInFile('test.txt')

should return something like {'aaa': 1, 'bbb': 2, ...}.
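
For reference, here is a minimal pure-Python sketch of what countInFile could look like, assuming whitespace-separated tokens and a UTF-8 file (both are assumptions):

from collections import Counter

def countInFile(filename):
    # Assumes whitespace tokenization and UTF-8 encoding; streams the file line by line.
    with open(filename, encoding='utf8') as fin:
        return dict(Counter(word for line in fin for word in line.split()))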

8 Answers
  •  感动是毒
    2020-11-29 05:06

    A memory-efficient and accurate way is to make use of:

    • CountVectorizer from scikit-learn (for ngram extraction)
    • NLTK's word_tokenize (for tokenization)
    • a numpy matrix sum to collect the counts
    • collections.Counter to hold the counts and the vocabulary

    An example:

    import urllib.request
    from collections import Counter
    
    import numpy as np 
    
    from nltk import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer
    
    # Our sample textfile.
    url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
    response = urllib.request.urlopen(url)
    data = response.read().decode('utf8')
    
    
    # Note that `ngram_range=(1, 1)` means we want to extract Unigrams, i.e. tokens.
    ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
    # X matrix where the row represents sentences and column is our one-hot vector for each token in our vocabulary
    X = ngram_vectorizer.fit_transform(data.split('\n'))
    
    # Vocabulary
    vocab = list(ngram_vectorizer.get_feature_names())
    
    # Column-wise sum of the X matrix.
    # It's some crazy numpy syntax that looks horribly unpythonic
    # For details, see http://stackoverflow.com/questions/3337301/numpy-matrix-to-array
    # and http://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
    counts = X.sum(axis=0).A1
    
    freq_distribution = Counter(dict(zip(vocab, counts)))
    print(freq_distribution.most_common(10))
    

    [out]:

    [(',', 32000),
     ('.', 17783),
     ('de', 11225),
     ('a', 7197),
     ('que', 5710),
     ('la', 4732),
     ('je', 4304),
     ('se', 4013),
     ('на', 3978),
     ('na', 3834)]
    

    Essentially, you can also wrap this up as a function:

    from collections import Counter
    import numpy as np 
    from nltk import word_tokenize
    from sklearn.feature_extraction.text import CountVectorizer
    
    def freq_dist(data):
        """
        :param data: A string with sentences separated by '\n'
        :type data: str
        """
        ngram_vectorizer = CountVectorizer(analyzer='word', tokenizer=word_tokenize, ngram_range=(1, 1), min_df=1)
        X = ngram_vectorizer.fit_transform(data.split('\n'))
        vocab = list(ngram_vectorizer.get_feature_names())
        counts = X.sum(axis=0).A1
        return Counter(dict(zip(vocab, counts)))
    

    Let's time it:

    import time
    
    start = time.time()
    word_distribution = freq_dist(data)
    print(time.time() - start)
    

    [out]:

    5.257147789001465
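
    For comparison, a plain collections.Counter over the same data gives a rough baseline (a sketch; timings depend on your machine, and CountVectorizer lowercases by default, so its counts can differ slightly):

    import time
    from collections import Counter
    from nltk import word_tokenize
    
    start = time.time()
    # Tokenize line by line, mirroring the per-line split used above.
    baseline = Counter(word for line in data.split('\n') for word in word_tokenize(line))
    print(time.time() - start)
    print(baseline.most_common(10))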
    

    Note that CountVectorizer can also take a file object (iterated line by line) instead of a string, so there's no need to read the whole file into memory. In code:

    import io
    from collections import Counter
    
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    
    infile = '/path/to/input.txt'
    
    ngram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 1), min_df=1)
    
    with io.open(infile, 'r', encoding='utf8') as fin:
        X = ngram_vectorizer.fit_transform(fin)
        vocab = ngram_vectorizer.get_feature_names()
        counts = X.sum(axis=0).A1
        freq_distribution = Counter(dict(zip(vocab, counts)))
        print(freq_distribution.most_common(10))
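
    One caveat: newer scikit-learn releases deprecate get_feature_names (it was removed in 1.2) in favour of get_feature_names_out, so depending on your version the vocabulary lookup in the snippets above becomes:

    # scikit-learn >= 1.0: get_feature_names_out() replaces get_feature_names()
    vocab = ngram_vectorizer.get_feature_names_out()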
    
