Generating Ngrams (Unigrams, Bigrams, etc.) from a large corpus of .txt files and their Frequency

Asked 2020-12-12 23:12

I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written co

6 Answers
  • 2020-12-12 23:18

    Ok, so since you asked for an NLTK solution this might not be exactly what you were looking for, but have you considered TextBlob? It has an NLTK backend but a simpler syntax. It would look something like this:

    from textblob import TextBlob
    
    text = "Paste your text or text-containing variable here" 
    blob = TextBlob(text)
    ngram_var = blob.ngrams(n=3)
    print(ngram_var)
    
    Output:
    [WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]
    

    You would of course still need to use Counter or some other method to add a count per ngram.
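
    For example, here is a minimal counting sketch pairing the TextBlob output above with collections.Counter; the WordList results are converted to plain tuples so they can be used as dictionary keys:

    from collections import Counter
    from textblob import TextBlob

    text = "Paste your text or text-containing variable here"
    blob = TextBlob(text)

    # WordList objects are list-like and unhashable, so turn each
    # n-gram into a tuple of strings before counting it.
    trigram_counts = Counter(tuple(str(w) for w in ng) for ng in blob.ngrams(n=3))
    print(trigram_counts.most_common(5))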

    However, the fastest approach (by far) I have been able to find to both create any ngram you'd like and count them in a single function stems from this post from 2012 and uses itertools. It's great.

  • 2020-12-12 23:23

    Here is a simple example using pure Python to generate any ngram:

    >>> def ngrams(s, n=2, i=0):
    ...     while len(s[i:i+n]) == n:
    ...         yield s[i:i+n]
    ...         i += 1
    ...
    >>> txt = 'Python is one of the awesomest languages'
    
    >>> unigram = ngrams(txt.split(), n=1)
    >>> list(unigram)
    [['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]
    
    >>> bigram = ngrams(txt.split(), n=2)
    >>> list(bigram)
    [['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]
    
    >>> trigram = ngrams(txt.split(), n=3)
    >>> list(trigram)
    [['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest', 'languages']]
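
    Since the generator yields lists (which are not hashable), a small counting sketch converts each n-gram to a tuple before handing it to collections.Counter:

    >>> from collections import Counter
    >>> bigram_counts = Counter(tuple(ng) for ng in ngrams(txt.split(), n=2))
    >>> bigram_counts.most_common(2)
    [(('Python', 'is'), 1), (('is', 'one'), 1)]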
    
  • 2020-12-12 23:27

    Just use nltk.ngrams.

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    from collections import Counter
    
    text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
    txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams.\ 
    I need to write a program in NLTK that breaks a corpus"
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token,2)
    trigrams = ngrams(token,3)
    fourgrams = ngrams(token,4)
    fivegrams = ngrams(token,5)
    
    print(Counter(bigrams))
    
    Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
     ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
     ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
     ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
    ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
     (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
     'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
     ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
    ('collection', 'of'): 1, ('files', ')'): 1})
    

    UPDATE (reading a whole directory of .txt files):

    import os
    from collections import Counter
    import nltk
    from nltk.util import ngrams

    corpus = []
    path = '.'
    for i in next(os.walk(path))[2]:   # filenames in the current directory
        if i.endswith('.txt'):
            with open(os.path.join(path, i)) as f:
                corpus.append(f.read())

    frequencies = Counter()
    for text in corpus:
        token = nltk.word_tokenize(text)
        bigrams = ngrams(token, 2)
        frequencies += Counter(bigrams)
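
    Once the loop has finished, the resulting Counter can be queried directly; a small usage sketch (assuming the loop above has populated frequencies):

    print(frequencies.most_common(10))  # the ten most frequent bigrams in the corpus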
    
  • 2020-12-12 23:36

    If efficiency is an issue and you have to build multiple different n-grams, but you want to use pure Python, I would do:

    from itertools import chain
    
    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams given a list_tokens"""
        shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
        shifted_tokens = (shift_token(i) for i in range(n))
        tuple_ngrams = zip(*shifted_tokens)
        return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)
    
    def range_ngrams(list_tokens, ngram_range=(1,2)):
        """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
        return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))
    

    Usage :

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngram_range=(1,3)))
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
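
    If you also need frequencies, the tuples yielded above are hashable, so (as a small sketch) they can be fed straight into collections.Counter:

    >>> from collections import Counter
    >>> Counter(range_ngrams(input_list, ngram_range=(1,3))).most_common(2)
    [(('test',), 1), (('the',), 1)]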
    

    ~Same speed as NLTK:

    import nltk
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    n_grams(input_list,n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=1)
    nltk.ngrams(input_list,n=2)
    nltk.ngrams(input_list,n=3)
    nltk.ngrams(input_list,n=4)
    nltk.ngrams(input_list,n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    range_ngrams(input_list, ngram_range=(1,6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Repost from my previous answer.

  • 2020-12-12 23:36

    The answer from @hellpander above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably every time frequencies is updated, because dictionary lookups become expensive as its contents grow. So you need an additional buffer variable to cache the frequencies Counter from @hellpander's answer. Instead of doing a key lookup against the very large frequencies Counter every time a new document is processed, you add counts to a temporary, smaller Counter; then, after some number of iterations, you merge it into the global frequencies. This way the huge dictionary lookup happens much less often, which makes it considerably faster.

    import os
    from collections import Counter
    import nltk
    from nltk.util import ngrams

    corpus = []
    path = '.'
    for filename in next(os.walk(path))[2]:
        if filename.endswith('.txt'):
            with open(os.path.join(path, filename)) as fh:
                corpus.append(fh.read())

    frequencies = Counter()   # global counts, merged into only occasionally
    local_counts = Counter()  # small buffer that is cheap to update

    for i in range(len(corpus)):
        token = nltk.word_tokenize(corpus[i])
        bigrams = ngrams(token, 2)
        local_counts += Counter(bigrams)
        if i % 10000 == 0:
            # merge the buffer into the global counter and clear it every 10000 docs
            frequencies += local_counts
            local_counts = Counter()

    frequencies += local_counts  # flush what remains in the buffer after the loop
    
  • 2020-12-12 23:40

    Maybe it helps; see the link.

    import spacy

    nlp_en = spacy.load("en_core_web_sm")
    # `doc` was not defined in the original snippet; assume it is built from some text:
    doc = nlp_en("Paste your text here")
    tokens = [x.text for x in doc]
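
    This only tokenizes the text, though; to actually produce n-grams and their counts from those spaCy tokens, one rough sketch (reusing the tokens list from the snippet above) is to pass them to nltk.util.ngrams and a Counter:

    from collections import Counter
    from nltk.util import ngrams

    bigram_counts = Counter(ngrams(tokens, 2))
    print(bigram_counts.most_common(5))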
    