Generating n-grams (unigrams, bigrams, etc.) from a large corpus of .txt files and counting their frequencies

孤街浪徒 2020-12-12 23:12

I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written co

6 answers
  •  [愿得一人]
    2020-12-12 23:27

    Just use nltk.ngrams.

    from collections import Counter

    from nltk import word_tokenize  # requires NLTK's tokenizer data to be installed
    from nltk.util import ngrams

    text = ("I need to write a program in NLTK that breaks a corpus (a large collection of "
            "txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. "
            "I need to write a program in NLTK that breaks a corpus")
    token = word_tokenize(text)
    # ngrams() returns a one-shot generator; wrap it in list() to reuse it
    bigrams = ngrams(token, 2)
    trigrams = ngrams(token, 3)
    fourgrams = ngrams(token, 4)
    fivegrams = ngrams(token, 5)

    print(Counter(bigrams))
    
    Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
     ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
     ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
     ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
    ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
     (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
     'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
     ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
    ('collection', 'of'): 1, ('files', ')'): 1})
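
To make it clearer what is being counted, here is a minimal sketch of the same sliding-window idea without NLTK. The helper name `simple_ngrams` and the whitespace tokenization are assumptions for illustration only; this is not how `nltk.util.ngrams` is implemented, and `str.split()` does not separate punctuation the way `word_tokenize` does.

```python
from collections import Counter


def simple_ngrams(tokens, n):
    """Return an iterator of n-gram tuples built from n staggered slices."""
    # zip stops at the shortest slice, so exactly len(tokens) - n + 1 tuples come out
    return zip(*(tokens[i:] for i in range(n)))


# Assumption: plain whitespace splitting stands in for a real tokenizer.
tokens = "I need to write a program I need to".split()
counts = Counter(simple_ngrams(tokens, 2))
print(counts.most_common(2))
```

Each bigram is a tuple of adjacent tokens, and `Counter` simply tallies how often each tuple occurs.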
    

    UPDATE (reading the files with plain Python):

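    import os
    from collections import Counter

    from nltk import word_tokenize
    from nltk.util import ngrams

    path = '.'
    frequencies = Counter()
    # next(os.walk(path))[2] lists the files directly inside path
    for name in next(os.walk(path))[2]:
        if name.endswith('.txt'):
            # process one file at a time and close it when done
            with open(os.path.join(path, name)) as f:
                token = word_tokenize(f.read())
            frequencies.update(ngrams(token, 2))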
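
Since the question asks for unigrams through fivegrams, the loop above could be extended to keep one counter per order. The following is a rough sketch under stated assumptions: `count_ngrams` and `TOKEN_RE` are hypothetical names, the regex is a crude substitute for `word_tokenize`, and n-grams are not allowed to span two files.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Assumption: a crude word tokenizer, not equivalent to nltk.word_tokenize.
TOKEN_RE = re.compile(r"\w+")


def count_ngrams(directory, max_n=5):
    """Return {n: Counter of n-gram tuples} over every .txt file in directory."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for path in sorted(Path(directory).glob("*.txt")):
        # Read one file at a time so the whole corpus is never held in memory.
        text = path.read_text(encoding="utf-8", errors="ignore")
        tokens = TOKEN_RE.findall(text)
        for n in counts:
            # In this sketch n-grams do not cross file boundaries.
            counts[n].update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Tiny demonstration on a throwaway directory with two files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("I need to write a program", encoding="utf-8")
    Path(tmp, "b.txt").write_text("I need to break a corpus", encoding="utf-8")
    demo = count_ngrams(tmp)
    print(demo[2].most_common(3))
```

Keeping a separate `Counter` per order means the tokens of each file are scanned once per order, which is usually acceptable; for a very large corpus the counters themselves, not the file reads, tend to dominate memory use.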
    
