Generating Ngrams (Unigrams, Bigrams, etc.) from a large corpus of .txt files and their Frequency

Asked 2020-12-12 23:12

I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written co

6 Answers
  • 2020-12-12 23:18

    Ok, so since you asked for an NLTK solution this might not be exactly what you were looking for, but have you considered TextBlob? It has an NLTK backend but a simpler syntax. It would look something like this:

    from textblob import TextBlob
    
    text = "Paste your text or text-containing variable here" 
    blob = TextBlob(text)
    ngram_var = blob.ngrams(n=3)
    print(ngram_var)
    
    Output:
    [WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]
    

    You would of course still need to use Counter or some other method to add a count per ngram.
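
    For example, here is a minimal counting sketch pairing the TextBlob output above with collections.Counter; the WordList results are converted to plain tuples so they can be used as dictionary keys:

    from collections import Counter
    from textblob import TextBlob

    text = "Paste your text or text-containing variable here"
    blob = TextBlob(text)

    # WordList objects are list-like and unhashable, so turn each
    # n-gram into a tuple of strings before counting it.
    trigram_counts = Counter(tuple(str(w) for w in ng) for ng in blob.ngrams(n=3))
    print(trigram_counts.most_common(5))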

    However, the fastest approach (by far) I have been able to find to both create any ngram you'd like and count them in a single function stems from this post from 2012 and uses itertools. It's great.

  • 2020-12-12 23:23

    Here is a simple example using pure Python to generate any ngram:

    >>> def ngrams(s, n=2, i=0):
    ...     while len(s[i:i+n]) == n:
    ...         yield s[i:i+n]
    ...         i += 1
    ...
    >>> txt = 'Python is one of the awesomest languages'
    
    >>> unigram = ngrams(txt.split(), n=1)
    >>> list(unigram)
    [['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]
    
    >>> bigram = ngrams(txt.split(), n=2)
    >>> list(bigram)
    [['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]
    
    >>> trigram = ngrams(txt.split(), n=3)
    >>> list(trigram)
    [['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest', 'languages']]
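
    Since the generator yields lists (which are not hashable), a small counting sketch converts each n-gram to a tuple before handing it to collections.Counter:

    >>> from collections import Counter
    >>> bigram_counts = Counter(tuple(ng) for ng in ngrams(txt.split(), n=2))
    >>> bigram_counts.most_common(2)
    [(('Python', 'is'), 1), (('is', 'one'), 1)]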
    
  • 2020-12-12 23:27

    Just use nltk.ngrams.

    import nltk
    from nltk import word_tokenize
    from nltk.util import ngrams
    from collections import Counter
    
    text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
    txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams.\ 
    I need to write a program in NLTK that breaks a corpus"
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token,2)
    trigrams = ngrams(token,3)
    fourgrams = ngrams(token,4)
    fivegrams = ngrams(token,5)
    
    print(Counter(bigrams))
    
    Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
     ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
     ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
     ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
    ','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
     (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
     'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
     ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
    ('collection', 'of'): 1, ('files', ')'): 1})
    

    UPDATE (reading a whole directory of .txt files):

    import os
    from collections import Counter
    import nltk
    from nltk.util import ngrams

    corpus = []
    path = '.'
    for i in next(os.walk(path))[2]:   # filenames in the current directory
        if i.endswith('.txt'):
            with open(os.path.join(path, i)) as f:
                corpus.append(f.read())

    frequencies = Counter()
    for text in corpus:
        token = nltk.word_tokenize(text)
        bigrams = ngrams(token, 2)
        frequencies += Counter(bigrams)
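
    Once the loop has finished, the resulting Counter can be queried directly; a small usage sketch (assuming the loop above has populated frequencies):

    print(frequencies.most_common(10))  # the ten most frequent bigrams in the corpus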
    
  • 2020-12-12 23:36

    If efficiency is an issue and you have to build multiple different n-grams, but you want to use pure Python, I would do:

    from itertools import chain
    
    def n_grams(seq, n=1):
        """Returns an iterator over the n-grams given a list_tokens"""
        shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
        shifted_tokens = (shift_token(i) for i in range(n))
        tuple_ngrams = zip(*shifted_tokens)
        return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)
    
    def range_ngrams(list_tokens, ngram_range=(1,2)):
        """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
        return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))
    

    Usage :

    >>> input_list = 'test the ngrams generator'.split()
    >>> list(range_ngrams(input_list, ngram_range=(1,3)))
    [('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
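
    If you also need frequencies, the tuples yielded above are hashable, so (as a small sketch) they can be fed straight into collections.Counter:

    >>> from collections import Counter
    >>> Counter(range_ngrams(input_list, ngram_range=(1,3))).most_common(2)
    [(('test',), 1), (('the',), 1)]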
    

    ~Same speed as NLTK:

    import nltk
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=5)
    # 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    n_grams(input_list,n=5)
    # 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    nltk.ngrams(input_list,n=1)
    nltk.ngrams(input_list,n=2)
    nltk.ngrams(input_list,n=3)
    nltk.ngrams(input_list,n=4)
    nltk.ngrams(input_list,n=5)
    # 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %%timeit
    input_list = 'test the ngrams interator vs nltk '*10**6
    range_ngrams(input_list, ngram_range=(1,6))
    # 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Repost from my previous answer.

  • 2020-12-12 23:36

    The answer from @hellpander above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably every time frequencies is updated, because dictionary lookups become expensive as its contents grow. So you need an additional buffer variable to cache the frequencies Counter from @hellpander's answer. Instead of doing a key lookup against the very large frequencies Counter every time a new document is processed, you add counts to a temporary, smaller Counter; then, after some number of iterations, you merge it into the global frequencies. This way the huge dictionary lookup happens much less often, which makes it considerably faster.

    import os
    from collections import Counter
    import nltk
    from nltk.util import ngrams

    corpus = []
    path = '.'
    for filename in next(os.walk(path))[2]:
        if filename.endswith('.txt'):
            with open(os.path.join(path, filename)) as fh:
                corpus.append(fh.read())

    frequencies = Counter()   # global counts, merged into only occasionally
    local_counts = Counter()  # small buffer that is cheap to update

    for i in range(len(corpus)):
        token = nltk.word_tokenize(corpus[i])
        bigrams = ngrams(token, 2)
        local_counts += Counter(bigrams)
        if i % 10000 == 0:
            # merge the buffer into the global counter and clear it every 10000 docs
            frequencies += local_counts
            local_counts = Counter()

    frequencies += local_counts  # flush what remains in the buffer after the loop
    
  • 2020-12-12 23:40

    Maybe it helps; see the link.

    import spacy

    nlp_en = spacy.load("en_core_web_sm")
    # `doc` was not defined in the original snippet; assume it is built from some text:
    doc = nlp_en("Paste your text here")
    tokens = [x.text for x in doc]
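
    This only tokenizes the text, though; to actually produce n-grams and their counts from those spaCy tokens, one rough sketch (reusing the tokens list from the snippet above) is to pass them to nltk.util.ngrams and a Counter:

    from collections import Counter
    from nltk.util import ngrams

    bigram_counts = Counter(ngrams(tokens, 2))
    print(bigram_counts.most_common(5))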
    