Question
For example, we have the following text:
"Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."
I need all possible sections of this text, in order: one word at a time, then two at a time, three at a time, and so on up to five at a time, like this:
ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing', 'fast', 'distributed', 'programs', ...]
twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing' ...]
threes : ['Spark is a', 'is a framework', 'a framework for', 'framework for writing', 'for writing fast', ...]
. . .
fives : ['Spark is a framework for', 'is a framework for writing', 'a framework for writing fast','framework for writing fast distributed', ...]
Please note that the text to be processed is huge (about 100 GB), so I need the best solution for this process. Maybe it should be processed with multiple threads in parallel.
I don't need the whole list at once; it can be streamed.
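To make the desired streaming behaviour concrete, here is a minimal sketch of what I have in mind, assuming plain whitespace tokenization (the function name and the window size are only illustrative):
from collections import deque

def stream_ngrams(tokens, n):
    # yield n-grams as strings one at a time, never building the full list
    window = deque(maxlen=n)
    for token in tokens:
        window.append(token)
        if len(window) == n:
            yield ' '.join(window)

text = "Spark is a framework for writing fast, distributed programs."
print(list(stream_ngrams(text.split(), 2))[:3])  # ['Spark is', 'is a', 'a framework']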
Answer 1:
First of all, make sure that your file has lines; then you can read it line by line with no worries (discussed here):
with open('my100GBfile.txt') as corpus:
    for line in corpus:
        sequence = preprocess(line)
        extract_n_grams(sequence)
Let's assume that your corpus doesn't need any special treatment. I guess you can find a suitable treatment for your text; I only want it to be chunked into the desired tokens:
def preprocess(string):
    # do whatever preprocessing needs to be done,
    # e.g. convert to lowercase: string = string.lower()
    # then return the sequence of tokens
    return string.split()
I don't know what you want to do with the n-grams. Let's assume that you want to count them for a language model that fits in your memory (it usually does, but I'm not sure about 4- and 5-grams). The easy way is to use the off-the-shelf nltk library:
from nltk.util import ngrams

lm = {n: dict() for n in range(1, 6)}

def extract_n_grams(sequence):
    for n in range(1, 6):
        ngram = ngrams(sequence, n)
        # now you have the n-grams; you can do whatever you want with them
        # yield ngram
        # or you can count them for your language model:
        for item in ngram:
            lm[n][item] = lm[n].get(item, 0) + 1
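To see how the pieces above fit together, here is a small usage check (assuming preprocess is left as the plain split() shown above, so tokens keep their original case; the expected counts are for this single sentence):
extract_n_grams(preprocess("Spark is a framework for writing fast, distributed programs."))
print(lm[1][('Spark',)])                                # 1
print(lm[2][('Spark', 'is')])                           # 1
print(lm[5][('Spark', 'is', 'a', 'framework', 'for')])  # 1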
Answer 2:
If you don't need the whole list at once, then the best choice should be to use iterators. Thus, my solution looks like this:
import re

text = "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API."
word_re = re.compile(r"\w+")
words = [text[word.start():word.end()] for word in word_re.finditer(text)]
ngrams = ((words[k] for k in range(j, j + i + 1)) for i in range(len(words)) for j in range(len(words) - i))
for ngram in ngrams:
    for word in ngram:
        print(word, end=' ')
    print()
This gives you all the needed ngrams in the desired order. Note that iterators are inevitable even for the ngrams themselves, as your text is huge (about 100 GB), and the majority of your "all possible sections" will not fit in memory.
Note also that in your case you will need to count the length of your text and get the words from it separately, as you will not be able to hold it in memory like in my code sample.
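If even the word list is too large to hold, one option is to stream the tokens and keep only a bounded window. Here is a hedged sketch of that variant (the file name is taken from the first answer, the function names are illustrative, and it emits n-grams grouped by position rather than by size):
from collections import deque

def iter_tokens(path):
    # read the corpus line by line so the whole text never sits in memory
    with open(path) as corpus:
        for line in corpus:
            for token in line.split():
                yield token

def iter_ngrams(tokens, max_n=5):
    # keep only the last max_n tokens and emit every n-gram ending at the newest one
    window = deque(maxlen=max_n)
    for token in tokens:
        window.append(token)
        for n in range(1, len(window) + 1):
            yield tuple(list(window)[-n:])

for ngram in iter_ngrams(iter_tokens('my100GBfile.txt')):
    pass  # consume the stream here: count, filter, or write the n-grams out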
Answer 3:
This should ideally do it. You can customize the parameters min_len and max_len to suit your needs. Applying a sort function can also give you a good idea of which n-grams stand out from the others.
import nltk
from nltk.util import everygrams

content = "Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."
# set users_minimum and users_maximum to the n-gram sizes you want
tokens = nltk.word_tokenize(content)
bgs = everygrams(tokens, min_len=users_minimum, max_len=users_maximum)
fdist_bg = nltk.FreqDist(bgs)
for k, v in fdist_bg.items():
    print(k, v)
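For instance, with the bounds from the question (1 through 5), FreqDist.most_common gives you the sorting mentioned above; the concrete bounds here are just an illustration:
bgs = everygrams(tokens, min_len=1, max_len=5)
fdist_bg = nltk.FreqDist(bgs)
print(fdist_bg.most_common(3))  # the three most frequent n-grams with their counts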
And since you mention parallel execution, you can check out the following code snippet using Spark MLlib and Python:
from pyspark.ml.feature import NGram

# sqlContext is the SQLContext provided by the PySpark shell
wordDataFrame = sqlContext.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"]),
    (1, ["I", "wish", "Java", "could", "use", "case", "classes"]),
    (2, ["Logistic", "regression", "models", "are", "neat"])], ["label", "words"])
ngram = NGram(inputCol="words", outputCol="ngrams")
ngramDataFrame = ngram.transform(wordDataFrame)
for ngrams_label in ngramDataFrame.select("ngrams", "label").take(3):
    print(ngrams_label)
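Note that NGram defaults to n=2, so the snippet above only produces bigrams. Here is a small sketch of covering sizes 1 through 5 on the same DataFrame (the output column names are only illustrative):
for n in range(1, 6):
    ngram_n = NGram(n=n, inputCol="words", outputCol="ngrams_{0}".format(n))
    wordDataFrame = ngram_n.transform(wordDataFrame)
wordDataFrame.select("ngrams_1", "ngrams_5").show(truncate=False)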
A link to this solution and other feature extraction techniques in Spark is here: Spark MLlib Feature Extraction
Hope it helps. Cheers. :)
Answer 4:
I've written a C library that does this: https://github.com/adriaan-pelzer/corpusToNgrams
Essentially, the most performant way I could think of doing it was this:
- Parse through the corpus, character by character.
- If you encounter a word end character (space, comma, full stop, etc), add the word to a 1-gram array.
- Repeat for n = 2 to maxN:
- If the length of the 1-gram array is at least n, concatenate the last n words from the 1-gram array and add the result to the n-gram array.
The above can be implemented with a recursive function, and it requires only a single pass over the corpus.
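The linked C library is the real implementation; below is only a rough Python sketch of the same single-pass idea, yielding the n-grams instead of storing them in arrays:
def ngrams_single_pass(chars, max_n=5, separators=' ,.\n'):
    # walk the corpus character by character, build up words, and emit every
    # n-gram (n = 1..max_n) that ends at the word just completed
    words = []          # in practice you only need the last max_n words
    current = []
    for ch in chars:
        if ch in separators:
            if current:
                words.append(''.join(current))
                current = []
                for n in range(1, max_n + 1):
                    if len(words) >= n:
                        yield ' '.join(words[-n:])
        else:
            current.append(ch)

for gram in ngrams_single_pass("Spark is a framework for writing fast programs."):
    print(gram)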
Answer 5:
Have you written any code? Try googling N-gram generation
or looking here: Computing N Grams using Python
Looks like you want to generate 1-grams (a list of the words), up through 5-grams.
I would generate each size in one pass, then move on to the (n+1)-grams.
Source: https://stackoverflow.com/questions/30676448/extracting-n-grams-from-huge-text