Extracting a word and its prior 10 word context to a dataframe in Python

Question

I'm fairly new to Python (2.7), so forgive me if this is a ridiculously straightforward question. I wish (i) to extract all the words ending in -ing from a text that has been tokenized with the NLTK library and (ii) to extract the 10 words preceding each word thus extracted. I then wish (iii) to save these to file as a dataframe of two columns that might look something like:

Word        PreviousContext 
starting    stood a moment, as if in a troubled reverie; then
seeming     of it retraced our steps. But Elijah passed on, without
purchasing  a sharp look-out upon the hands: Bildad did all the

I know how to do (i), but am not sure how to go about doing (ii)-(iii). Any help would be greatly appreciated and acknowledged. So far I have:

>>> import bs4 
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...             print(w)
... 
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc..

Answer 1:

After the code line:

>>> tokens = word_tokenize(raw)

use the code below to generate the words with their context:

>>> context = {}
>>> for i, w in enumerate(tokens):
...     if w.endswith("ing"):
...         # slicing past the end of a list never raises, so no try/except is needed
...         context[w] = tokens[i:i+10]
...
>>> fp=open('dataframes','w')   # save results in this file
>>> fp.write('Word'+'\t\t'+'PreviousContext\n')
>>> for word in context:
...    fp.write(word+'\t\t'+' '.join(context[word])+'\n')
... 
>>> fp.close()
>>> fp=open('dataframes','r')  
>>> for line in fp.readlines()[:10]: # first 10 lines of generated file
...    print line
... 
Word                PreviousContext
raining             raining , and I saw more fog and mud in
bidding             bidding him good night , if he were yet sitting
growling            growling old Scotch Croesus with great flaps of ears ?
bright-looking      bright-looking bride , I believe ( as I could not
hanging             hanging up in the shop&mdash ; went down to look
scheming            scheming and devising opportunities of being alone with her .
muffling            muffling her hands in it , in an unsettled and
bestowing           bestowing them on Mrs. Gummidge. She was with him all
adorning            adorning , the perfect simplicity of his manner , brought

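Three things to note:

nltk treats punctuation marks as separate tokens, so they count as words in the context.
I've used a dictionary to store words with their context, so the order of the words is not preserved, and a word that occurs more than once keeps only the context of its last occurrence.
The slice tokens[i:i+10] takes the word itself plus the nine tokens after it, which is what the output above shows; for the ten tokens before the word, use tokens[max(0, i-10):i] instead.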
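
A variant that collects the tokens before each match and keeps every occurrence in order might look like the following. This is a minimal sketch, not the answerer's code: `preceding_contexts` and `to_tsv` are hypothetical helper names, the short token list stands in for the output of `word_tokenize(raw)`, and the column names are taken from the question.

```python
def preceding_contexts(tokens, suffix="ing", width=10):
    """Return (word, context) pairs; context is up to `width` tokens before the word."""
    rows = []
    for i, w in enumerate(tokens):
        if w.endswith(suffix):
            # max(0, ...) keeps the slice start from going negative near the start of the text
            start = max(0, i - width)
            rows.append((w, " ".join(tokens[start:i])))
    return rows


def to_tsv(rows):
    """Render the pairs as tab-separated text with a header row."""
    lines = ["Word\tPreviousContext"]
    lines.extend(word + "\t" + context for word, context in rows)
    return "\n".join(lines) + "\n"


# Stand-in for tokens = word_tokenize(raw)
tokens = "It was raining and I kept walking".split()
rows = preceding_contexts(tokens, width=3)
print(to_tsv(rows))
```

A list of pairs is used instead of a dictionary so that repeated words each keep their own row. The string returned by `to_tsv` can be written to disk with an ordinary `open(..., 'w')` and `write` call.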

Answer 2:

Say you have all your words in a list of words:

>>> words
['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing', 'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']

I would put them into a series and grab indices of relevant words:

import pandas

words = pandas.Series(words)
idx = pandas.Series(words[words.apply(lambda x: x.endswith('ing'))].index)
>>> idx
0     9
1    20
dtype: int64

Now the values of idx are the indices of words ending in 'ing' in our original Series. Next we need to turn these values into ranges:

starts = idx - 10
ends = idx

Now we can index into the original series with these ranges (first, though, clip with a lower bound of 0, in case an 'ing' word appears fewer than 10 words into the list):

starts = starts.clip(0)
df = pandas.DataFrame([{
    'word': words[e],
    # ' '.join avoids string.join, which needs "import string" and exists only in Python 2
    'Previous': ' '.join(words[s:e])} for s, e in zip(starts, ends)])
>>> df
                           Previous  word
0  abc def gdi asd ew d ew fdsa dsa  aing
1      e f dsa fe dfa e d fe asd fe  ting

Not exactly a one liner, but it works.

Note that 'aing' has only 9 words in the corresponding column because it appeared too early in the fake list I made.
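
To cover the saving step from the question as well, here is a hedged sketch of the same idea in one piece. It is an illustration under assumptions, not part of the original answer: it uses `Series.str.endswith` and `.iloc` in place of `apply` and plain slicing, takes the column names from the question, and writes to an in-memory `io.StringIO` buffer where a real script would pass a file path to `to_csv`.

```python
import io

import pandas as pd

words = pd.Series(['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing',
                   'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting'])

# Index labels of the words ending in 'ing'; with the default index
# these labels are also the positions of the words.
ends = words[words.str.endswith('ing')].index

df = pd.DataFrame(
    {
        'Word': [words.iloc[e] for e in ends],
        # max(0, ...) plays the role of clip(0) for matches near the start
        'PreviousContext': [' '.join(words.iloc[max(0, e - 10):e]) for e in ends],
    },
    columns=['Word', 'PreviousContext'],
)

# Tab-separated output without the row index; replace buf with a path to write a file.
buf = io.StringIO()
df.to_csv(buf, sep='\t', index=False)
tsv = buf.getvalue()
print(tsv)
```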

Answer 3:

If you are asking how to do this algorithmically, to begin I would maintain a queue of the previous 10 words at all times and a dataframe where the first column is words ending in 'ing' and the second column is queues of the 10 words preceding the corresponding word (in the first column).

At the start of your program the queue is empty, and for the first 10 words you simply enqueue each word. After that, each time before moving forward in your loop, enqueue the current word and dequeue the oldest one, so the queue never holds more than 10 words.

This way, at each iteration you check if the word ends with 'ing'. If it does, add a row to your dataframe where the word is the first item and the second item is the current state of the queue.

In the end, you should have a dataframe whose first column holds the words that end in 'ing' and whose second column holds the 10 words preceding each of them.
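
The queue approach described in this answer could be sketched as follows. This is one possible reading, not the answerer's code: `ing_rows` is a hypothetical name, and `collections.deque` with `maxlen` is assumed as the queue because it drops the oldest item automatically once it is full.

```python
from collections import deque


def ing_rows(tokens, width=10):
    """Return (word, context) pairs using a sliding queue of the previous tokens."""
    window = deque(maxlen=width)  # holds at most the `width` most recent tokens
    rows = []
    for w in tokens:
        if w.endswith("ing"):
            # Snapshot the queue before the current word is added to it
            rows.append((w, " ".join(window)))
        window.append(w)  # once the queue is full, the oldest token is dropped
    return rows


rows = ing_rows("a b c doing e f going".split(), width=2)
print(rows)
```

The check happens before the current word is enqueued, so the queue holds only the words that came before it. The resulting list can then be passed to `pandas.DataFrame(rows, columns=['Word', 'PreviousContext'])` to get the two-column dataframe.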

Source: https://stackoverflow.com/questions/26936345/extracting-a-word-and-its-prior-10-word-context-to-a-dataframe-in-python

Tags

python

extract