Question
I'm fairly new to Python (2.7), so forgive me if this is a ridiculously straightforward question. I wish (i) to extract all the words ending in -ing from a text that has been tokenized with the NLTK library and (ii) to extract the 10 words preceding each word thus extracted. I then wish (iii) to save these to file as a dataframe of two columns that might look something like:
Word PreviousContext
starting stood a moment, as if in a troubled reverie; then
seeming of it retraced our steps. But Elijah passed on, without
purchasing a sharp look-out upon the hands: Bildad did all the
I know how to do (i), but am not sure how to go about doing (ii)-(iii). Any help would be greatly appreciated and acknowledged. So far I have:
>>> import bs4
>>> import nltk
>>> from nltk import word_tokenize
>>> url = "http://www.gutenberg.org/files/766/766-h/766-h.htm"
>>> import urllib
>>> response = urllib.urlopen(url)
>>> raw = response.read().decode('utf8')
>>> tokens = word_tokenize(raw)
>>> for w in tokens:
...     if w.endswith("ing"):
...         print(w)
...
padding
padding
encoding
having
heading
wearying
dismissing
going
nothing
reading etc etc etc..
Answer 1:
After the code line:
>>> tokens = word_tokenize(raw)
use the code below to collect each word together with its context:
>>> context = {}
>>> for i, w in enumerate(tokens):
...     if w.endswith("ing"):
...         # tokens[i:i+10] is the word itself plus the 9 tokens that follow it.
...         # A slice that runs past the end of the list just comes back shorter,
...         # so no try/except is needed for words near the end of the text.
...         context[w] = tokens[i:i+10]
...
>>> fp = open('dataframes', 'w')  # save results in this file
>>> fp.write('Word' + '\t\t' + 'PreviousContext\n')
>>> for word in context:
...     fp.write(word + '\t\t' + ' '.join(context[word]) + '\n')
...
>>> fp.close()
>>> fp = open('dataframes', 'r')
>>> for line in fp.readlines()[:10]:  # first 10 lines of generated file
...     print line
...
Word PreviousContext
raining raining , and I saw more fog and mud in
bidding bidding him good night , if he were yet sitting
growling growling old Scotch Croesus with great flaps of ears ?
bright-looking bright-looking bride , I believe ( as I could not
hanging hanging up in the shop&mdash ; went down to look
scheming scheming and devising opportunities of being alone with her .
muffling muffling her hands in it , in an unsettled and
bestowing bestowing them on Mrs. Gummidge. She was with him all
adorning adorning , the perfect simplicity of his manner , brought
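Three things to note:
- NLTK treats punctuation marks as separate tokens, so they count as words in the context.
- The slice tokens[i:i+10] starts at the matched word, so the context shown above is the word followed by the next nine tokens. To get the ten tokens before the word, slice tokens[max(0, i-10):i] instead.
- I've used a dictionary to store the words with their context, so the order of the words is not preserved, and a word that occurs more than once keeps only the context of its last occurrence.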
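For the preceding context the question actually asks for, here is a minimal sketch that slices backwards from each match and keeps every occurrence in a list instead of a dictionary. The function names ing_contexts and write_contexts, the width parameter, and the sample sentence are my own for illustration; they are not part of NLTK or of the code above.

```python
def ing_contexts(tokens, width=10):
    """Return (word, preceding context) pairs for every token ending in 'ing'."""
    rows = []
    for i, w in enumerate(tokens):
        if w.endswith("ing"):
            # max(0, ...) keeps the start index from going negative (and
            # wrapping around) when the word is among the first `width` tokens.
            start = max(0, i - width)
            rows.append((w, " ".join(tokens[start:i])))
    return rows


def write_contexts(rows, path):
    """Write the pairs to a tab-separated file with a header row."""
    with open(path, "w") as fp:
        fp.write("Word\tPreviousContext\n")
        for word, previous in rows:
            fp.write(word + "\t" + previous + "\n")


# Hypothetical sample; in the question this would be word_tokenize(raw).
sample = "we stood a moment , as if in a troubled reverie ; then starting".split()
rows = ing_contexts(sample)
print(rows)
```

Calling write_contexts(rows, "ing_contexts.tsv") would then save the two columns to a file. Because the rows live in a list, repeated words are all kept and stay in text order.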
Answer 2:
Say you have all your words in a list called words:
>>> words
['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing', 'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']
I would put them into a Series and grab the indices of the relevant words:
import pandas
words = pandas.Series(words)
idx = pandas.Series(words[words.apply(lambda x: x.endswith('ing'))].index)
>>> idx
0 9
1 20
dtype: int64
Now the values of idx are the indices of the words ending in 'ing' in our original Series. Next we need to turn these values into ranges:
starts = idx - 10
ends = idx
Now we can index into the original Series with these ranges (first, though, clip the starts to a lower bound of 0, in case an 'ing' word appears fewer than 10 words into the list):
starts = starts.clip(0)
df = pandas.DataFrame([{
    'word': words[e],
    'Previous': ' '.join(words[s:e])} for s, e in zip(starts, ends)])
>>> df
Previous word
0 abc def gdi asd ew d ew fdsa dsa aing
1 e f dsa fe dfa e d fe asd fe ting
Not exactly a one-liner, but it works. Note that 'aing' only has 9 words in the corresponding column because it appeared too early in the fake list I made.
Answer 3:
If you are asking how to do this algorithmically, to begin I would maintain a queue of the previous 10 words at all times and a dataframe where the first column is words ending in 'ing' and the second column is queues of the 10 words preceding the corresponding word (in the first column).
So at the start of your program the queue is empty, and for the first 10 words you simply enqueue each word. After that, each time before moving forward in your loop, enqueue your current word and dequeue the oldest one, so the queue never holds more than 10 words.
This way, at each iteration you check if the word ends with 'ing'. If it does, add a row to your dataframe where the word is the first item and the second item is the current state of the queue.
In the end, you should have a dataframe with a first column of words which end in 'ing' and its corresponding second column being the 10 words preceding it.
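The queue idea above can be sketched with collections.deque, whose maxlen argument drops the oldest item automatically once the queue is full. The function name ing_with_queue and the width parameter are made up for this example, and the rows are plain dicts rather than a dataframe so the sketch has no dependencies.

```python
from collections import deque


def ing_with_queue(tokens, width=10):
    """Collect each 'ing' word with the tokens currently held in the queue."""
    window = deque(maxlen=width)  # holds at most `width` previous tokens
    rows = []
    for w in tokens:
        # Check before enqueuing, so the queue holds only the words that
        # come before the current one.
        if w.endswith("ing"):
            rows.append({"Word": w, "PreviousContext": " ".join(window)})
        window.append(w)  # the oldest token falls out once the queue is full
    return rows


# The fake word list from Answer 2.
words = ['abc', 'def', 'gdi', 'asd', 'ew', 'd', 'ew', 'fdsa', 'dsa', 'aing',
         'e', 'f', 'dsa', 'fe', 'dfa', 'e', 'd', 'fe', 'asd', 'fe', 'ting']
rows = ing_with_queue(words)
for row in rows:
    print(row["Word"] + "\t" + row["PreviousContext"])
```

This makes a single pass over the tokens and never needs index arithmetic. If you want a dataframe at the end, passing the list of dicts to pandas.DataFrame(rows) should give the two columns.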
Source: https://stackoverflow.com/questions/26936345/extracting-a-word-and-its-prior-10-word-context-to-a-dataframe-in-python