Python: encounter problems in sentence segmenter, word tokenizer, and part-of-speech tagger

Submitted by 两盒软妹~` on 2020-01-03 04:45:16

Question


I am trying to read a text file into Python and then run sentence segmentation, word tokenization, and part-of-speech tagging.

This is my code:

import nltk

file = open('C:/temp/1.txt', 'r')
sentences = nltk.sent_tokenize(file)
sentences = [nltk.word_tokenize(sent) for sent in sentences]
sentences = [nltk.pos_tag(sent) for sent in sentences]

When I try just the second command, it displays this error:

Traceback (most recent call last):
File "<pyshell#26>", line 1, in <module>
sentences = nltk.sent_tokenize(file)
File "D:\Python\lib\site-packages\nltk\tokenize\__init__.py", line 76, in sent_tokenize
return tokenizer.tokenize(text)
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1217, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1262, in sentences_from_text
sents = [text[sl] for sl in self._slices_from_text(text)]
File "D:\Python\lib\site-packages\nltk\tokenize\punkt.py", line 1269, in _slices_from_text
for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Another try: when I use just one sentence, such as "A yellow dog barked at the cat", the first three commands work, but the last line gives this error. (I wonder if I didn't download the packages completely?)

Traceback (most recent call last):
File "<pyshell#16>", line 1, in <module>
sentences = [nltk.pos_tag(sent) for sent in sentences]
File "D:\Python\lib\site-packages\nltk\tag\__init__.py", line 99, in pos_tag
tagger = load(_POS_TAGGER)
File "D:\Python\lib\site-packages\nltk\data.py", line 605, in load
resource_val = pickle.load(_open(resource_url))
ImportError: No module named numpy.core.multiarray

Answer 1:


Um... are you sure the error is in the second line?

You appear to be using single-quote and comma characters other than the standard ASCII ' and , characters:

file=open(‘C:/temp/1.txt’,‘r’) # your version (WRONG)
file=open('C:/temp/1.txt', 'r') # right

Python shouldn't even be able to compile this. Indeed, when I try it, it fails with a syntax error.

UPDATE: You posted a corrected version with proper syntax. The error message from the traceback is a pretty straightforward one: the function you're calling seems to expect a chunk of text as its parameter, rather than a file object. Although I don't know anything about NLTK specifically, spending five seconds on Google confirms this.
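To see the mismatch concretely without involving NLTK at all: `open()` returns a file object, while `.read()` on that object returns the plain string that `sent_tokenize` expects. A minimal stdlib-only sketch (the temp-file path here is hypothetical, created just for the demo):

```python
# open() yields a file object; .read() yields the string the tokenizer wants.
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "nltk_demo.txt")
with open(path, "w") as f:
    f.write("A yellow dog barked at the cat.")

infile = open(path, "r")
print(isinstance(infile, str))   # False: a file object, not text
text = infile.read()
infile.close()
print(isinstance(text, str))     # True: this is what sent_tokenize expects
```

Passing `infile` instead of `text` is exactly what triggers the `TypeError: expected string or buffer` in the traceback above.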

Try something like this:

import nltk

file = open('C:/temp/1.txt', 'r')
text = file.read()  # read the contents of the text file into a variable
result1 = nltk.sent_tokenize(text)
result2 = [nltk.word_tokenize(sent) for sent in result1]
result3 = [nltk.pos_tag(sent) for sent in result2]

UPDATE: I renamed sentences to result1/2/3 because repeatedly overwriting the same variable obscured what the code was actually doing. This does not change the semantics; it just makes clear that the second line does affect the final result3.




Answer 2:


First open the file, then read it:

filename = 'C:/temp/1.txt'
infile = open(filename, 'r')
text = infile.read()

then chain the tools up in nltk like so (note the closing parenthesis after word_tokenize(i), which the original was missing):

from nltk import sent_tokenize, word_tokenize, pos_tag
tagged_words = [pos_tag(word_tokenize(i)) for i in sent_tokenize(text)]
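The one-liner chains three calls per sentence. Expanded with simplified stand-in functions (naive regex placeholders for NLTK's sent_tokenize, word_tokenize, and pos_tag, so the pipeline's shape can be shown without downloading NLTK models), the structure looks like this:

```python
import re

def sent_tokenize(text):
    # stand-in: naive split after terminal punctuation
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def word_tokenize(sentence):
    # stand-in: split into words and punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag(tokens):
    # stand-in: tags every token 'UNK'; real pos_tag assigns Penn Treebank tags
    return [(tok, 'UNK') for tok in tokens]

text = "A yellow dog barked. The cat ran away!"
tagged_words = [pos_tag(word_tokenize(i)) for i in sent_tokenize(text)]
# one list of (token, tag) pairs per sentence
```

The real NLTK functions have the same call shape, so swapping the imports back in (`from nltk import sent_tokenize, word_tokenize, pos_tag`) leaves the comprehension unchanged.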


Source: https://stackoverflow.com/questions/24273662/python-encounter-problems-in-sentence-segmenter-word-tokenizer-and-part-of-sp
