How to read a corpus of parsed sentences using NLTK in Python?

Question

I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43).

I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences, starting with a simple example of just one file. Here is my code:

from nltk.corpus.reader import SyntaxCorpusReader

path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader(path, filename)

I am able to see the raw text from the file: reader.raw() returns the parsed sentences as a single string.

reader.raw()
u"(S1 (S (PP-LOC (IN In)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n\t  (S (NP-SBJ (-NONE- *T*-0))\n\t   (VP (MD would)\n\t    (VP (VB represent)\n\t     (NP (NP (DT a) (JJ major) (NN break))\n\t      (PP (IN with) (NP (NN tradition))))\n\t     (PP-LOC (IN in)\n\t      (NP#1004 (DT the) (JJ legal) (NN profession)))))))))\n     (, ,)\n     (NP-SBJ#1005 (NP (NN law) (NNS firms))\n      (PP-LOC (IN in) (NP#1006 (DT this) (NN city))))\n     (VP (MD may)\n      (VP (VB become)\n       (NP (NP (DT the) (JJ first))\n\t(PP-LOC (IN in) (NP (DT the) (NN nation)))\n\t(SBAR (WHNP#1 (-NONE- 0))\n\t (S (NP-SBJ (-NONE- *T*-1))\n\t  (VP (TO to)\n\t   (VP (VB reward)\n\t    (NP#1009 (NNS non-lawyers))\n\t    (PP-MNR-CLR (IN with)\n\t     (NP#1010 (NP (DT the) (VBN cherished) (NN title))\n\t      (PP (IN of) (NP (NN partner))))))))))))\n     (. .)))\n...'

But when I try to get the parsed sentences, I receive an error.

reader.parsed_sents()
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/compat.py", line 487, in wrapper
    return method(self).encode('ascii', 'backslashreplace')
  File "/usr/lib/python2.7/dist-packages/nltk/util.py", line 664, in __repr__
    for elt in self:
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 430, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 378, in _read_block
    raise NotImplementedError()
NotImplementedError

I'm not sure what the issue is. My goal is to read in the parsed sentences and use NLTK's Tree class to extract the text of the sentences, and perhaps navigate the tree structure.

Answer 1:

Hah, had me going for a while there. That NotImplementedError is not a bug; it's NLTK's way of telling you that you're using an incomplete class. SyntaxCorpusReader is an "abstract class", intended as a base for readers of corpora with their own specific syntax, and it leaves methods such as _read_block (the one raising in your traceback) for subclasses to implement. In your case, you just need to use BracketParseCorpusReader instead:

from nltk.corpus.reader import BracketParseCorpusReader

reader = BracketParseCorpusReader('/corpus/wsj', 'wsj1')
print(reader.parsed_sents()[0])
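
Each item returned by parsed_sents() is an nltk.tree.Tree, so the goal stated in the question (getting the sentence text and walking the structure) comes down to a few Tree methods. Here is a minimal sketch that builds a Tree directly with Tree.fromstring so it runs without the corpus files. The bracketed string is a shortened, hand-written sample in the corpus style, not the actual content of wsj1, and the variable names are only illustrative.

```python
from nltk.tree import Tree

# Shortened, hand-written sample in the same bracketed style as the corpus
# (an assumption for illustration, not copied verbatim from wsj1).
bracketed = (
    "(S1 (S (NP-SBJ (NN law) (NNS firms)) "
    "(VP (MD may) (VP (VB become) (NP (DT the) (JJ first)))) (. .)))"
)

tree = Tree.fromstring(bracketed)

# leaves() returns the words of the sentence in order.
sentence = " ".join(tree.leaves())

# subtrees() visits every node; the filter keeps labels starting with "NP".
np_phrases = [
    " ".join(sub.leaves())
    for sub in tree.subtrees(lambda t: t.label().startswith("NP"))
]

print(sentence)
print(np_phrases)
```

With the real corpus, the same calls should apply to each tree in reader.parsed_sents(). Note that leaves() will also include trace tokens such as *T*-0 that sit under -NONE- nodes, so you may want to filter those out when reconstructing plain text.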

Source: https://stackoverflow.com/questions/30600975/how-to-read-corpus-of-parsed-sentences-using-nltk-in-python

Tags

python

nltk

corpus