How to read a corpus of parsed sentences using NLTK in Python?

Submitted by 旧街凉风 on 2019-12-23 01:45:32

Question


I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/LDC2000T43).

I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences. I'm trying to get it to work with a simple example of just 1 file. Here is my code...

from nltk.corpus.reader import SyntaxCorpusReader

path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader(path, filename)

I am able to see the raw text from the file. It returns a string of the parsed sentences.

reader.raw()
u"(S1 (S (PP-LOC (IN In)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n\t  (S (NP-SBJ (-NONE- *T*-0))\n\t   (VP (MD would)\n\t    (VP (VB represent)\n\t     (NP (NP (DT a) (JJ major) (NN break))\n\t      (PP (IN with) (NP (NN tradition))))\n\t     (PP-LOC (IN in)\n\t      (NP#1004 (DT the) (JJ legal) (NN profession)))))))))\n     (, ,)\n     (NP-SBJ#1005 (NP (NN law) (NNS firms))\n      (PP-LOC (IN in) (NP#1006 (DT this) (NN city))))\n     (VP (MD may)\n      (VP (VB become)\n       (NP (NP (DT the) (JJ first))\n\t(PP-LOC (IN in) (NP (DT the) (NN nation)))\n\t(SBAR (WHNP#1 (-NONE- 0))\n\t (S (NP-SBJ (-NONE- *T*-1))\n\t  (VP (TO to)\n\t   (VP (VB reward)\n\t    (NP#1009 (NNS non-lawyers))\n\t    (PP-MNR-CLR (IN with)\n\t     (NP#1010 (NP (DT the) (VBN cherished) (NN title))\n\t      (PP (IN of) (NP (NN partner))))))))))))\n     (. .)))\n...'

But when I try to get the parsed sentences, I receive an error.

reader.parsed_sents()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/compat.py", line 487, in wrapper
    return method(self).encode('ascii', 'backslashreplace')
  File "/usr/lib/python2.7/dist-packages/nltk/util.py", line 664, in __repr__
    for elt in self:
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 430, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 378, in _read_block
    raise NotImplementedError()
NotImplementedError

I'm not sure what the issue is. My goal is to read in the parsed sentences and use NLTK's Tree class to extract the text of each sentence, and perhaps navigate the tree structure.


Answer 1:


Hah, had me going for a while there. That NotImplementedError is not a bug; it's NLTK's way of telling you that you're using an incomplete class. SyntaxCorpusReader is an abstract base class, intended as a basis for readers of corpora with a specific, complex syntax. It leaves _read_block (the method at the bottom of your traceback) for subclasses to implement. Your file is in plain bracketed format, so you just need to use BracketParseCorpusReader instead:

from nltk.corpus.reader import BracketParseCorpusReader

reader = BracketParseCorpusReader('/corpus/wsj','wsj1')
print(reader.parsed_sents()[0])
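
Since the stated goal was to extract sentence text and navigate the trees, here is a minimal sketch of how that might look once the reader works. It does not use the BLLIP data: the SAMPLE string, the file name sample.mrg, and the helpers load_trees and noun_phrases are all made up for illustration, and the toy sentence is written to a temporary directory so the snippet runs on its own. It assumes NLTK 3 on Python 3.

```python
import os
import tempfile

from nltk.corpus.reader import BracketParseCorpusReader

# Hypothetical one-sentence corpus in the same bracketed style as the question.
SAMPLE = """(S1 (S (NP (DT the) (NN firm))
     (VP (VBD hired) (NP (NNS partners)))
     (. .)))
"""


def load_trees(root, fileids):
    """Read every parsed sentence under `root` into a list of nltk Tree objects."""
    reader = BracketParseCorpusReader(root, fileids)
    # parsed_sents() is a lazy view; list() forces the read while the files exist.
    return list(reader.parsed_sents())


def noun_phrases(tree):
    """Return the text of each subtree whose label starts with 'NP'.

    startswith() is used so that decorated labels such as 'NP-SBJ#1005'
    in the real corpus would also match.
    """
    is_np = lambda t: t.label().startswith("NP")
    return [" ".join(st.leaves()) for st in tree.subtrees(is_np)]


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "sample.mrg"), "w", encoding="utf8") as f:
        f.write(SAMPLE)
    trees = load_trees(tmp, ["sample.mrg"])

first = trees[0]
sentence = " ".join(first.leaves())  # plain text of the sentence
phrases = noun_phrases(first)        # text of each NP subtree

print(first.label())
print(sentence)
print(phrases)
```

For the real corpus, the same two helpers should apply with root '/corpus/wsj' and file 'wsj1' from the question, though the trace tokens (such as *T*-0 under -NONE-) will show up among the leaves and may need filtering.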


Source: https://stackoverflow.com/questions/30600975/how-to-read-corpus-of-parsed-sentences-using-nltk-in-python
