Question
I am trying to parse book-length blocks of text with StanfordNLP. The HTTP requests work fine, but there is a seemingly non-configurable 100,000-character limit on the text length, MAX_CHAR_LENGTH in StanfordCoreNLPServer.java.
For now, I am chopping up the text before I send it to the server, but even if I try to split between sentences and paragraphs, there is some useful coreference information that gets lost between these chunks. Presumably, I could parse chunks with large overlap and link them together, but that seems (1) inelegant and (2) like quite a bit of maintenance.
Is there a better way to configure the server or the requests to either remove the manual chunking or preserve the information across chunks?
BTW, I am POSTing using the Python requests module, but I doubt that makes a difference unless a CoreNLP Python wrapper deals with this problem somehow.
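As a sketch of the chunking workaround described above: split the text at paragraph boundaries so each chunk stays under the server's limit, then POST each chunk. The server URL, annotator list, and chunk-size constant below are illustrative assumptions, not anything prescribed by CoreNLP.

```python
import json

# Stay safely under the server's default 100,000-character limit
# (the exact margin here is an arbitrary choice).
MAX_BYTES = 90_000

def chunk_text(text, max_bytes=MAX_BYTES):
    """Split text into chunks of at most max_bytes UTF-8 bytes,
    breaking at blank-line paragraph boundaries.

    Note: a single paragraph larger than max_bytes is emitted as-is,
    since there is no paragraph boundary to break it at."""
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        para_bytes = len(para.encode("utf-8")) + 2  # +2 for the "\n\n" separator
        if current and size + para_bytes > max_bytes:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += para_bytes
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def annotate(chunk, url="http://localhost:9000"):
    """POST one chunk to a locally running CoreNLP server and return
    the parsed JSON response. The URL and annotators are assumptions."""
    import requests  # imported lazily so chunk_text works without it
    props = {"annotators": "tokenize,ssplit,pos,coref",
             "outputFormat": "json"}
    resp = requests.post(url,
                         params={"properties": json.dumps(props)},
                         data=chunk.encode("utf-8"))
    resp.raise_for_status()
    return resp.json()
```

Joining the returned chunks with "\n\n" reconstructs the original text, so offsets within each chunk can be mapped back if needed; the coreference links themselves, however, still do not cross chunk boundaries, which is exactly the limitation the question raises.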
Answer 1:
You should be able to start the server with the flag -maxCharLength -1, and that will get rid of the document length limit. Note that this is inadvisable in production: arbitrarily large documents can consume arbitrarily large amounts of memory (and time), especially with things like coref. The list of options to the server is accessible by calling the server with -help, and they are documented in the code here.
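Concretely, a launch with the limit disabled might look like the following; the classpath glob, heap size, and port are illustrative placeholders, not required values.

```shell
# Start the CoreNLP server with the character limit disabled
# (jar path and heap size below are illustrative placeholders)
java -mx8g -cp "stanford-corenlp-*.jar:*" \
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -port 9000 -maxCharLength -1

# Print the list of supported server options
java -cp "stanford-corenlp-*.jar:*" \
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer -help
```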
Source: https://stackoverflow.com/questions/46678204/how-to-work-around-100k-character-limit-for-the-stanfordnlp-server