问题
while learning Gate, I encountered the following problem:
Minipar throws exception when it sees uncommen characters like Ö, Ü, Ä.
For example in the sentence "Batten disease (also known as Spielmeyer-Vogt-Sjögren-Batten disease ) is a rare, fatal autosomal recessive neurodegenerative disorder that begins in childhood." (from a wiki article) The annotation Minipar got before it stopped working is "Batten disease (also known as Spielmeyer-Vogt-Sj" which is exactly before the character ö, so this makes me guessing that this is a case worth attention while using Gate. Because the same pipeline processed several other articles like a breeze.
In Messages Tab, it reprots:
gate.util.InvalidOffsetException
at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
at minipar.Minipar.runMinipar(Minipar.java:419)
at minipar.Minipar.execute(Minipar.java:527)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
at gate.creole.SerialController.executeImpl(SerialController.java:153)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
at gate.creole.AbstractController.execute(AbstractController.java:75)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
at java.lang.Thread.run(Unknown Source)
gate.creole.ExecutionException: gate.util.InvalidOffsetException
at minipar.Minipar.runMinipar(Minipar.java:491)
at minipar.Minipar.execute(Minipar.java:527)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
at gate.creole.SerialController.executeImpl(SerialController.java:153)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
at gate.creole.AbstractController.execute(AbstractController.java:75)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
at java.lang.Thread.run(Unknown Source)
Caused by: gate.util.InvalidOffsetException
at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
at minipar.Minipar.runMinipar(Minipar.java:419)
... 9 more
gate.creole.ExecutionException: Document doesn't have sentence annotations. please run tokenizer, sentence splitter and then Minipar
at minipar.Minipar.saveGateSentences(Minipar.java:194)
at minipar.Minipar.execute(Minipar.java:525)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
at gate.creole.SerialController.executeImpl(SerialController.java:153)
at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
at gate.creole.AbstractController.execute(AbstractController.java:75)
at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
at java.lang.Thread.run(Unknown Source)
I'd to thank Ian for his warm support once again.
Matt
回答1:
This appears to be an encoding-related issue of some sort, but unfortunately I can't do any debugging myself as the minipar parser binary no longer appears to be available from the usual download page - I get a small (less than 2kB) greyscale JPEG image instead of a multi-MB .tgz.
There's a few things you could try off the top of my head. The GATE Minipar wrapper writes the input file for the parser and reads the parser's output using whatever is the default encoding on the system where you're running. My speculation is that the parser is producing its output in a different encoding (possibly related to the encoding of the original training data?).
The GATE wrapper writes its input to a temporary file which you should be able to find in your temporary directory as long as you leave GATE Developer running in the background (the temp files are deleted when Developer exits). I would try running minipar-windows.exe on that file from the command line and seeing what the output looks like
C:\path\to\minipar-windows.exe -p C:\path\to\minipar\data -file GATESentencesNNNNNN.txt
The output may give you a clue as to what's failing. If it looks right and you can determine the encoding it's trying to use you could set your GATE Developer to use that as its default encoding (if you're using gate.exe
to start it then you do this by adding a line -Dfile.encoding=ISO-8859-1
or whatever to gate.l4j.ini
) and see if that helps. If so we can consider adding a parameter to the PR to specify the encoding to use when exchanging data with the parser executable.
来源:https://stackoverflow.com/questions/15598773/gate-how-to-let-minipar-play-with-special-characters-like-%c3%96-%c3%9c-%c3%84