(GATE) How to let Minipar play with special characters like Ö, Ü, Ä?

本秂侑毒 提交于 2019-12-12 03:47:57

问题


while learning Gate, I encountered the following problem:

Minipar throws exception when it sees uncommen characters like Ö, Ü, Ä.

For example in the sentence "Batten disease (also known as Spielmeyer-Vogt-Sjögren-Batten disease ) is a rare, fatal autosomal recessive neurodegenerative disorder that begins in childhood." (from a wiki article) The annotation Minipar got before it stopped working is "Batten disease (also known as Spielmeyer-Vogt-Sj" which is exactly before the character ö, so this makes me guessing that this is a case worth attention while using Gate. Because the same pipeline processed several other articles like a breeze.

In Messages Tab, it reprots:


gate.util.InvalidOffsetException
    at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
    at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
    at minipar.Minipar.runMinipar(Minipar.java:419)
    at minipar.Minipar.execute(Minipar.java:527)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
    at gate.creole.SerialController.executeImpl(SerialController.java:153)
    at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
    at gate.creole.AbstractController.execute(AbstractController.java:75)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
    at java.lang.Thread.run(Unknown Source)
gate.creole.ExecutionException: gate.util.InvalidOffsetException
    at minipar.Minipar.runMinipar(Minipar.java:491)
    at minipar.Minipar.execute(Minipar.java:527)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
    at gate.creole.SerialController.executeImpl(SerialController.java:153)
    at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
    at gate.creole.AbstractController.execute(AbstractController.java:75)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
    at java.lang.Thread.run(Unknown Source)
Caused by: gate.util.InvalidOffsetException
    at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
    at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
    at minipar.Minipar.runMinipar(Minipar.java:419)
    ... 9 more
gate.creole.ExecutionException: Document doesn't have sentence annotations. please run tokenizer, sentence splitter and then Minipar
    at minipar.Minipar.saveGateSentences(Minipar.java:194)
    at minipar.Minipar.execute(Minipar.java:525)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
    at gate.creole.SerialController.executeImpl(SerialController.java:153)
    at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
    at gate.creole.AbstractController.execute(AbstractController.java:75)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
    at java.lang.Thread.run(Unknown Source)

I'd to thank Ian for his warm support once again.

Matt


回答1:


This appears to be an encoding-related issue of some sort, but unfortunately I can't do any debugging myself as the minipar parser binary no longer appears to be available from the usual download page - I get a small (less than 2kB) greyscale JPEG image instead of a multi-MB .tgz.

There's a few things you could try off the top of my head. The GATE Minipar wrapper writes the input file for the parser and reads the parser's output using whatever is the default encoding on the system where you're running. My speculation is that the parser is producing its output in a different encoding (possibly related to the encoding of the original training data?).

The GATE wrapper writes its input to a temporary file which you should be able to find in your temporary directory as long as you leave GATE Developer running in the background (the temp files are deleted when Developer exits). I would try running minipar-windows.exe on that file from the command line and seeing what the output looks like

C:\path\to\minipar-windows.exe -p C:\path\to\minipar\data -file GATESentencesNNNNNN.txt

The output may give you a clue as to what's failing. If it looks right and you can determine the encoding it's trying to use you could set your GATE Developer to use that as its default encoding (if you're using gate.exe to start it then you do this by adding a line -Dfile.encoding=ISO-8859-1 or whatever to gate.l4j.ini) and see if that helps. If so we can consider adding a parameter to the PR to specify the encoding to use when exchanging data with the parser executable.



来源:https://stackoverflow.com/questions/15598773/gate-how-to-let-minipar-play-with-special-characters-like-%c3%96-%c3%9c-%c3%84

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!