Stanford POS Tagger: How to preserve newlines in the output?

不问归期 submitted on 2019-12-10 12:03:51

Question


My input.txt file contains the following sample text:

you have to let's
come and see me.

Now if I invoke the Stanford POS tagger with the default command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt

I get the following in my output.txt file:

you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.

The problem with the above output is that it has lost the newline that separated the two lines in the input file.

I can preserve the newline sentence delimiter in the output file with the following command, but it requires setting the -tokenize option to false:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt 

The problem with this command is that it messes up the tokenization in the output:

you_PRP have_VBP to_TO let's_NNS  
come_VB and_CC see_VB me._NN

Here let's and me. are each treated as a single token and tagged incorrectly.

My question is how can I preserve the newline delimiters in the output file without messing up the tokenization?


Answer 1:


The answer should have been to use the command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -textFile input.txt > output.txt 

However, a bug in version 3.1.3 (and perhaps all earlier versions) made the tagger ignore the newlines, so this command did not work. It will work in version 3.1.4 and later.

In the meantime, if the amount of text is small, you might try the Stanford Parser, where the corresponding flag is named differently: -sentences newline.
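For affected versions, another stopgap is to post-process the flat output yourself. The sketch below is not part of the answer above. It regroups the word_TAG tokens by matching their characters against each input line with the whitespace removed. The function name realign_tagged_lines is made up for illustration, and the approach assumes the tokenizer only splits the text and never rewrites it (a normalization such as turning ( into -LRB- would break the match and raise an error).

```python
def realign_tagged_lines(original_lines, tagged_text):
    """Regroup word_TAG tokens so each returned string matches one input line.

    Assumption: concatenating the words of the tokens reproduces the input
    line exactly once its whitespace is removed.
    """
    tokens = tagged_text.split()
    pos = 0
    result = []
    for line in original_lines:
        target = "".join(line.split())  # the line with all whitespace removed
        consumed = ""
        group = []
        while len(consumed) < len(target) and pos < len(tokens):
            # The tag follows the last underscore; everything before it is the word.
            word = tokens[pos].rsplit("_", 1)[0]
            consumed += word
            group.append(tokens[pos])
            pos += 1
        if consumed != target:
            raise ValueError(f"tokens do not line up with input line: {line!r}")
        result.append(" ".join(group))
    return result


if __name__ == "__main__":
    lines = ["you have to let's", "come and see me."]
    tagged = "you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._."
    for row in realign_tagged_lines(lines, tagged):
        print(row)
```

In practice, original_lines would come from input.txt and tagged_text from output.txt. The ValueError is deliberate: a mismatch means the assumption above does not hold for that text, and silently guessing the line boundaries would be worse than stopping.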




Answer 2:


One thing you can do is use XML input instead of plain text. Your input in that case will be:

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>you have to let's</line>
    <line>come and see me.</line>
</text>

Here each line is enclosed in a line tag. You can now issue the following command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -xmlInput line -textFile sample.xml > output.xml

Note that the -xmlInput argument specifies the XML tag whose content is POS-tagged. In our case, this tag is line. When you run the above command, the output will be:

<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>
        you_PRP have_VBP to_TO let_VB &apos;s_POS 
    </line>
    <line>
        come_VB and_CC see_VB me_PRP ._. 
    </line>
</text>

Thus you can recover your original lines by reading the content enclosed in each line tag.
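As a rough illustration of that round trip, here is a minimal Python sketch using only the standard library. The helper names build_input_xml and read_tagged_lines are invented for this example, and it assumes the tagger's output is well-formed XML shaped like the sample above.

```python
import xml.etree.ElementTree as ET


def build_input_xml(lines, tag="line"):
    """Wrap each input line in <tag>...</tag> under a <text> root element."""
    root = ET.Element("text")
    for line in lines:
        ET.SubElement(root, tag).text = line
    # ElementTree escapes &, < and > in the element text when serializing.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def read_tagged_lines(xml_bytes, tag="line"):
    """Return the tagged text of every <tag> element, in document order."""
    root = ET.fromstring(xml_bytes)
    # split() and join() collapse the indentation and newlines around the tokens.
    return [" ".join((el.text or "").split()) for el in root.iter(tag)]


if __name__ == "__main__":
    print(build_input_xml(["you have to let's", "come and see me."]))

    # Tagger output copied from the example above (assumed to be representative).
    sample_output = b"""<?xml version="1.0" encoding="UTF-8"?>
<text>
    <line>
        you_PRP have_VBP to_TO let_VB &apos;s_POS
    </line>
    <line>
        come_VB and_CC see_VB me_PRP ._.
    </line>
</text>"""
    for row in read_tagged_lines(sample_output):
        print(row)
```

To use it, write the string returned by build_input_xml to sample.xml as UTF-8, run the tagger command, then read the tagger's output file in binary mode and pass the bytes to read_tagged_lines. Building the input with an XML library rather than string concatenation matters when a line contains characters such as & or <, which would otherwise make the file invalid.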



Source: https://stackoverflow.com/questions/12140683/stanford-pos-tagger-how-to-preserve-newlines-in-the-output
