OpenNLP POSTagger output from command line

Posted by 半城伤御伤魂 on 2019-12-02 13:18:19

Question


I want to use OpenNLP to tokenize Thai words. I downloaded OpenNLP and the Thai tokenizer model and ran the following:

    ./bin/opennlp POSTagger -lang th -model thai.tok.bin < sentence.txt > output.txt

I put the downloaded thai.tok.bin in the directory I run the command from. sentence.txt contains the text กินอะไรยังนาย. However, the only output I got was this:

    Usage: opennlp POSTagger model < sentences
    Execution time: 0.000 seconds

I'm pretty new to OpenNLP. Please let me know if anyone knows how to get output from it.


Answer 1:


The models from your link are outdated. A few manual steps are needed to convert them first.

  1. Download the file thai.tok.bin.gz and extract it into an empty folder. Rename the extracted file thai.tok.bin to token.model.
  2. In the same folder, create a file named manifest.properties with the following contents:

    Manifest-Version=1.0
    Language=th
    OpenNLP-Version=1.5.0
    Component-Name=TokenizerME
    useAlphaNumericOptimization=false
    
  3. Now zip the two files. On Linux you can use this command: zip thai.tok.bin token.model manifest.properties

  4. Try your model:

    sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt

    Loading Tokenizer model ... done (0,097s)
    กินอะไร ยังนาย

    Average: 333,3 sent/s
    Total: 1 sent
    Runtime: 0.003s
    Execution time: 0,108 seconds
    

Now that you have the updated tokenizer, you can do the same with the POS tagger model.

  1. Download the file thai.tag.bin.gz and extract it into an empty folder. Rename the extracted file thai.tag.bin to pos.model.

  2. In the same folder, create a file named manifest.properties with the following contents:

    Manifest-Version=1.0
    Language=th
    OpenNLP-Version=1.5.0
    Component-Name=POSTaggerME
    
  3. Now zip the two files. On Linux you can use this command: zip thai.pos.bin pos.model manifest.properties
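
Both conversions follow the same pattern, so they can be scripted. The following is a minimal sketch, not part of OpenNLP: the function names `write_manifest` and `package_model` and their argument order are made up for illustration, and it assumes `zip` and `mktemp` are installed and that the model file has already been extracted and renamed.

```shell
#!/bin/sh
# Sketch only: helper names and argument order are illustrative assumptions.

# write_manifest DIR COMPONENT [EXTRA_LINE]
# Writes DIR/manifest.properties with the fields listed in the steps above.
write_manifest() {
    dir=$1
    component=$2
    extra=${3:-}
    {
        printf '%s\n' 'Manifest-Version=1.0'
        printf '%s\n' 'Language=th'
        printf '%s\n' 'OpenNLP-Version=1.5.0'
        printf 'Component-Name=%s\n' "$component"
        # Optional extra property, e.g. the tokenizer's
        # useAlphaNumericOptimization=false line.
        if [ -n "$extra" ]; then
            printf '%s\n' "$extra"
        fi
    } > "$dir/manifest.properties"
}

# package_model RAW_MODEL ENTRY_NAME COMPONENT OUT_ARCHIVE [EXTRA_LINE]
# Copies the extracted model to ENTRY_NAME, adds the manifest, and zips both.
package_model() {
    raw_model=$1
    entry_name=$2
    component=$3
    out=$4
    extra=${5:-}
    # Make the output path absolute, because zip runs inside a temp directory.
    case $out in
        /*) ;;
        *) out=$PWD/$out ;;
    esac
    workdir=$(mktemp -d) || return 1
    cp "$raw_model" "$workdir/$entry_name" || return 1
    write_manifest "$workdir" "$component" "$extra" || return 1
    (cd "$workdir" && zip -q "$out" "$entry_name" manifest.properties) || return 1
    rm -rf "$workdir"
}

# Example calls, using the file names from the steps above:
#   gunzip -c thai.tok.bin.gz > token.model
#   package_model token.model token.model TokenizerME thai.tok.bin \
#       'useAlphaNumericOptimization=false'
#   gunzip -c thai.tag.bin.gz > pos.model
#   package_model pos.model pos.model POSTaggerME thai.pos.bin
```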

Finally, we can try the two models combined:

    sh bin/opennlp TokenizerME ~/Downloads/thai-token.bin/thai.tok.bin < thai_sentence.txt > thai_tokens.txt
    sh bin/opennlp POSTagger ~/Downloads/pt-pos-maxent/thai.pos.bin < thai_tokens.txt

The result is:

    กินอะไร_VACT ยังนาย_NCMN
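
If you want to process this output further, each `token_TAG` pair can be split onto its own line. This is only a sketch: `split_tags` is a hypothetical helper name, and it assumes the tagger separates pairs with spaces and joins token and tag with the last underscore, as in the result shown here.

```shell
#!/bin/sh
# split_tags: read "token_TAG token_TAG ..." lines on stdin and
# print one "token<TAB>TAG" pair per line.
split_tags() {
    tab=$(printf '\t')
    # Put each pair on its own line, drop empty lines, then replace
    # the last underscore of every pair with a tab.
    tr ' ' '\n' |
        sed -e '/^$/d' -e "s/_\([^_]*\)\$/${tab}\1/"
}

# Example call, reusing the tagger command from above:
#   sh bin/opennlp POSTagger ~/Downloads/pt-pos-maxent/thai.pos.bin < thai_tokens.txt | split_tags
```

Splitting on the last underscore keeps tokens that themselves contain an underscore intact.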

Please let me know if this is the expected result.



Source: https://stackoverflow.com/questions/43685885/opennlp-postagger-output-from-command-line
