Is there a way to get the “original” text data for OpenNLP?

霸气de小男生 submitted on 2019-12-03 21:32:26
schrieveslaach

Unfortunately, you can't. See this question, which has a detailed answer to the same problem.

I think that is a tough problem, because text data often comes with licensing issues. For example, you cannot build a corpus from Twitter data and publish it to the community (see this paper for more information).

As a result, companies often build domain-specific corpora and use them internally. We did the same in our research project, and we built a tool (Quick Pad Tagger) to create annotated corpora efficiently (see here).

OK, I think this needs a separate answer. I found the Yago database: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago//

This database looks fantastic at first glance. You can download all the tagged data and load it into a database (they already provide the tools for that).

The next step is to "refactor" the tagged entities so that OpenNLP can use them (OpenNLP expects something like this: <START:person> Pierre Vinken <END>).

Then you create some text files and train a model on them with the training tool that ships with OpenNLP.
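
The "refactor" and "create some text files" steps could look roughly like the sketch below. It is only an illustration: it assumes the tagged data has already been exported as tokenized sentences plus non-overlapping entity spans given as (start token, end token exclusive, type). That input shape and the function names `to_opennlp_line` and `write_training_file` are made up here and are not part of Yago or OpenNLP.

```python
from typing import Iterable, List, Tuple

# Hypothetical input shape: (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def to_opennlp_line(tokens: List[str], spans: Iterable[Span]) -> str:
    """Render one tokenized sentence with <START:type> ... <END> markers.

    Assumes the spans do not overlap.
    """
    starts = {}
    ends = set()
    for start, end, entity_type in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: {(start, end, entity_type)}")
        starts[start] = entity_type
        ends.add(end)

    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")  # close a span that ended just before this token
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")  # close a span that runs to the end of the sentence
    return " ".join(out)


def write_training_file(path: str, sentences) -> None:
    """Write one annotated sentence per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens, spans in sentences:
            handle.write(to_opennlp_line(tokens, spans) + "\n")


tokens = ["Pierre", "Vinken", ",", "61", "years", "old", ",", "will", "join", "the", "board", "."]
print(to_opennlp_line(tokens, [(0, 2, "person")]))
# <START:person> Pierre Vinken <END> , 61 years old , will join the board .
```

As far as I understand the OpenNLP manual, the name finder trainer wants exactly this kind of file: one whitespace-tokenized sentence per line, with an empty line between documents. The sketch does not write those separator lines.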

I'm not 100% sure this works, but I will come back and tell you.
