OpenNLP: Unable to locate the model file for Lemmatizer

杀马特。学长 韩版系。学妹 submitted on 2020-12-12 18:18:46

Question


Summary: Unable to find the model file used for Lemmatizer (english-lemmatizer.bin)

Details: OpenNLP Tools Models appears to be a comprehensive repository for the various models used by the different components of the Apache OpenNLP library. However, I am unable to find the model file en-lemmatizer.bin, which is used with the lemmatizer. The Apache OpenNLP Developer Manual provides the following code snippet for the Lemmatization step:

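// Requires: import java.io.FileInputStream; import java.io.InputStream;
// try-with-resources must declare the resource in its header so the stream is closed automatically
try (InputStream dictLemmatizer = new FileInputStream("english-lemmatizer.bin")) {
    // use dictLemmatizer here
}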

However, unlike other model files, I am just not able to find the location of this model file. Any pointers would be appreciated.


Answer 1:


The book "Natural Language Processing with Java Cookbook" by Richard M. Reese provides a good answer. For some reason, en-lemmatizer.bin is not available for direct download from the web, but it can be created with the following steps:

  1. Download and untar apache-opennlp-1.9.0-bin.tar (https://opennlp.apache.org/download.html)

  2. Go to the URL for the Lemmatizer Training File and save the text content as en-lemmatizer.dict

  3. Go to the bin directory (from step 1, after untarring) and execute the following command:

opennlp LemmatizerTrainerME -model en-lemmatizer.bin -lang en -data /path/to/en-lemmatizer.dict -encoding UTF-8
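Before running the trainer, it can help to sanity-check the file saved in step 2, since copying text out of a browser sometimes mangles tabs. The sketch below assumes the training file uses three tab-separated columns per line (word, POS tag, lemma) with blank lines between sentences; the class name LemmaDictCheck and the default path are hypothetical and not part of OpenNLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LemmaDictCheck {

    // Counts non-blank lines that do not have exactly three tab-separated fields.
    // Assumption: each training line is word<TAB>postag<TAB>lemma.
    static long countMalformed(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.trim().isEmpty())
                .filter(line -> line.split("\t", -1).length != 3)
                .count();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real location as the first argument.
        String path = args.length > 0 ? args[0] : "en-lemmatizer.dict";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        System.out.println(countMalformed(lines) + " malformed line(s) out of " + lines.size());
    }
}
```

If the count is not zero, the file was probably saved with spaces instead of tabs or in a different encoding, and re-downloading the raw text is likely easier than repairing it by hand.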


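Note: Training may fail with the following error if the JVM heap is too small; the usual remedy is to raise the maximum heap size (the JVM's -Xmx option):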

Computing event counts... Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
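One possible way around the heap error is to skip the opennlp launcher script and call the CLI main class (opennlp.tools.cmdline.CLI) directly with an explicit -Xmx value. The sketch below only builds and prints such a command. The 4g heap size, the lib/* classpath (assuming the jars sit under lib/ in the untarred distribution), and the class name TrainWithBiggerHeap are assumptions for illustration, not something the answer specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TrainWithBiggerHeap {

    // Builds a java command line that runs the LemmatizerTrainerME tool
    // with an explicit maximum heap size instead of the launcher's default.
    static List<String> buildCommand(String maxHeap, String classpath, String dictPath) {
        return new ArrayList<>(Arrays.asList(
                "java", "-Xmx" + maxHeap, "-cp", classpath,
                "opennlp.tools.cmdline.CLI", "LemmatizerTrainerME",
                "-model", "en-lemmatizer.bin", "-lang", "en",
                "-data", dictPath, "-encoding", "UTF-8"));
    }

    public static void main(String[] args) {
        // Assumed values: 4g heap, jars under lib/ of the unpacked distribution.
        List<String> cmd = buildCommand("4g", "lib/*", "/path/to/en-lemmatizer.dict");
        System.out.println(String.join(" ", cmd));
        // To actually launch it from Java:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The printed command can also be pasted into a shell from the distribution's root directory; how much heap is enough depends on the size of the training file.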




Answer 2:


You want en-lemmatizer.bin and not english-lemmatizer.txt



Source: https://stackoverflow.com/questions/55391121/opennlp-unable-to-locate-the-model-file-for-lemmatizer
