Configure Tesseract with Solr 6.4.1

Submitted by ぃ、小莉子 on 2020-06-28 06:30:18

Question


How do I configure Tika OCR with Solr 6.4.1? I indexed documents including PDFs, images, and MS Office documents, but Tika did not extract text from the images, nor from images embedded in the PDF and MS Office documents. From my research, Tika OCR is what handles this. To that end I am installing tika-app-1.7.jar and Tesseract, but I don't know how to configure them with my Solr core.


Answer 1:


You don't need to do anything special. Simply get the Tesseract OCR setup for your distro and install it on the system. Make sure your PATH variable has an entry for the Tesseract home directory, and the TESSDATA_PREFIX variable is set and also points to the Tesseract home directory. Restart Solr and you're good to go. You should be able to see the OCR component when you push documents to the index through the /update/extract handler.
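As a rough way to confirm the two environment settings mentioned above before restarting Solr, something like the following Python sketch could be used. The function name `check_tesseract_env` is made up for illustration, and the checks are only an assumption about what "set up correctly" means here: the `tesseract` executable is findable on PATH, and TESSDATA_PREFIX names an existing directory. It does not talk to Solr or Tika at all.

```python
import os
import shutil


def check_tesseract_env(env=None):
    """Return a list of problems found; an empty list means the basics look fine.

    `env` is a mapping of environment variables (defaults to os.environ).
    This helper is hypothetical and only inspects PATH and TESSDATA_PREFIX.
    """
    env = os.environ if env is None else env
    problems = []

    # The tesseract executable has to be findable on PATH.
    if shutil.which("tesseract", path=env.get("PATH", "")) is None:
        problems.append("tesseract executable not found on PATH")

    # TESSDATA_PREFIX should be set and point to an existing directory.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append("TESSDATA_PREFIX does not point to a directory: " + prefix)

    return problems
```

Calling `check_tesseract_env()` with no arguments inspects the current environment. Keep in mind that Solr sees the environment of the user and shell that started it, so the check is most meaningful when run as that same user.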

By default, Tesseract ships with only the English model. Models for other languages are available as separate downloads.
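To see which language models an installation actually has, Tesseract provides a `--list-langs` option. The Python sketch below wraps it. The helper names `parse_langs` and `installed_langs` are illustrative, and the parsing assumes the output is a header line starting with "List of" followed by one language code per line, which may differ between Tesseract versions.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output (assumed format)."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_langs():
    """Run tesseract and return the language codes it reports as installed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams in case the list goes to stderr
        universal_newlines=True,
    )
    return parse_langs(result.stdout)
```

If `installed_langs()` returns only `eng` (possibly alongside `osd`), no additional language models have been installed yet.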



Source: https://stackoverflow.com/questions/43017921/configure-tesseract-with-solr-6-4-1
