Tesseract 3.x multiprocessing weird behaviour

Submitted by 浪尽此生 on 2019-12-04 12:28:56

Question


I am not sure whether it is my infrastructure that causes this weird behaviour or tesseract-ocr itself.

Whenever I use image_to_string in a single-process environment, tesseract-ocr works fine. But when I spawn multiple workers with gunicorn and all of them get OCR work to do, tesseract-ocr starts reading very poorly (in terms of accuracy, not performance). Even after the load is over, tesseract never gets back to the same accuracy. I need to restart all the workers to get tesseract working well again.

This is super weird. Has anyone experienced or heard of this issue?


Answer 1:


(NOTE: the information below is based on a review of the pytesseract.py code. I have not set up a multi-process test to confirm it.)

There are several Python libraries that interface with tesseract-ocr. You are probably using pytesseract (guessing by the image_to_string function).

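This library calls the tesseract-ocr binary as a subprocess and uses temporary files to exchange data with it. It relies on the deprecated tempfile.mktemp(), which does not guarantee unique file names: it only returns a name that did not exist at the moment of the call and does not create the file. Further, the library does not even use the returned file name as-is, so a second call to tempfile.mktemp() can easily return the same file name, and two workers could end up reading and writing the same temporary files.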
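For comparison, here is a minimal sketch of how a temporary file name can be reserved safely with tempfile.mkstemp(), which creates the file atomically instead of only proposing a name. It is not taken from pytesseract; the helper name make_unique_temp_path is hypothetical and only illustrates the difference.

```python
import os
import tempfile


def make_unique_temp_path(suffix=".txt"):
    """Hypothetical helper: reserve a unique temp file and return its path.

    tempfile.mkstemp() creates the file atomically, so no other process
    can be handed the same name while the file exists.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # keep the file on disk, release the descriptor
    return path


# Reserve several names in a row; each one already exists on disk,
# so none of them can be returned twice.
paths = [make_unique_temp_path() for _ in range(5)]
all_unique = len(set(paths)) == len(paths)

# Clean up the files created for the demonstration.
for p in paths:
    os.remove(p)
```

With tempfile.mktemp() the file is never created by the call itself, which is what leaves room for two processes to pick the same name.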

Consider using a different Python interface library for tesseract, e.g. pip install tesseract-ocr, or python-tesseract hosted on Google Code (https://code.google.com/archive/p/python-tesseract/).

If the problem is actually with the temp files, as I suspect, you may be able to work around it by setting a different temp directory for each of your spawned worker processes:

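import shutil
import tempfile

td = tempfile.mkdtemp()  # private temp directory for this worker
tempfile.tempdir = td    # tempfile functions now default to this directory
try:
    # your code calling pytesseract.image_to_string() or similar
    pass
finally:
    tempfile.tempdir = None  # restore the default temp directory lookup
    shutil.rmtree(td, ignore_errors=True)  # os.rmdir() fails if files are left behind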
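Since the workers in the question are spawned by gunicorn, one possible place to apply this workaround is gunicorn's post_fork and worker_exit server hooks in the config file. The following is only a sketch under that assumption and has not been tested against the problem described; the directory prefix and the _worker_tempdir variable are made up for illustration.

```python
# gunicorn.conf.py (sketch)
import os
import shutil
import tempfile

# Hypothetical module-level variable remembering this worker's directory.
_worker_tempdir = None


def post_fork(server, worker):
    """Gunicorn hook: runs inside each worker right after it is forked."""
    global _worker_tempdir
    _worker_tempdir = tempfile.mkdtemp(prefix="ocr-worker-%d-" % os.getpid())
    tempfile.tempdir = _worker_tempdir  # tempfile now defaults to this directory


def worker_exit(server, worker):
    """Gunicorn hook: runs inside the worker as it exits."""
    global _worker_tempdir
    if _worker_tempdir is not None:
        tempfile.tempdir = None  # restore the default lookup
        shutil.rmtree(_worker_tempdir, ignore_errors=True)
        _worker_tempdir = None
```

The directory is tracked in its own variable rather than read back from tempfile.tempdir, so the cleanup can only ever delete a directory that this hook created.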


Source: https://stackoverflow.com/questions/52046331/tesseract-3-x-multiprocessing-weird-behaviour
