How to speed up tesseract OCR

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-08 01:42:45

Question


I'm trying to OCR a lot of documents (in the range of 300k+ a day). At the moment I'm using the Tesseract wrapper for .NET. The quality is fine, but the speed is not good enough. With 20 tasks running in parallel, each scanning half a page from the same PDF, the average time is 2.546 seconds per scan. The code I'm using:

using (var engine = new TesseractEngine(Tessdata, "eng", EngineMode.TesseractOnly))
{
    Page page;
    page = engine.Process(image, srcRect);
    var text = page.GetText();
    return Task.FromResult(text);
}

That average is measured after halving the image resolution and converting the image to grayscale. Any ideas on how to speed up the process? I don't need the text segmented, just the text in one line. Should I maybe use something like MATLAB for C#?


Answer 1:


Currently, you create a new TesseractEngine object for each page you scan. Creating the engine is costly because it reads the 'tessdata' files.

You say you have 20 parallel tasks running. Since an engine cannot process multiple pages at once, you will need to create one engine per task and reuse it for all the pages that task processes. You can simply call using (var page = Engine.Process(pix)) to process the next page with an existing engine. Dispose of each page before processing the next one.

Reusing the engine should significantly improve performance because you'll only have to create 20 engines instead of 300k.
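The one-engine-per-task approach described above could look roughly like the following. This is only a sketch under some assumptions: the class OcrWorkers, the method RecognizeAll, and the workerCount parameter are hypothetical names, the images are assumed to be already loaded as Pix objects, and the same srcRect is assumed to apply to every image.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Tesseract;

public static class OcrWorkers
{
    // Hypothetical helper: starts workerCount tasks, each owning exactly one engine.
    public static string[] RecognizeAll(
        IReadOnlyList<Pix> images, Rect srcRect, string tessdata, int workerCount = 20)
    {
        var results = new string[images.Count];

        // Shared work queue of image indexes; each index is taken by one worker only.
        var queue = new ConcurrentQueue<int>(Enumerable.Range(0, images.Count));

        var workers = Enumerable.Range(0, workerCount).Select(_ => Task.Run(() =>
        {
            // Created once per worker, so the tessdata files are read
            // workerCount times in total instead of once per image.
            using (var engine = new TesseractEngine(tessdata, "eng", EngineMode.TesseractOnly))
            {
                while (queue.TryDequeue(out var index))
                {
                    // The page is disposed before this engine processes the next image.
                    using (var page = engine.Process(images[index], srcRect))
                    {
                        results[index] = page.GetText();
                    }
                }
            }
        })).ToArray();

        Task.WaitAll(workers);
        return results;
    }
}
```

Each worker pulls indexes from the shared queue until it is empty, so no engine is ever used by two tasks at once and results keep the order of the input list. For a continuous stream of 300k documents a day, the same idea would apply with long-lived workers reading from a queue that is filled as documents arrive.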



Source: https://stackoverflow.com/questions/44322767/how-to-speed-up-tesseract-ocr
