How to Make Tesseract Faster [closed]

Question

This is a long shot, but I have to ask. I need any ideas that might make the Tesseract OCR engine faster. I'm processing 2M PDFs consisting of about 20M pages of text, and I need to get every bit of performance that I can. The current estimate is that this will take about a year to complete if I do nothing.

I've tweaked the input images to get some boosts there, but I need to think about other approaches. I don't think improvements to the images will get me anywhere at this point.

For example:

Can Tesseract be recompiled with optimization flags or something like that?
Can shared CPU memory or GPUs be put into action?
Can I somehow tell Tesseract to use more memory (I've got lots of that)?
Are there any other approaches that make CPU-bound C++ programs faster?

Currently, Tesseract is being run by our task runner, Celery, which uses multi-processing to do its work. This way, I can make the server look like this:

I (obviously?) don't know what I'm talking about because I'm a Python developer and Tesseract is written in C++, but if there's any way to get a boost here, I'd love ideas.

Answer 1:

I also have huge OCR needs and Tesseract is prohibitively slow. I ended up going for a custom feedforward net similar to this one. You don't have to build it yourself, though; you can use a high-performance library like Nervana neon, which happens to be easy to use.

Then there are two parts to the problem:

1) Separate characters from non-characters.
2) Feed characters to the net.

Let's say you feed characters in batches of 1000, resize each character to 8 x 8 (64 pixels), and want to recognize 26 letters in both lowercase and uppercase (52), 10 digits, and 10 special characters (72 glyphs total). Then recognizing all 1000 characters comes down to two (non-associative!) matrix products:

(A dot B) dot C.

A would be a 1000 x 64 matrix, B would be a 64 x 256 matrix, C would be a 256 x 72 matrix.
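To get a feel for how little arithmetic that is, here is a rough count of the multiply-adds in the two products, using only the shapes given above. The constant names (`BATCH`, `PIXELS`, `HIDDEN`, `GLYPHS`) and the helper `multiply_adds` are my own labels for illustration, not something from the answer.

```python
# Shapes from the answer: A is 1000 x 64, B is 64 x 256, C is 256 x 72.
# The names below are illustrative labels only.
BATCH, PIXELS, HIDDEN, GLYPHS = 1000, 64, 256, 72


def multiply_adds(rows, inner, cols):
    """Multiply-add count for a (rows x inner) by (inner x cols) product."""
    return rows * inner * cols


first = multiply_adds(BATCH, PIXELS, HIDDEN)   # A dot B
second = multiply_adds(BATCH, HIDDEN, GLYPHS)  # (A dot B) dot C
total = first + second

print(first, second, total)  # 16384000 18432000 34816000
print(total // BATCH)        # 34816 multiply-adds per character
```

So a whole batch is roughly 35 million multiply-adds, or about 35 thousand per character, before counting the segmentation work.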

For me, this is several orders of magnitude faster than Tesseract. Just benchmark how fast your computer can do those matrix products (the elements are floats).

The matrix products are non-associative because after the first one you have to apply a (cheap) function called a ReLU.
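A minimal NumPy sketch of that computation, useful mainly for timing it on your own machine. It assumes NumPy is installed and uses random stand-in weights rather than a trained network, so the predictions are meaningless; only the shapes and the cost match the description above. The names `forward`, `relu`, and `seconds_per_batch` are mine, and this is not the code from the answer.

```python
import time

import numpy as np

# Sizes from the answer: 1000 characters per batch, 8 x 8 inputs (64 pixels),
# an inner dimension of 256, and 72 output glyphs.
BATCH, PIXELS, HIDDEN, GLYPHS = 1000, 64, 256, 72

rng = np.random.default_rng(0)
# Random stand-ins: a real net would use trained weights and real glyph crops.
A = rng.random((BATCH, PIXELS), dtype=np.float32)
B = rng.standard_normal((PIXELS, HIDDEN), dtype=np.float32)
C = rng.standard_normal((HIDDEN, GLYPHS), dtype=np.float32)


def relu(x):
    """Element-wise max(x, 0)."""
    return np.maximum(x, 0)


def forward(batch):
    """Two matrix products with a ReLU in between; one row of scores per character."""
    return relu(batch @ B) @ C


def seconds_per_batch(repeats=20):
    """Average wall-clock time of one forward pass over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        forward(A)
    return (time.perf_counter() - start) / repeats


scores = forward(A)
predicted = scores.argmax(axis=1)  # index of the highest-scoring glyph per character
elapsed = seconds_per_batch()
print(f"{elapsed * 1e3:.2f} ms per batch of {BATCH} characters")
```

Because of the ReLU, you cannot precompute `B @ C` once and do a single product: `relu(A @ B) @ C` generally differs from `A @ (B @ C)`.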

It took me a few months to get this whole enchilada to work from scratch, but OCR was a major part of my project.

Also, segmenting characters is non-trivial. Depending on your PDFs, it can be anything from an easy exercise in computer vision to an open research problem in artificial intelligence.

I'm not claiming this is the easiest or most effective way to do this... This is simply what I did!

Source: https://stackoverflow.com/questions/39300753/how-to-make-tesseract-faster

Tags

python

c++

performance

tesseract