Question
I'm trying to add new fonts to Tesseract OCR. I'm following this tutorial, but I'm running into some problems.
Here's what I've done so far:
Create training document
convert eng.myfont.exp0.pdf eng.myfont.exp0.tif
Train Tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox
This created my eng.myfont.exp0.box file.
I opened the file with moshpytt and made sure the characters were detected correctly.
Feed the box file back into tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0.box nobatch box.train.stderr
I have this result:
Tesseract Open Source OCR Engine v3.03 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 146
Found 146 good blobs.
TRAINING ... Font name = myfont.exp0
Generated training data for 6 words
The files eng.myfont.exp0.box.tr and eng.myfont.exp0.box.txt were generated.
Try to detect the character set used in the box file (this is where I get stuck)
unicharset_extractor *.box
Result:
unicharset_extractor: command not found
I also tried unicharset_extractor eng.myfont.exp0.box
with the same result.
I'm using:
- tesseract 3.03
- leptonica-1.70
- libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
- Ubuntu 14.04.1 LTS
Answer 1:
The training tools for Tesseract 3.03 RC were omitted from Ubuntu 14.04. So either fall back to Tesseract 3.02 or upgrade to Ubuntu 14.10, which should include them.
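To confirm that the tools really are missing (a "command not found" error means the shell could not find the binary on your PATH), you can check for each one directly. This is a minimal sketch, assuming a POSIX shell; the helper name check_training_tools is made up for illustration, and the list of tools is just the ones I would expect a Tesseract 3.x training setup to need:

```shell
# Hypothetical helper: report which of the given tools are reachable on PATH.
# Returns non-zero if at least one of them is missing.
check_training_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed tool list; adjust it to the steps of the tutorial you follow.
check_training_tools unicharset_extractor mftraining cntraining combine_tessdata \
    || echo "some training tools are not installed"
```

If unicharset_extractor shows up as missing, the problem is the installation, not the arguments you pass to it.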
Answer 2:
Ok, I googled this for you. Here's the answer:
You need to run all of the commands from the same folder where your input files are located.
From:
- https://code.google.com/p/tesseract-ocr/issues/detail?id=945 and
- https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Background_and_Limitations
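A rough sketch of what running the commands from the input folder can look like, assuming a POSIX shell. The helper name run_in_dir and the ./training directory are hypothetical and only there for illustration:

```shell
# Hypothetical helper: run a command from inside a given directory.
# The subshell keeps the caller's working directory unchanged.
run_in_dir() {
    dir=$1
    shift
    ( cd "$dir" && "$@" )
}

# Example, assuming the .tif and .box files live in ./training
# (left commented out because it needs the training tools installed):
# run_in_dir ./training unicharset_extractor eng.myfont.exp0.box
```

Plainly doing cd into the folder before running each command works just as well; the helper only avoids having to cd back afterwards.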
Source: https://stackoverflow.com/questions/26205480/adding-new-fonts-to-tesseract-3