I'm trying to add new fonts to Tesseract OCR. I'm following this tutorial, but I'm running into problems.
Here's what I've done so far:
Create training document
convert eng.myfont.exp0.pdf eng.myfont.exp0.tif
Train Tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox
This created my eng.myfont.exp0.box file.
I opened the file with moshpytt and made sure the characters were detected correctly.
Feed the box file back into Tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0.box nobatch box.train.stderr
I have this result:
Tesseract Open Source OCR Engine v3.03 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 146
Found 146 good blobs.
TRAINING ... Font name = myfont.exp0
Generated training data for 6 words
The eng.myfont.exp0.box.tr and eng.myfont.exp0.box.txt files were generated.
Try to detect the character set used in the box file (this is where I get stuck)
unicharset_extractor *.box
Result:
unicharset_extractor: command not found
I also tried unicharset_extractor eng.myfont.exp0.box
with the same result.
I'm using:
- tesseract 3.03
- leptonica-1.70
- libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
- Ubuntu 14.04.1 LTS
The training tools for Tesseract 3.03 RC were omitted from Ubuntu 14.04. Either fall back to Tesseract 3.02 or upgrade to Ubuntu 14.10, which should include them.
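To confirm that this is what is happening on your machine, you can check which training tools are actually on your PATH. The following is only a sketch: `check_tools` is a hypothetical helper name, and the list of tools is an assumption based on the usual Tesseract 3.x training steps, so adjust it to whatever your tutorial uses.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named tools are missing from PATH.
# Prints one "missing: NAME" line per absent tool and returns non-zero
# if at least one tool could not be found.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Assumed list of training tools; edit to match the steps you follow.
check_tools unicharset_extractor mftraining cntraining combine_tessdata \
    || echo "Some training tools are not installed or not on PATH."
```

If `unicharset_extractor` shows up as missing while `tesseract` itself runs fine, the installed package simply does not ship the training tools.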
Ok, I googled this for you. Here's the answer:
You need to run all commands in the same folder where your input files are located.
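One way to follow that advice without changing your shell's working directory is a small wrapper like the one below. This is a sketch only: `run_in_dir` is a hypothetical helper, and the directory in the example is a placeholder for wherever your .tif and .box files live.

```shell
#!/bin/sh
# Hypothetical helper: run a command from the directory that holds
# the training files. Usage: run_in_dir DIR COMMAND [ARGS...]
run_in_dir() {
    dir=$1
    shift
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && "$@" )
}

# Example (placeholder path, uncomment to use):
# run_in_dir /path/to/training unicharset_extractor eng.myfont.exp0.box
```

Pass explicit file names rather than `*.box` when calling it, because the shell expands the glob in the directory you call from, not in the target directory.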
From:
Source: https://stackoverflow.com/questions/26205480/adding-new-fonts-to-tesseract-3