Tesseract: Advantage to Multi-Page Training File vs. Multiple Separate Files?

问题

This SO answer suggests that training tesseract with .tif files has an advantage over .png files because the .tif files can have multiple pages and thus a larger training sample. Yet, this SO question discusses procedures for training with multiple images at once. More so, the man page for, e.g. mftraining suggests that it can accept multiple training files.

Is there any reason then not to train with multiple separate image files?

回答1:

It appears that using multiple images to train tesseract on a single font seems to work just fine. Below is a sketch of the workflow I employ:

# Convert files to .pdf
convert -density 600 Page1.pdf eng1.MyNewFont.exp1.png
convert -density 600 Page2.pdf eng1.MyNewFont.exp2.png

# Create .box files
tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1 -l eng batch.nochop makebox
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2 -l eng batch.nochop makebox

## correct boxes with jTessBoxEditor or another box editor ##

# Create two new box.tr files: eng1.MyNewFont.exp1.box.tr and eng1.MyNewFont.exp2.box.tr

tesseract eng1.MyNewFont.exp1.png eng1.MyNewFont.exp1.box -l eng1 nobatch box.train.stderr
tesseract eng1.MyNewFont.exp2.png eng1.MyNewFont.exp2.box -l eng1 nobatch box.train.stderr

# Extract characters from the two .box files
unicharset_extractor eng1.MyNewFont.exp1.box eng1.MyNewFont.exp2.box 

echo "MyNewFont 0 0 0 0 0" >> font_properties

# train using the two new box.tr files.
mftraining -F font_properties -U unicharset -O eng1.unicharset eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr 
cntraining eng1.MyNewFont.exp1.box.tr eng1.MyNewFont.exp2.box.tr

## rename files
mv inttemp  eng1.inttemp
mv normproto  eng1.normproto
mv pffmtable  eng1.pffmtable
mv shapetable  eng1.shapetable

combine_tessdata eng1. ## create .traineddata file.

回答2:

You can certainly train with multiple image files; Tesseract would treat them as having different, separate fonts. And there is a limit (64) on the number of images. If they share a common font, it would be better to put them in a multi-page TIFF. According to its specs, a TIFF file can be a container holding many images.

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract https://en.wikipedia.org/wiki/Tagged_Image_File_Format

来源：https://stackoverflow.com/questions/38018256/tesseract-advantage-to-multi-page-training-file-vs-multiple-separate-files

标签

tesseract