Question
I'm trying to train Tesseract (adding a new, digits-only font) by following the instructions here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
What I've done:
- Created a PDF with sample text, converted it to TIFF, and ran:
tesseract num.dot.exp0.tif num.dot.exp0 batch.nochop makebox digits
Then I edited the generated box file to correct the wrong detections.
- Ran tesseract in training mode:
tesseract num.dot.exp0.tif num.dot.exp0 nobatch box.train
and extracted the unicharset with:
unicharset_extractor num.dot.exp0.box
- Created the font_properties file:
echo "num.dot.exp0 0 0 0 0 0" > font_properties
Everything was OK so far, the .box and unicharset files are correct, num.dot.exp0.tr was generated.
Then I ran shapeclustering -F font_properties -U unicharset num.dot.exp0.tr and got the following error:
Reading num.dot.exp0.tr ...
*** glibc detected *** shapeclustering: double free or corruption (!prev): 0x098c52e0 ***
======= Backtrace: =========
/lib/i386-linux-gnu/libc.so.6(+0x75ee2)[0x82eee2]
/usr/lib/i386-linux-gnu/libstdc++.so.6(_ZdlPv+0x1f)[0x77d51f]
/usr/lib/i386-linux-gnu/libstdc++.so.6(_ZdaPv+0x1b)[0x77d57b]
shapeclustering(_ZN13GenericVectorIiE5clearEv+0x8b)[0x8050949]
shapeclustering(_ZN13GenericVectorIiED1Ev+0x2b)[0x805056b]
/usr/lib/libtesseract.so.3(_ZN9tesseract17TrainingSampleSet14SetupFontIdMapEv+0x137)[0x488699]
/usr/lib/libtesseract.so.3(_ZN9tesseract17TrainingSampleSet22OrganizeByFontAndClassEv+0x22)[0x48823c]
/usr/lib/libtesseract.so.3(_ZN9tesseract13MasterTrainer24ReplaceFragmentedSamplesEv+0x1d7)[0x477ebd]
/usr/lib/libtesseract.so.3(_ZN9tesseract13MasterTrainer15PostLoadCleanupEv+0x47)[0x47587b]
shapeclustering[0x804e2b9]
shapeclustering(main+0x5f)[0x804cb13]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7d24d3]
shapeclustering[0x804ca21]
(...)
00cba000-00cc1000 rw-p 0039c000 08:01 4465015 /usr/lib/libtesseract.so.3.0.2
00cc1000-00d5c000 rw-p 00000000 00:00 0
00ef8000-00f22000 r-xp 00000000 08:01 4211867 /lib/i386-linux-gnu/libm-2.15.so
00f22000-00f23000 r--p 00029000 08:01 4211867 /lib/i386-linux-gnu/libm-2.15.so
00f23000-00f24000 rw-p 0002a000 08:01 4211867 /lib/i386-linux-gnu/libm-2.15.so
08048000-08056000 r-xp 00000000 08:01 4464615 /usr/bin/shapeclustering
08056000-08057000 r--p 0000d000 08:01 4464615 /usr/bin/shapeclustering
08057000-08058000 rw-p 0000e000 08:01 4464615 /usr/bin/shapeclustering
093c5000-094cf000 rw-p 00000000 00:00 0 [heap]
b779a000-b77a0000 rw-p 00000000 00:00 0
b77b6000-b77ba000 rw-p 00000000 00:00 0
bfb6c000-bfb8d000 rw-p 00000000 00:00 0 [stack]
Aborted (core dumped)
Then an empty shapetable is created.
Have I done something wrong? Any clues as to why this is happening?
I'm using tesseract 3.02
Answer 1:
I found the problem. I should have used echo "dot 0 0 0 0 0" > font_properties instead of echo "num.dot.exp0 0 0 0 0 0" > font_properties.
shapeclustering worked properly after that. The font_properties file needs the bare font name ("dot", in my case), not the full base name of the training file ("num.dot.exp0").
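To avoid typing the wrong name by hand, the font name can be pulled out of the training file name. This is only a rough sketch: it assumes the file follows the lang.fontname.expN.ext naming convention from the training instructions, and font_name_from_file and font_properties_line are hypothetical helper names, not part of Tesseract.

```shell
#!/bin/sh
# Hypothetical helper: print the font name embedded in a training file
# name of the form <lang>.<fontname>.exp<N>.<ext>
# (for example num.dot.exp0.tr -> dot).
font_name_from_file() {
    base=$(basename "$1")
    case $base in
        *.*.exp*.*) ;;   # looks like <lang>.<fontname>.exp<N>.<ext>
        *) echo "unexpected file name: $base" >&2; return 1 ;;
    esac
    rest=${base#*.}                   # drop the leading "<lang>."
    printf '%s\n' "${rest%%.exp*}"    # drop the trailing ".exp<N>.<ext>"
}

# Hypothetical helper: print a font_properties line for that font with
# all five style flags (italic, bold, fixed, serif, fraktur) set to 0.
font_properties_line() {
    name=$(font_name_from_file "$1") || return 1
    printf '%s 0 0 0 0 0\n' "$name"
}
```

With these defined, font_properties_line num.dot.exp0.tr > font_properties would write the same line as the corrected echo command above. A file name that does not follow the convention is rejected rather than guessed at.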
Answer 2:
I was getting the same issue and fixed it by making sure the font name in the font_properties file is the same as the one used in eng.Imagefile.tr:
echo "NewFont 0 0 0 0 0" > font_properties
shapeclustering -F font_properties -U unicharset eng.Imagefile.tr
mftraining -F font_properties -U unicharset -O eng.unicharset eng.Imagefile.tr
Source: https://stackoverflow.com/questions/16471909/training-tesseract-shapeclustering-issue