Facing set datapath error while using Tesseract in Java

拜拜、爱过 submitted on 2019-12-12 03:02:36

Question


I am using Tesseract to recognize text from PDFs and I am facing a weird error. The error message is:

Error opening data file data/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
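Since the message names a relative path, one way to narrow the problem down is to check, from Java, which absolute file that path resolves to at the moment of the failure. The following is only a minimal diagnostic sketch: the class name TessdataCheck and its methods are hypothetical, and it assumes the layout shown in the error message (a "tessdata" folder directly under the datapath) and that the JVM's working directory matches the one the native library sees.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tess4J or Tesseract.
final class TessdataCheck {

    /** Builds the path <datapath>/tessdata/<language>.traineddata, as seen in the error message. */
    static Path trainedDataFile(String datapath, String language) {
        return Paths.get(datapath, "tessdata", language + ".traineddata");
    }

    /** Returns true if the traineddata file exists as a regular file and can be read. */
    static boolean isReadable(String datapath, String language) {
        Path file = trainedDataFile(datapath, language);
        return Files.isRegularFile(file) && Files.isReadable(file);
    }

    public static void main(String[] args) {
        // "data" and "eng" are the values passed to TessBaseAPIInit3 in the question.
        Path file = trainedDataFile("data", "eng");
        System.out.println("Working directory: " + Paths.get("").toAbsolutePath());
        System.out.println("Looking for:       " + file.toAbsolutePath());
        System.out.println("Readable:          " + isReadable("data", "eng"));
    }
}
```

Calling isReadable("data", "eng") right before each initialization, and again when the error appears, would show whether the file is genuinely unreachable from Java at that point or whether only the native library fails to open it.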

Now, I understand the meaning of this error, and my path is already set to the parent directory of the data folder. The weird thing is that I don't get this error immediately when I run my code; I get it only after recognizing 10-15 PDFs (consisting of roughly 40 pages each). If I then run my program again, starting from the PDF at which it stopped, I get no errors for another 10-15 PDFs.

I don't understand the reason behind this. If someone has faced this or knows the reason behind it, please comment.

This is a function I use to recognize text from a particular page of my pdf. I call this function for each page of every pdf.

public static void getOCRDataFromPage(PDDocument pdDoc, int page, ArrayList<Characters> ch)
{
    // A new Tesseract handle is created and initialized on every call, i.e. once per page.
    TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
    TessAPI1.TessBaseAPIInit3(handle, "data", "eng");
    try
    {
        // Page size in PDF units, used below to map pixel coordinates back onto the page.
        double w = pdDoc.getPage(page - 1).getCropBox().getWidth();
        double h = pdDoc.getPage(page - 1).getCropBox().getHeight();

        // Render the page (0-based index) as a 300 DPI grayscale image.
        PDFRenderer pdfRenderer = new PDFRenderer(pdDoc);
        BufferedImage image = pdfRenderer.renderImageWithDPI(page - 1, 300, ImageType.GRAY);
        //image=ImageHelper.convertImageToGrayscale(image);
        ImageIOUtil.writeImage(image, "G:/Trial/tempImg.png", 300);
        int bpp = image.getColorModel().getPixelSize();
        int bytespp = bpp / 8;
        int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
        int height = image.getHeight();
        int width = image.getWidth();

        TessAPI1.TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), width, height, bytespp, bytespl);
        TessAPI1.TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO);
        image.flush();
        image = null;
        ETEXT_DESC monitor = new ETEXT_DESC();
        TessAPI1.TessBaseAPIRecognize(handle, monitor);
        TessResultIterator ri = TessAPI1.TessBaseAPIGetIterator(handle);
        TessPageIterator pi = TessAPI1.TessResultIteratorGetPageIterator(ri);
        TessAPI1.TessPageIteratorBegin(pi);
        int level = TessAPI1.TessPageIteratorLevel.RIL_WORD;
        ArrayList<Words> wd = new ArrayList<Words>();
        // Walk the result word by word, collecting text, bounding box and font attributes.
        do {
            Pointer ptr = TessAPI1.TessResultIteratorGetUTF8Text(ri, level);
            if (ptr == null || ptr.toString().length() == 0)
                break;
            String word = ptr.getString(0);
            TessAPI1.TessDeleteText(ptr);
            float confidence = TessAPI1.TessResultIteratorConfidence(ri, level);
            IntBuffer leftB = IntBuffer.allocate(1);
            IntBuffer topB = IntBuffer.allocate(1);
            IntBuffer rightB = IntBuffer.allocate(1);
            IntBuffer bottomB = IntBuffer.allocate(1);
            TessAPI1.TessPageIteratorBoundingBox(pi, level, leftB, topB, rightB, bottomB);
            int left = leftB.get();
            int top = topB.get();
            int right = rightB.get();
            int bottom = bottomB.get();
            IntBuffer boldB = IntBuffer.allocate(1);
            IntBuffer italicB = IntBuffer.allocate(1);
            IntBuffer underlinedB = IntBuffer.allocate(1);
            IntBuffer monospaceB = IntBuffer.allocate(1);
            IntBuffer serifB = IntBuffer.allocate(1);
            IntBuffer smallcapsB = IntBuffer.allocate(1);
            IntBuffer pointSizeB = IntBuffer.allocate(1);
            IntBuffer fontIdB = IntBuffer.allocate(1);
            String fontName = TessAPI1.TessResultIteratorWordFontAttributes(ri, boldB, italicB, underlinedB,
                    monospaceB, serifB, smallcapsB, pointSizeB, fontIdB);
            boolean bold = boldB.get() == TRUE;
            boolean italic = italicB.get() == TRUE;
            boolean underlined = underlinedB.get() == TRUE;
            boolean monospace = monospaceB.get() == TRUE;
            boolean serif = serifB.get() == TRUE;
            boolean smallcaps = smallcapsB.get() == TRUE;
            int pointSize = pointSizeB.get();
            int fontId = fontIdB.get();
            Words chr = new Words(); // Words is a user-defined class
            chr.c = word.concat(" ");
            chr.b = bold ? 1 : 0;
            // Scale pixel coordinates to page coordinates.
            chr.x = left * w / width;
            chr.size = pointSize;
            chr.name = fontName;
            chr.y = top * h / height;
            chr.w = (right - left) * w / width;
            chr.h = (bottom - top) * h / height;
            // Keep only words with a point size above 1.
            if (pointSize > 1)
                wd.add(chr);
        } while (TessAPI1.TessPageIteratorNext(pi, level) == TRUE);

        sortWords(wd);
        wordsToChar(wd, ch);
        wd = null;
    }
    catch (Exception e)
    {
        System.out.println(e.toString());
        e.printStackTrace();
    }
    finally
    {
        //TessAPI1.TessBaseAPIClearAdaptiveClassifier(handle);
        TessAPI1.TessBaseAPIClearPersistentCache(handle);
        TessAPI1.TessBaseAPIClear(handle);
        //TessAPI1.TessBaseAPIDelete(handle);
        TessAPI1.TessBaseAPIEnd(handle);
        handle = null;
    }

}
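To make the arithmetic in the function easier to follow, here is a small standalone sketch of the two calculations it relies on: the bytes-per-line value passed to TessBaseAPISetImage and the scaling of pixel coordinates back to page coordinates. The class name PageScale and its methods are hypothetical, and the page size used in main (612 x 792 units rendered at 300 DPI with 72 units per inch, giving 2550 x 3300 pixels) is an assumed example, not a value taken from my PDFs.

```java
// Hypothetical helper that mirrors the arithmetic in getOCRDataFromPage.
final class PageScale {

    /** Bytes needed for one packed image row, rounded up to whole bytes. */
    static int bytesPerLine(int widthPx, int bitsPerPixel) {
        return (int) Math.ceil(widthPx * bitsPerPixel / 8.0);
    }

    /** Maps a pixel coordinate onto the page, given the page extent and the image extent. */
    static double toPageUnits(int pixel, double pageExtent, int imageExtent) {
        return pixel * pageExtent / imageExtent;
    }

    public static void main(String[] args) {
        // Assumed example: a 612 x 792 page rendered at 300 DPI.
        double pageWidth = 612.0;
        int imageWidth = (int) (pageWidth * 300 / 72);

        System.out.println("Image width in pixels: " + imageWidth);
        System.out.println("Bytes per line (8 bpp): " + bytesPerLine(imageWidth, 8));
        System.out.println("Pixel 1275 on the page: " + toPageUnits(1275, pageWidth, imageWidth));
    }
}
```

With an 8-bit grayscale image the bytes-per-line value equals the width in pixels, which is why bytespp is 1 in my case.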

If anyone spots anything out of the ordinary, please reply.

Source: https://stackoverflow.com/questions/38010063/facing-set-datapath-error-while-using-tesseract-in-java
