Unable to extract text from a scanned PDF using TesseractOCRConfig in Apache Tika

有些话、适合烂在心里 submitted on 2019-12-01 09:07:30

Steps to solve this:

  1. Install Tesseract on your system. On Windows, use the 'tesseract-ocr-setup-3.05.00dev.exe' installer from https://sourceforge.net/projects/tesseract-ocr-alt/files/, then set its install location in your config (the tPath variable in the code).

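    Java code:

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); // raise the default write limit
    TesseractOCRConfig config = new TesseractOCRConfig();
    config.setTesseractPath(tPath); // tPath: the Tesseract install folder
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true); // scanned pages are images, so extract them for OCR
    pdfConfig.setExtractUniqueInlineImagesOnly(false); // set to false if the PDF contains multiple images
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    // Needed to make sure recursive parsing of the extracted images happens.
    parseContext.set(Parser.class, parser);
    // stream: an InputStream opened on the scanned PDF; the OCR text ends up in handler.toString().
    parser.parse(stream, handler, new Metadata(), parseContext);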
    
  2. Maven dependencies:

    <dependencies>
      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
      </dependency>
      <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
      </dependency>
      <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
      </dependency>
    </dependencies>
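
If OCR still produces no text, a likely cause is that the path from step 1 does not point at the Tesseract install. Here is a minimal sketch, using only the JDK, that checks a folder before you pass it to setTesseractPath. The class name TesseractPathCheck and both helper methods are hypothetical, and the check assumes the configured path is the folder holding the executable, named tesseract.exe on Windows and tesseract elsewhere.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: a sanity check for the Tesseract folder.
final class TesseractPathCheck {

    // Assumption: the executable is "tesseract.exe" on Windows and "tesseract" elsewhere.
    static String executableName(String osName) {
        return osName.toLowerCase().startsWith("windows") ? "tesseract.exe" : "tesseract";
    }

    // True if installDir is a directory that contains the executable as a regular file.
    static boolean looksLikeTesseractDir(Path installDir, String osName) {
        return Files.isDirectory(installDir)
                && Files.isRegularFile(installDir.resolve(executableName(osName)));
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java TesseractPathCheck <tesseract-install-folder>");
            return;
        }
        Path dir = Paths.get(args[0]);
        String os = System.getProperty("os.name");
        if (looksLikeTesseractDir(dir, os)) {
            System.out.println("Found " + executableName(os) + " in " + dir);
        } else {
            System.out.println("No " + executableName(os) + " in " + dir
                    + "; check the value passed to setTesseractPath");
        }
    }
}
```

Run it with the same value you use for tPath. If it reports that the executable is missing, fix the path before looking at the Tika configuration.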

I think it may be helpful. Thanks.
