Unable to extract scanned pdf using TesseractOCRConfig Apache Tika

迷失自我 2021-01-15 10:04

My PDF contains scanned images, and I want to extract the text from them.

What I tried: I parsed the file with AutoDetectParser, but it produced no output.

I followed the solution provi

1 answer
  •  青春惊慌失措
    2021-01-15 10:17

    Steps to solve this:

    1. Install Tesseract on your system. On Windows, use the 'tesseract-ocr-setup-3.05.00dev.exe' installer from https://sourceforge.net/projects/tesseract-ocr-alt/files/, then pass the install location to Tika through TesseractOCRConfig.setTesseractPath.

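      Java code:

      import org.apache.tika.metadata.Metadata;
      import org.apache.tika.parser.AutoDetectParser;
      import org.apache.tika.parser.ParseContext;
      import org.apache.tika.parser.Parser;
      import org.apache.tika.parser.ocr.TesseractOCRConfig;
      import org.apache.tika.parser.pdf.PDFParserConfig;
      import org.apache.tika.sax.BodyContentHandler;

      Parser parser = new AutoDetectParser();
      BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
      TesseractOCRConfig config = new TesseractOCRConfig();
      config.setTesseractPath(tPath); // tPath: the Tesseract install location from step 1
      PDFParserConfig pdfConfig = new PDFParserConfig();
      pdfConfig.setExtractInlineImages(true); // hand the images inside the PDF to the embedded-document parsers
      pdfConfig.setExtractUniqueInlineImagesOnly(false); // false: process every image occurrence, even if the same image repeats across pages
      ParseContext parseContext = new ParseContext();
      parseContext.set(TesseractOCRConfig.class, config);
      parseContext.set(PDFParserConfig.class, pdfConfig);
      // Registering the parser itself is required so that embedded images are parsed recursively.
      parseContext.set(Parser.class, parser);
      // Pass the context to parse(); 'stream' is a placeholder for an InputStream over the PDF.
      parser.parse(stream, handler, new Metadata(), parseContext);
      String text = handler.toString();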
      
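    2. Maven dependencies:

      <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
      </dependency>
      <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
      </dependency>
      <dependency>
        <groupId>com.github.jai-imageio</groupId>
        <artifactId>jai-imageio-core</artifactId>
        <version>1.3.1</version>
      </dependency>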
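
If parsing still returns no text, a common cause is that the Tesseract location from step 1 does not point at a working executable. The sketch below is one way to check that before calling Tika. It uses only the JDK (Java 9 or later for `Redirect.DISCARD`). The class `TesseractCheck` and its methods are hypothetical helpers, not part of Tika, and the code assumes that the configured path is the directory holding the executable and that the binary accepts `--version`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not a Tika class): checks that a Tesseract
// install directory contains an executable that actually runs.
class TesseractCheck {

    // Builds the expected executable path inside the install directory.
    static Path resolveExecutable(String tesseractDir) {
        boolean windows = System.getProperty("os.name").toLowerCase().contains("win");
        return Paths.get(tesseractDir, windows ? "tesseract.exe" : "tesseract");
    }

    // Returns true only if the executable exists and "--version" exits with code 0.
    static boolean isTesseractAvailable(String tesseractDir) {
        Path exe = resolveExecutable(tesseractDir);
        if (!Files.isRegularFile(exe)) {
            return false;
        }
        try {
            Process process = new ProcessBuilder(exe.toString(), "--version")
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on unread output
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // treat a hang as "not available"
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false; // the file could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the same directory that is given to setTesseractPath.
        String dir = args.length > 0 ? args[0] : ".";
        System.out.println("Tesseract available: " + isTesseractAvailable(dir));
    }
}
```

Running it with the install directory as the only argument prints whether the executable was found and started. A `false` result suggests fixing the path before looking at the Tika configuration.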

    I hope this helps. Thanks.
