Unable to extract text from a scanned PDF using TesseractOCRConfig in Apache Tika
Question

My PDF contains scanned images and I want to extract text from it.

What I tried: I tried AutoDetectParser but got no output. I followed the solution provided in "Apache Tika extract scanned PDF files" and also the Apache Tika Jira issue at https://issues.apache.org/jira/browse/TIKA-1729, but I get an empty string without any error.

My configuration: Windows 7 64-bit, JDK 1.8.0_45.

Any kind of help is welcome.

Answer 1:

Steps to follow to solve this:

Install Tesseract on your system using 'tesseract-ocr