I am trying to identify the non-text data in a PDF in which every page is saved as an image. I am able to use Tesseract OCR with Apache Tika to read the text in the PDF, but fo