I have a bunch of PDF files that came from scanned documents. The files contain a mix of images and text. Some were scanned as images with no OCR, so each PDF page is one
Sorry to dig up an old thread, but if you found this, have a look at my thread:
Batch OCR Program for PDFs
You can get extra information about a PDF by catting it on Unix/Linux/OS X or by opening it in "rb" mode in Python. (Of course, that's Python and you didn't want to use that, but maybe whatever you are using has something equivalent.)
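For what it's worth, here is a rough sketch of the "rb" idea in Python. It reads the raw bytes and counts a few markers that tend to show up when a page has real text (font objects) versus only scanned images (image XObjects). The function names and the choice of markers are my own assumptions, not a standard tool, and this is only a heuristic: many PDFs keep these dictionaries inside compressed object streams, where a plain byte search won't see them.

```python
import re


def summarize_pdf_bytes(data: bytes) -> dict:
    """Count a few raw markers in a PDF's bytes (heuristic only)."""
    # A PDF starts with a header such as b"%PDF-1.4".
    header = re.match(rb"%PDF-(\d+\.\d+)", data)
    return {
        "version": header.group(1).decode("ascii") if header else None,
        # "/Font" as a whole name; skips /FontDescriptor, /FontFile, etc.
        "font_refs": len(re.findall(rb"/Font\b", data)),
        # Image XObjects are declared with "/Subtype /Image".
        "image_refs": len(re.findall(rb"/Subtype\s*/Image\b", data)),
    }


def summarize_pdf_file(path: str) -> dict:
    """Open the file in binary ("rb") mode and summarize its bytes."""
    with open(path, "rb") as f:
        return summarize_pdf_bytes(f.read())
```

Calling `summarize_pdf_file("scan.pdf")` (a hypothetical file name) returns a small dict. A file with images but zero font references is a reasonable candidate for OCR, though I'd treat the result as a hint rather than proof.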