I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).
Until now I've found the rather old and simple PDF-toolkit (a pd
Here are some options:
http://en.wikipedia.org/wiki/List_of_PDF_software
From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/
Depending on your requirements and what the PDFs look like, you could use the Google Docs API (upload the PDF, then download it as text), or try an OCR tool like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to shell out to it, like gocr -i whatever.pdf (I think it works with PDFs; if it doesn't, you'd need to convert the pages to images first).
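For the shelling-out part, here's a rough sketch of how that might look from Ruby using Open3 from the standard library. The helper name capture_command is made up for illustration, and the gocr line is left commented out since it assumes gocr is installed and can actually read your input file:

```ruby
require "open3"

# Hypothetical helper: runs an external command and returns its standard
# output as a String. Passing the arguments separately avoids shell quoting
# problems with file names. Raises if the command exits with a non-zero status.
def capture_command(*argv)
  stdout, stderr, status = Open3.capture3(*argv)
  raise "#{argv.first} failed: #{stderr.strip}" unless status.success?
  stdout
end

# Assumes gocr is on the PATH and accepts this input file:
# text = capture_command("gocr", "-i", "whatever.pdf")
```

The same helper should work for any of the other command-line tools, since it only cares about the exit status and what the tool prints to standard output.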
The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.