Ruby: Reading PDF files

孤独总比滥情好 2020-12-02 06:05

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OS X).

Until now I've found the rather old and simple PDF-toolkit (a pd

6 Answers
  •  误落风尘
    2020-12-02 06:49

    Here are some options:

    http://en.wikipedia.org/wiki/List_of_PDF_software

    From that link, and from searching SourceForge, there are a couple of command-line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/

    Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to shell out to run it, like gocr -i whatever.pdf (I think it works with PDFs).

    The downside to all of these is that they're not pure-Ruby implementations, but a lot of the good (and free) OCR projects seem to be built as standalone command-line tools.
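
    Since none of these tools is pure Ruby, using one means shelling out. Here's a rough sketch of how that could look with Open3 from the standard library. The names capture_tool_output, pdf_to_html and ConversionError are made up for illustration, and the pdftohtml flags (-i, -noframes, -stdout) assume the Poppler build of that tool, so check them against whatever version you have installed.

```ruby
require "open3"

# Raised when the external tool is missing or exits with a non-zero status.
class ConversionError < StandardError; end

# Runs an external command and returns its standard output as a String.
# The command is passed as separate arguments, so file names with spaces
# or shell metacharacters are not interpreted by a shell.
def capture_tool_output(*argv)
  stdout, stderr, status = Open3.capture3(*argv)
  unless status.success?
    raise ConversionError, "#{argv.first} failed: #{stderr.strip}"
  end
  stdout
rescue Errno::ENOENT
  raise ConversionError, "#{argv.first} is not installed or not on PATH"
end

# Hypothetical wrapper around pdftohtml. The flags are an assumption:
# -i skips images, -noframes produces a single document, and -stdout
# writes the result to standard output instead of a file.
def pdf_to_html(path)
  capture_tool_output("pdftohtml", "-i", "-noframes", "-stdout", path)
end

# Example usage (left commented out because it needs pdftohtml installed):
#   html = pdf_to_html("whatever.pdf")
#   puts html.length
```

    Passing the command as an argument list rather than a single string keeps odd file names from being interpreted by the shell, and raising on a non-zero exit status means a failed conversion doesn't silently turn into an empty string.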
