Question
What good libraries are there, in any common language, for converting PDF to HTML?
Answer 1:
Apache PDFBox has an HTML extraction capability: http://pdfbox.apache.org/
Answer 2:
If you are working on a Windows box, I think Amyuni has a library for this as well. Their PDF Document Converter is accessible as a DLL, can be used from most of the languages supported by Visual Studio, and can convert to RTF, HTML, Excel, JPEG, and TIFF.
Answer 3:
http://www.lowagie.com/iText/ Open-source library for both Java and C#.
Answer 4:
The pdftohtml program converts PDF to HTML and XML, and preserves the position information of the text, which is helpful for scraping tables.
It seems to be based on the xpdf library and has a Windows binary, too.
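To illustrate why the position information helps with tables, here is a minimal shell sketch. It assumes pdftohtml was run with its -xml option and that the resulting file holds one `<text top="..." left="..." ...>` element per line with `top` and `left` as the first two attributes, which is how Poppler's build usually writes it. `text_positions` and `report.pdf` are made-up names for this example.

```shell
#!/bin/sh
# Print "top left text" for each <text> element found in a pdftohtml -xml file.
# Assumption: one <text ...>...</text> element per line, top/left attributes first.
text_positions() {
    sed -n 's/^.*<text top="\([0-9]*\)" left="\([0-9]*\)"[^>]*>\(.*\)<\/text>.*$/\1 \2 \3/p' "$1"
}

# Possible use (needs pdftohtml on PATH; report.pdf is a placeholder):
#   pdftohtml -xml report.pdf report    # writes report.xml
#   text_positions report.xml
```

Lines that share the same `top` value sit on the same visual row, so grouping on the first column and ordering by the second is one way to rebuild table rows. Inline markup inside a cell (for example `<b>`) is passed through untouched.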
Answer 5:
On Linux, install pdftohtml. For batch conversion of all files in a folder, use:
ls *.pdf | xargs -I{} pdftohtml {}
This creates an HTML site with all the references and images from the original documents, with every page in a separate HTML file. It is very useful for converting project documentation so that you can search the files by phrase with the ordinary system file search.
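The `ls | xargs` one-liner can trip over unusual file names (quotes or backslashes, for instance). A plain loop avoids that. This is only a sketch: `convert_all` is a made-up helper name, and the optional second argument for swapping the converter command is an addition for this example, not part of pdftohtml.

```shell
#!/bin/sh
# Convert every PDF in a directory, one converter call per file.
# $1 = directory (default: current directory)
# $2 = converter command (default: pdftohtml); hypothetical hook for other tools
convert_all() {
    dir=${1:-.}
    converter=${2:-pdftohtml}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                  # the glob matched nothing
        "$converter" "$pdf" || echo "failed: $pdf" >&2
    done
}

# Possible use (needs pdftohtml on PATH; ./docs is a placeholder):
#   convert_all ./docs
```

Because each path is passed as a single quoted argument, file names with spaces or quotes reach the converter unchanged, and a failure on one file does not stop the rest of the batch.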
Answer 6:
In Perl, you can use the SWISH::Filter plugin SWISH::Filters::Pdf2HTML. (It requires the xpdf package.)
For the reverse (HTML to PDF), see this question.
Answer 7:
If you're looking for a way to convert PDF to HTML once or twice, then I recommend Adobe Online Conversion.
If it's an API you're after then http://www.pdfonline.com/ has an SDK that should suit your needs.
If it's a library you're after then please let us know which server-side language you prefer.
Answer 8:
Given the vagueness of the original question, I'm going to go ahead and give a solution that will work with any language that can execute command-line apps. Although it can be a little tricky to set up, OpenOffice can be run in headless mode on a server and, with the help of jodconverter, can convert any file format to any other file format (well, any format conversion that OpenOffice can handle, that is).
Here are a couple of links that help with the setup:
- http://iwonderdesigns.posterous.com/how-to-run-jodconverteropenoffice-on-your-hos
- http://www.artofsolving.com/node/10
Source: https://stackoverflow.com/questions/1638937/how-can-i-convert-pdf-to-html