What's a good method for extracting text from a PDF using C# or classic ASP (VBScript)? [closed]

Question

Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to.

Something that works with C# or classic ASP (VBScript) would be ideal. I also need to be able to separate the pages of the PDF.

This question had some interesting suggestions, especially pdftotext, but I'd like to avoid calling an external command-line app if I can.

Answer 1:

You can use the IFilter interface built into Windows to extract text and properties (author, title, etc.) from any supported file type. It's a COM interface, so you would have to use the .NET interop facilities.

You'd also have to download the free PDF IFilter driver from Adobe.

Answer 2:

Here is a good list: Open Source Libs for PDF/C#

Most of these are geared toward creating PDFs, but they should have read capability as well.

There is this one as well: iText

I have only played with iText before. Nothing major.
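
For what it's worth, page-by-page extraction with iText might look roughly like the sketch below. It assumes the iTextSharp 5.x .NET port (namespaces iTextSharp.text.pdf and iTextSharp.text.pdf.parser); newer iText releases use different namespaces and classes. The class name PdfTextSketch and the ExtractPages helper are made up for illustration.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;         // assumption: iTextSharp 5.x is referenced
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iText itself.
public static class PdfTextSketch
{
    // Returns the extracted text of each page, in page order.
    public static List<string> ExtractPages(string path)
    {
        var pages = new List<string>();
        var reader = new PdfReader(path);
        try
        {
            // iText page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            // Release the underlying file handle even if extraction fails.
            reader.Close();
        }
        return pages;
    }
}
```

Calling PdfTextSketch.ExtractPages("input.pdf") would give one string per page, which keeps the text of each page separate. Pages that contain only scanned images have no text layer, so they will most likely come back empty.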

Answer 3:

We've used Aspose with good results.

Answer 4:

The Docotic.Pdf library can be used to extract formatted or plain text from PDF documents.

The library can read PDF documents of any version (up to the latest published standard). Extraction of pages is also supported by the library.

Links to sample code:

How to extract text from PDF
How to extract PDF pages

Disclaimer: I work for the vendor of the library.

Answer 5:

In addition to the accepted answer: there are also commercial alternatives to the Adobe IFilter for text indexing. They provide a similar API and offer additional premium functionality:

Foxit PDF IFilter: provides much faster text indexing than Adobe's plugin.
PDFLib PDF iFilter: includes support for damaged PDF documents, plus an additional API to run your own queries.

If you are looking for a single tool that can be used from both managed .NET apps and legacy programming languages such as classic ASP or VB6, the commercial ByteScout PDF Extractor SDK would fit, as it provides both .NET and ActiveX/COM APIs.

Disclaimer: I work for ByteScout

Source: https://stackoverflow.com/questions/46869/whats-a-good-method-for-extracting-text-from-a-pdf-using-c-sharp-or-classic-asp

Tags

pdf

text-extraction

pdf-scraping