C# PDFSharp: Examples of how to strip text from PDF?

Question

I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".

Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.

Are there examples of how to do it?

Thanks!

Answer 1:

Extracting text from a PDF file with PDFsharp is not a simple task.

It was discussed recently in this thread: https://stackoverflow.com/a/9161732/162529

Answer 2:

Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:

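// Requires: using PdfSharp.Pdf.IO;
var doc = PdfReader.Open(docPath);
// Content stream of the first page as a string: text mixed with PDF operators.
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();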

doc.Pages.Count gives you the total number of pages, and you access each one by index through doc.Pages. I don't recommend using foreach and Linq here, as the interfaces aren't implemented well. The index passed into GetDictionary selects which content element of the page to read; a page can have more than one, and this varies with how the document was produced. If you don't get the text you're looking for, try looping through all of the elements.
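
The "loop through all of the elements" advice could look roughly like this. It is a minimal sketch, assuming the PDFsharp 1.x API; the class and method names (RawPageText, ReadAllContentStreams) are made up for illustration, and the result is still raw content-stream text, not clean prose.

```csharp
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class RawPageText
{
    // Hypothetical helper: concatenates every content stream of every page.
    public static string ReadAllContentStreams(string docPath)
    {
        var sb = new StringBuilder();
        PdfDocument doc = PdfReader.Open(docPath, PdfDocumentOpenMode.ReadOnly);

        // Index-based loops, as the answer recommends, instead of foreach/Linq.
        for (int p = 0; p < doc.Pages.Count; p++)
        {
            PdfPage page = doc.Pages[p];
            for (int e = 0; e < page.Contents.Elements.Count; e++)
            {
                PdfDictionary dict = page.Contents.Elements.GetDictionary(e);
                if (dict == null || dict.Stream == null)
                    continue; // element without stream data, nothing to read

                // Same call as in the answer: the stream rendered as a string.
                sb.AppendLine(dict.Stream.ToString());
            }
        }
        return sb.ToString();
    }
}
```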

The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
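
As a rough illustration of the Regex approach, the sketch below pulls PDF literal strings (the parts in parentheses before Tj and inside TJ arrays) out of raw content-stream text. The class and method names are hypothetical, and it assumes the text is stored as simple literal strings: hex strings, nested unescaped parentheses and fonts with custom encodings are not handled.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamText
{
    // A PDF literal string: "(", then escaped pairs or plain characters, then ")".
    private static readonly Regex LiteralString =
        new Regex(@"\((?<body>(?:\\.|[^\\()])*)\)", RegexOptions.Singleline);

    // Hypothetical helper: returns the literal strings found in the content text.
    public static List<string> ExtractLiteralStrings(string content)
    {
        var result = new List<string>();
        foreach (Match m in LiteralString.Matches(content))
        {
            // Undo only the escapes \( \) and \\; others (e.g. \n) are left as-is.
            string body = Regex.Replace(m.Groups["body"].Value, @"\\([()\\])", "$1");
            result.Add(body);
        }
        return result;
    }
}
```

For the input `BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld \(x\))] TJ ET` this returns the three strings `Hello`, `Wor` and `ld (x)`; joining TJ fragments back into words is left to the caller.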

Answer 3:

An example of extracting images from a .pdf file with the PDFsharp library:

link

library
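
For readers without the linked sample at hand, here is a minimal sketch of the usual approach, assuming the PDFsharp 1.x API: walk each page's /Resources and /XObject dictionaries and write out the streams of JPEG-encoded (/DCTDecode) images unchanged, so no re-compression takes place. The names JpegExporter and ExportJpegs are made up for illustration; images using other filters, and images nested inside form XObjects, are skipped.

```csharp
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

public static class JpegExporter
{
    // Hypothetical helper: saves every JPEG image XObject and returns the count.
    public static int ExportJpegs(string docPath, string outputDir)
    {
        int count = 0;
        PdfDocument doc = PdfReader.Open(docPath, PdfDocumentOpenMode.ReadOnly);

        for (int p = 0; p < doc.Pages.Count; p++)
        {
            PdfDictionary resources = doc.Pages[p].Elements.GetDictionary("/Resources");
            if (resources == null) continue;
            PdfDictionary xObjects = resources.Elements.GetDictionary("/XObject");
            if (xObjects == null) continue;

            foreach (PdfItem item in xObjects.Elements.Values)
            {
                // XObjects are normally stored as indirect references.
                var reference = item as PdfReference;
                var xObject = reference == null ? null : reference.Value as PdfDictionary;
                if (xObject == null || xObject.Stream == null) continue;
                if (xObject.Elements.GetString("/Subtype") != "/Image") continue;

                // Only a single /DCTDecode filter means the raw bytes are a JPEG file.
                var filter = xObject.Elements["/Filter"] as PdfName;
                if (filter == null || filter.Value != "/DCTDecode") continue;

                string path = Path.Combine(outputDir, string.Format("Image{0}.jpeg", count++));
                File.WriteAllBytes(path, xObject.Stream.Value);
            }
        }
        return count;
    }
}
```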

EDIT:

If you then want to extract text from the images, you have to use an OCR library.

There are two good OCR libraries: tessnet and MODI.
Link to thread on Stack Overflow.
I can fully recommend MODI, which I am using now. There is a sample on CodeProject.

EDIT 2 :

If you don't want to read text from the extracted images, you should write a new PDF document and put all of the images into it. For writing PDFs I use MigraDoc. It is not a difficult library to use.
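
A minimal sketch of that last step, assuming the MigraDoc 1.x API and that the images have already been saved as files; ImagePdfWriter and WriteImagesToPdf are hypothetical names. Each image goes on its own page at its natural size, so scaling oversized images to the page is left out.

```csharp
using System.Collections.Generic;
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;

public static class ImagePdfWriter
{
    // Hypothetical helper: builds a text-free PDF with one image per page.
    public static void WriteImagesToPdf(IList<string> imagePaths, string outputPath)
    {
        var document = new Document();
        Section section = document.AddSection();

        for (int i = 0; i < imagePaths.Count; i++)
        {
            if (i > 0)
                section.AddPageBreak(); // start a new page for each further image
            section.AddImage(imagePaths[i]);
        }

        // Render the MigraDoc document model to a PDF and save it.
        var renderer = new PdfDocumentRenderer(true);
        renderer.Document = document;
        renderer.RenderDocument();
        renderer.PdfDocument.Save(outputPath);
    }
}
```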

Source: https://stackoverflow.com/questions/9591992/c-sharp-pdfsharp-examples-of-how-to-strip-text-from-pdf

Tags

text

pdfsharp