iText C# Read pdf for regular expression match, extract only those pages to new pdf

Question

I'm having trouble reading an existing PDF for regular expression matches and then extracting the matching pages to a new PDF.

I've decided to clear my head and start again from scratch. I'm able to take a 3-page PDF and extract the pages individually into a new file using this code:

static void Main(string[] args)
{
    string srcFile = @"C:\Users\steve\Desktop\original.pdf";
    string dstFile = @"C:\Users\steve\Desktop\result.pdf";
    PdfReader reader = new PdfReader(srcFile);
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
    document.Open();
    // Copy every page of the source into the single PdfCopy created above
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
        copy.AddPage(importedPage);
    }
    document.Close();
    reader.Close();
}

This code works because the PdfCopy instance is created OUTSIDE the for loop. The issue I'm running into is that the only way I can seem to get the rest of the code working (converting each page to text and finding regex matches) is to place that functionality, including the PdfCopy instance, inside the for loop.

Here's the code from my initial question: C# iTextSharp - Code overwriting instead of appending pages

Answer 1:

As @Paulo already proposed in a comment:

You have to select the pages with regex or whatever other way before entering the loop. Inside the loop only those pages will be added.

In code this could look like this:

string srcFile = @"C:\Users\steve\Desktop\original.pdf";
string dstFile = @"C:\Users\steve\Desktop\result.pdf";

PdfReader reader = new PdfReader(srcFile);
ICollection<int> pagesToKeep = new List<int>();

for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // Use the text extraction strategy of your choice here...
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

    // Use the content text test of your choice here...
    if (currentText.IndexOf("special") >= 0) // IndexOf returns 0 for a match at the very start, -1 for no match
    {
        pagesToKeep.Add(page);
    }
}

// Copy selected pages using PdfCopy
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
document.Open();
foreach (int page in pagesToKeep)
{
    PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
    copy.AddPage(importedPage);
}
document.Close();
reader.Close();

The code can be further streamlined by using a PdfStamper instead of PdfCopy. Simply replace the lines from // Copy selected pages using PdfCopy onwards with

// Copy selected pages using PdfStamper
reader.SelectPages(pagesToKeep);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
stamper.Close();
reader.Close();

The latter variant keeps not only the pages in question but also document-level material such as global JavaScript and document-level file attachments. Whether you want that depends on your use case.
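
Since the question asks for regular expression matches rather than a plain substring test, the selection step can be sketched separately from the PDF handling. The following is only a minimal sketch: PageSelector, SelectMatchingPages and the sample pattern are hypothetical names chosen for illustration, and the strings in Main stand in for the text that PdfTextExtractor.GetTextFromPage would return for each page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PageSelector
{
    // Hypothetical helper: returns the 1-based numbers of all pages whose
    // extracted text matches the pattern. pageTexts[i] is assumed to hold
    // the text of page i + 1.
    public static List<int> SelectMatchingPages(IList<string> pageTexts, Regex pattern)
    {
        var pagesToKeep = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (pattern.IsMatch(pageTexts[i]))
            {
                pagesToKeep.Add(i + 1); // iText page numbers start at 1
            }
        }
        return pagesToKeep;
    }
}

static class Program
{
    static void Main()
    {
        // Stand-in strings for the extracted text of a 3-page PDF (assumption)
        var pageTexts = new List<string>
        {
            "Invoice No. 2015-0042",
            "Cover letter without a reference number",
            "special offer, ref 2015-0043"
        };

        // Example pattern (assumption): four digits, a hyphen, four digits
        var pattern = new Regex(@"\b\d{4}-\d{4}\b");

        List<int> pagesToKeep = PageSelector.SelectMatchingPages(pageTexts, pattern);
        Console.WriteLine(string.Join(", ", pagesToKeep)); // prints "1, 3"
    }
}
```

In the answer's first loop, the same idea amounts to replacing the IndexOf test with pattern.IsMatch(currentText) and keeping pagesToKeep.Add(page) unchanged. Extracted text does not always preserve the whitespace and line breaks seen on the page, so a pattern may need to tolerate such differences.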

Answer 2:

Thank you for your response, mkl. I answered my other post but forgot about this one. I was able to use the test case provided by Chris in my other (similar) post.

C# iTextSharp - Code overwriting instead of appending pages

With some minor tweaks I was able to get the solution below to work for my project.

Source: https://stackoverflow.com/questions/28654370/itext-c-sharp-read-pdf-for-regular-expression-match-extract-only-those-pages-to

Tags

pdf

itextsharp