iText C# Read pdf for regular expression match, extract only those pages to new pdf

Submitted by 一笑奈何 on 2019-12-11 20:54:21

Question


I'm having an issue reading an existing pdf for regular expression matches, then extracting those pages to a new pdf. I've run into some issues with this as a whole.

I've decided to clear my head and start again from scratch. I'm able to take a 3 page pdf and extract the pages individually into a new file using this code:

static void Main(string[] args)
{
    string srcFile = @"C:\Users\steve\Desktop\original.pdf";
    string dstFile = @"C:\Users\steve\Desktop\result.pdf";
    PdfReader reader = new PdfReader(srcFile);
    Document document = new Document();
    PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
    document.Open();
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
        copy.AddPage(importedPage);
    }
    document.Close();
}

This code works because the PdfCopy instance is created OUTSIDE the for loop. The issue I'm running into is that the only way I can seem to get the rest of the code (converting each page to text and finding regex matches) to work is to place that functionality, including the PdfCopy instance, inside the for loop.

Here's the code from my initial question: C# iTextSharp - Code overwriting instead of appending pages


Answer 1:


As @Paulo already proposed in a comment:

You have to select the pages with regex or whatever other way before entering the loop. Inside the loop only those pages will be added.

In code this could look like this:

string srcFile = @"C:\Users\steve\Desktop\original.pdf";
string dstFile = @"C:\Users\steve\Desktop\result.pdf";

PdfReader reader = new PdfReader(srcFile);
ICollection<int> pagesToKeep = new List<int>();

for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // Use the text extraction strategy of your choice here...
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

    // Use the content text test of your choice here...
    // IndexOf returns 0 for a match at the very start of the text, so test with >= 0
    if (currentText.IndexOf("special") >= 0)
    {
        pagesToKeep.Add(page);
    }
}

// Copy selected pages using PdfCopy
Document document = new Document();
PdfCopy copy = new PdfCopy(document, new FileStream(dstFile, FileMode.Create));
document.Open();
foreach (int page in pagesToKeep)
{
    PdfImportedPage importedPage = copy.GetImportedPage(reader, page);
    copy.AddPage(importedPage);
}
document.Close();
reader.Close();
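
Since the question asks for regular expression matches rather than a fixed substring, the content test can be swapped for a Regex check. The following is only a minimal sketch of that selection step, kept free of iTextSharp so it can be tried on its own. The class name PageSelector, the method SelectMatchingPages and the pageTexts list (one extracted string per page, in page order) are assumptions made for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PageSelector
{
    // Returns the 1-based numbers of the pages whose text matches the pattern.
    // Assumption: pageTexts[0] holds the extracted text of page 1, and so on.
    public static List<int> SelectMatchingPages(IList<string> pageTexts, Regex pattern)
    {
        var pagesToKeep = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            // Treat a missing page text as empty so IsMatch never receives null.
            if (pattern.IsMatch(pageTexts[i] ?? string.Empty))
            {
                pagesToKeep.Add(i + 1); // iText page numbers start at 1
            }
        }
        return pagesToKeep;
    }
}
```

Inside the extraction loop of the answer's code, the same idea amounts to replacing the IndexOf test with pattern.IsMatch(currentText), where pattern is a Regex created once before the loop.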

The PdfCopy code can be further streamlined by using a PdfStamper instead. Simply replace everything from the line // Copy selected pages using PdfCopy onwards with:

// Copy selected pages using PdfStamper
reader.SelectPages(pagesToKeep);
PdfStamper stamper = new PdfStamper(reader, new FileStream(dstFile, FileMode.Create, FileAccess.Write));
stamper.Close();

The latter variant keeps not only the pages in question but also document-level material, e.g. global JavaScript, document-level file attachments, etc. Whether or not you want that depends on your use case.
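
Putting the two passes of this answer together, a complete method for the PdfCopy variant might look like the sketch below. It assumes iTextSharp 5.x (the iTextSharp.text, iTextSharp.text.pdf and iTextSharp.text.pdf.parser namespaces). The names PdfPageExtractor and ExtractMatchingPages are hypothetical. The early return when nothing matches is an added precaution, since to my knowledge iTextSharp 5 throws a "The document has no pages" exception when an empty Document is closed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfPageExtractor
{
    // Copies every page whose extracted text matches the pattern from srcFile
    // to dstFile. Returns the number of pages copied and writes no file when
    // no page matches.
    public static int ExtractMatchingPages(string srcFile, string dstFile, Regex pattern)
    {
        PdfReader reader = new PdfReader(srcFile);
        try
        {
            // Pass 1: decide which pages to keep, before any output is opened.
            var pagesToKeep = new List<int>();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(
                    reader, page, new SimpleTextExtractionStrategy());
                if (pattern.IsMatch(text))
                {
                    pagesToKeep.Add(page);
                }
            }
            if (pagesToKeep.Count == 0)
            {
                return 0; // avoid closing a Document that has no pages
            }

            // Pass 2: a single PdfCopy, created once, receives all selected pages.
            using (FileStream output = new FileStream(dstFile, FileMode.Create))
            {
                Document document = new Document();
                PdfCopy copy = new PdfCopy(document, output);
                document.Open();
                foreach (int page in pagesToKeep)
                {
                    copy.AddPage(copy.GetImportedPage(reader, page));
                }
                document.Close();
            }
            return pagesToKeep.Count;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A call such as PdfPageExtractor.ExtractMatchingPages(srcFile, dstFile, new Regex(@"special")) would then reproduce the behaviour of the snippet above, with the substring test replaced by a pattern.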




Answer 2:


Thank you for your response mkl. I answered my other post but forgot about this one. I was able to use the test case provided by Chris in my other (similar) post.

C# iTextSharp - Code overwriting instead of appending pages

With some minor tweaks I was able to get the solution below to work for my project.



Source: https://stackoverflow.com/questions/28654370/itext-c-sharp-read-pdf-for-regular-expression-match-extract-only-those-pages-to
