C# iTextSharp - Code overwriting instead of appending pages

Submitted by 安稳与你 on 2019-12-02 04:18:23
Chris Haas

Bruno did a great job explaining the problem and how to fix it, but since you've said that you are new to programming and you've also posted a very similar, related question, I'm going to go a little deeper to hopefully help you.

First, let's write down the knowns:

  1. There's a directory full of PDFs
  2. Each PDF has only a single page

Then the objectives:

  1. Extract the text of each PDF
  2. Compare the extracted text with a pattern
  3. If there's a match, then using the match for a file name do one of:
    1. If a file with that name exists, append the source PDF to it
    2. If a file with that name doesn't exist, create a new file with the PDF
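
Objectives 2 and 3 amount to grouping source files by the text they match. Here is a minimal sketch of that grouping step, assuming the text has already been extracted from each PDF. The names `MergePlanner` and `PlanMerges` are made up for illustration and are not part of iTextSharp. Unlike a per-match loop, this sketch adds a source file to a given output at most once.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MergePlanner
{
    // Maps each distinct match (the future output file name) to the source
    // files whose extracted text contained it, in the order they were seen.
    public static Dictionary<string, List<string>> PlanMerges(
        IEnumerable<KeyValuePair<string, string>> extractedTextByFile,
        Regex pattern)
    {
        var plan = new Dictionary<string, List<string>>();
        foreach (var entry in extractedTextByFile)
        {
            foreach (Match m in pattern.Matches(entry.Value))
            {
                List<string> sources;
                if (!plan.TryGetValue(m.Value, out sources))
                {
                    // First time we see this match: it will become a new file
                    sources = new List<string>();
                    plan[m.Value] = sources;
                }

                // Hypothetical design choice: list each source once per output,
                // even if the pattern matches several times in the same text
                if (!sources.Contains(entry.Key))
                {
                    sources.Add(entry.Key);
                }
            }
        }
        return plan;
    }
}
```

Each key of the returned dictionary would become one output PDF, and its list gives the pages to put in it.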

There are a couple of things that you need to know before proceeding. You tried to work in "append mode" by using FileMode.OpenOrCreate. It was a good guess but incorrect. The PDF format has both a beginning and an end, so "start here" and "end here". When you attempt to append another PDF (or anything for that matter) to an existing file you are just writing past the "end here" section. At best, that's junk data that gets ignored, but more likely you'll end up with a corrupt PDF. The same is true of almost any file format. Two XML files concatenated are invalid because an XML document can only have one root element.
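
The XML comparison is easy to check for yourself. Here is a small sketch using only the .NET base class library; `ConcatenationDemo` and `IsWellFormedXml` are names invented for this illustration.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class ConcatenationDemo
{
    // Returns true when the text parses as a single well-formed XML document.
    public static bool IsWellFormedXml(string text)
    {
        try
        {
            XDocument.Parse(text);
            return true;
        }
        catch (XmlException)
        {
            // Thrown, among other cases, when a second root element is found
            return false;
        }
    }
}
```

Calling `ConcatenationDemo.IsWellFormedXml("<a/>")` returns true, while `ConcatenationDemo.IsWellFormedXml("<a/>" + "<b/>")` returns false. Two valid files glued together are not a valid file, and PDFs behave the same way.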

Second but related, iText/iTextSharp cannot edit existing files. This is very important. It can, however, create brand new files that happen to have the exact or possibly modified versions of other files. I don't know if I can stress how important this is.

Third, you are using a line that gets copied over and over again but is very wrong and can actually corrupt your data. For why it is bad, read https://stackoverflow.com/a/10191879/231316.

currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
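
To see the damage without depending on a particular machine's Encoding.Default, the sketch below performs the same round trip with the legacy encoding passed in explicitly. `EncodingRoundTrip` and `RoundTrip` are hypothetical names, and using Encoding.ASCII as a stand-in for a legacy code page that lacks a character is an assumption made purely for the demonstration.

```csharp
using System.Text;

public static class EncodingRoundTrip
{
    // Same shape as the line above, but with the "default" encoding supplied
    // by the caller so the result does not depend on the machine's settings.
    public static string RoundTrip(string text, Encoding legacy)
    {
        // Any character the legacy encoding cannot represent is replaced here...
        byte[] legacyBytes = legacy.GetBytes(text);

        // ...so converting to UTF-8 afterwards cannot bring it back
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

With this stand-in, `EncodingRoundTrip.RoundTrip("€5", Encoding.ASCII)` comes back as "?5", while plain ASCII text comes back unchanged. The string was already correct before the round trip; the conversion can only lose information.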

Fourth, you are using RegEx, which is an overly complicated way to perform a simple search. Maybe the code that you posted was just a sample, but if it wasn't I would recommend just using currentText.Contains("") or, if you need to ignore case, currentText.IndexOf( "", StringComparison.InvariantCultureIgnoreCase ), which returns -1 when the text isn't found. Giving you the benefit of the doubt, the code below assumes you have a more complex RegEx.
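
As a rough sketch of those two alternatives, wrapped in helpers so they can be tried in isolation (`TextSearch` and both method names are invented here, and `term` stands for whatever literal you are actually looking for):

```csharp
using System;

public static class TextSearch
{
    // Case-sensitive literal search, no Regex involved.
    public static bool ContainsLiteral(string text, string term)
    {
        return text.Contains(term);
    }

    // Case-insensitive literal search. IndexOf returns -1 when the term
    // is absent, so any value >= 0 means it was found.
    public static bool ContainsIgnoreCase(string text, string term)
    {
        return text.IndexOf(term, StringComparison.InvariantCultureIgnoreCase) >= 0;
    }
}
```

For example, `TextSearch.ContainsLiteral("Invoice abc-123", "ABC")` is false but `TextSearch.ContainsIgnoreCase("Invoice abc-123", "ABC")` is true.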

With all that, below is a full working example that should walk you through everything. Since we don't have access to your PDFs, the second section actually creates 100 sample PDFs with our search terms occasionally added to them. Your real code obviously wouldn't do this but we need common ground to work with you on. The third section is the search and merge feature that you are trying to do. Hopefully the comments in the code explain everything.

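//Namespaces this example relies on (iTextSharp 5.x). The Path alias is there
//because some 5.5.x builds also define iTextSharp.text.pdf.parser.Path.
using System;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using Path = System.IO.Path;

/**
 * Step 1 - Variable Setup
 */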

//This is the folder that we'll be basing all other directory paths on
var workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);

//This folder will hold our PDFs with text that we're searching for
var folderPathContainingPdfsToSearch = Path.Combine(workingFolder, "Pdfs");

//This folder will hold the combined PDFs, one per matched search term
var folderPathContainingPdfsCombined = Path.Combine(workingFolder, "Pdfs Combined");

//Create our directories if they don't already exist
System.IO.Directory.CreateDirectory(folderPathContainingPdfsToSearch);
System.IO.Directory.CreateDirectory(folderPathContainingPdfsCombined);

var searchText1 = "ABC";
var searchText2 = "DEF";

/**
 * Step 2 - Create sample PDFs
 */

//Create 100 sample PDFs
for (var i = 0; i < 100; i++) {
    using (var fs = new FileStream(Path.Combine(folderPathContainingPdfsToSearch, i.ToString() + ".pdf"), FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var doc = new Document()) {
            using (var writer = PdfWriter.GetInstance(doc, fs)) {
                doc.Open();

                //Add a title so we know what page we're on when we combine
                doc.Add(new Paragraph(String.Format("This is page {0}", i)));

                //Add various strings every once in a while.
                //(Yes, I know this isn't evenly distributed but I haven't
                // had enough coffee yet.)
                if (i % 10 == 3) {
                    doc.Add(new Paragraph(searchText1));
                } else if (i % 10 == 6) {
                    doc.Add(new Paragraph(searchText2));
                } else if (i % 10 == 9) {
                    doc.Add(new Paragraph(searchText1 + searchText2));
                } else {
                    doc.Add(new Paragraph("Blah blah blah"));
                }

                doc.Close();
            }
        }
    }
}

/**
 * Step 3 - Search and merge
 */


//We'll search for two different strings just to add some spice
var reg = new Regex("(" + searchText1 + "|" + searchText2 + ")");

//Loop through each file in the directory
foreach (var filePath in Directory.EnumerateFiles(folderPathContainingPdfsToSearch, "*.pdf")) {
    using (var pdfReader = new PdfReader(filePath)) {
        for (var page = 1; page <= pdfReader.NumberOfPages; page++) {

            //Get the text from the page
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy());

            //(For a simple literal search, the Contains/IndexOf check described
            // above could replace the Regex used below.)

            //DO NOT DO THIS EVER!! See this for why https://stackoverflow.com/a/10191879/231316
            //currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));

            //Match our pattern against the extracted text
            var matches = reg.Matches(currentText);

            //Bail early if we can
            if (matches.Count == 0) {
                continue;
            }

            //Loop through each match
            foreach (var m in matches) {

                //This is the file path that we want to target
                var destFile = Path.Combine(folderPathContainingPdfsCombined, m.ToString() + ".pdf");

                //If the file doesn't already exist then just copy the file and move on
                if (!File.Exists(destFile)) {
                    System.IO.File.Copy(filePath, destFile);
                    continue;
                }

                //The file exists so we're going to "append" the page.
                //However, writing to the end of the file in Append mode doesn't work;
                //that would be like trying to "add a file to a zip" by concatenating
                //two files. In this case, we're actually creating a brand new file
                //that "happens" to contain the original file and the matched file.
                //Instead of writing to disk for this new file we're going to keep it
                //in memory, delete the original file and write our new file
                //back onto the old file
                using (var ms = new MemoryStream()) {

                    //Use a wrapper helper provided by iText
                    var cc = new PdfConcatenate(ms);

                    //Open for writing
                    cc.Open();

                    //Import the existing file
                    using (var subReader = new PdfReader(destFile)) {
                        cc.AddPages(subReader);
                    }

                    //Import the matched file
                    //The OP stated a guarantee of only 1 page so we don't
                    //have to mess around with specifying which page to import.
                    //Also, PdfConcatenate closes the supplied PdfReader, so open
                    //a fresh reader here instead of reusing the outer pdfReader.
                    using (var subReader = new PdfReader(filePath)) {
                        cc.AddPages(subReader);
                    }

                    //Close for writing
                    cc.Close();

                    //Erase our existing file
                    File.Delete(destFile);

                    //Write our new file
                    File.WriteAllBytes(destFile, ms.ToArray());
                }
            }
        }
    }
}

I'll write this in pseudo code.

You do something like this:

// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        // create single-page PDF
        new Document();
        new PdfCopy();
        document.Open();
        copy.addPage(singlePage);
        document.Close();
    }
}

This means that you are creating a single-page PDF every time the condition is met. Incidentally, you're overwriting existing files many times.

What you should do, is something like this:

// Create a document with as many pages as times a condition is met
new Document();
new PdfCopy();
document.Open();
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        copy.addPage(singlePage);
    }
}
document.Close();

Now you are possibly adding more than one page to the new document you are creating with PdfCopy. Be careful: an exception can be thrown if the condition is never met.
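
In C#, the second pattern might look roughly like the sketch below. It assumes iTextSharp 5.x and single-page sources. `SinglePageMerger`, `Merge` and the `shouldInclude` callback are hypothetical names standing in for your own condition. It checks the condition before creating the document, which avoids the exception mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class SinglePageMerger
{
    // Copies page 1 of every accepted source PDF into ONE new file.
    // Returns the number of pages written; writes nothing when none match.
    public static int Merge(IEnumerable<string> sourcePaths, string outputPath,
                            Func<string, bool> shouldInclude)
    {
        var selected = new List<string>();
        foreach (var path in sourcePaths)
        {
            if (shouldInclude(path))
            {
                selected.Add(path);
            }
        }

        // Closing a Document that has no pages throws, so stop before creating it
        if (selected.Count == 0)
        {
            return 0;
        }

        using (var fs = new FileStream(outputPath, FileMode.Create, FileAccess.Write, FileShare.None))
        using (var document = new Document())
        using (var copy = new PdfCopy(document, fs))
        {
            // Opened once, outside the loop
            document.Open();

            foreach (var path in selected)
            {
                using (var reader = new PdfReader(path))
                {
                    // Import page 1 of the source and add it to the output
                    copy.AddPage(copy.GetImportedPage(reader, 1));
                }
            }

            // Closed once, after every page has been added
            document.Close();
        }
        return selected.Count;
    }
}
```

The only structural difference from the first pattern is where the document is created, opened and closed: once around the loop rather than once per iteration.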
