Question
I'm new to programming, and many posts here have helped me get to where I am. My intention is to go through the files in the directory "sourceDir" and look for a Regex match in each one. When a match is found, I want to create a new file with the match as its name. If the code later finds another file with the same match (so the output file already exists), it should add a new page to that document.
Right now the code works, except that instead of adding a new page it overwrites the first page of the document. NOTE: Every document in the directory is only one page!
string sourceDir = @"C:\Users\bob\Desktop\results\";
string destDir = @"C:\Users\bob\Desktop\results\final\";
string[] files = Directory.GetFiles(sourceDir);
foreach (string file in files)
{
    using (var pdfReader = new PdfReader(file.ToString()))
    {
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            var text = new StringBuilder();
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            var currentText =
                PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
            Regex reg = new Regex(@"ABCDEFG");
            MatchCollection matches = reg.Matches(currentText);
            foreach (Match m in matches)
            {
                string newFile = destDir + m.ToString() + ".pdf";
                if (!File.Exists(newFile))
                {
                    using (PdfReader reader = new PdfReader(File.ReadAllBytes(file)))
                    {
                        using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                        {
                            using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.Create)))
                            {
                                var importedPage = copy.GetImportedPage(reader, page);
                                doc.Open();
                                copy.AddPage(importedPage);
                                doc.Close();
                            }
                        }
                    }
                }
                else
                {
                    using (PdfReader reader = new PdfReader(File.ReadAllBytes(newFile)))
                    {
                        using (Document doc = new Document(reader.GetPageSizeWithRotation(page)))
                        {
                            using (PdfCopy copy = new PdfCopy(doc, new FileStream(newFile, FileMode.OpenOrCreate)))
                            {
                                var importedPage = copy.GetImportedPage(reader, page);
                                doc.Open();
                                copy.AddPage(importedPage);
                                doc.Close();
                            }
                        }
                    }
                }
            }
        }
    }
}
Answer 1:
Bruno did a great job explaining the problem and how to fix it, but since you've said that you are new to programming, and you've since posted a very similar and related question, I'm going to go a little deeper to hopefully help you.
First, let's write down the knowns:
- There's a directory full of PDFs
- Each PDF has only a single page
Then the objectives:
- Extract the text of each PDF
- Compare the extracted text with a pattern
- If there's a match, then using the match for a file name do one of:
  - If a file with that name exists, append the source PDF to it
  - If a file with that name doesn't exist, create a new file with the PDF
There are a couple of things that you need to know before proceeding. You tried to work in "append mode" by using FileMode.OpenOrCreate. It was a good guess but incorrect. The PDF format has both a beginning and an end, so "start here" and "end here". When you attempt to append another PDF (or anything, for that matter) to an existing file you are just writing past the "end here" section. At best, that's junk data that gets ignored, but more likely you'll end up with a corrupt PDF. The same is true of almost any file format. Two XML files concatenated are invalid because an XML document can only have one root element.
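To make the XML comparison concrete, here is a minimal sketch that uses only the .NET base class library (no iTextSharp involved). The helper name IsWellFormedXml and the sample markup are made up for this illustration. It shows that a document that parses fine on its own stops parsing once a second copy is glued onto its end, which is roughly what "append mode" does to a PDF.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration: true if the text parses as ONE XML document.
static bool IsWellFormedXml(string text)
{
    try
    {
        XDocument.Parse(text);
        return true;
    }
    catch (XmlException)
    {
        // e.g. "There are multiple root elements"
        return false;
    }
}

var single = "<report><page>1</page></report>";
// "Append mode": a second document written past the end of the first
var glued = single + single;

Console.WriteLine(IsWellFormedXml(single)); // True
Console.WriteLine(IsWellFormedXml(glued));  // False
```

A PDF reader is usually more forgiving than an XML parser, which is why the failure there tends to look like a silently ignored page rather than an exception.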
Second but related, iText/iTextSharp cannot edit existing files. This is very important. It can, however, create brand new files that happen to contain exact, or possibly modified, copies of other files. I can't stress enough how important this is.
Third, you are using a line that gets copied over and over again but is very wrong and can actually corrupt your data. For why it is bad, see https://stackoverflow.com/a/10191879/231316
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
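Here is a small sketch of what that line can do to your text. A .NET string is already Unicode, so there is nothing left to "convert"; pushing it through a narrower encoding can only lose characters. In this sketch Encoding.ASCII stands in for whatever legacy code page Encoding.Default happens to be on a given machine (that substitution is my assumption, made so the result is the same everywhere), and the helper name RoundTrip and the sample strings are invented for the example.

```csharp
using System;
using System.Text;

// Same shape as the problematic line, with the legacy encoding passed in.
static string RoundTrip(string text, Encoding legacy)
{
    // Characters the legacy encoding cannot represent are replaced right here
    byte[] legacyBytes = legacy.GetBytes(text);
    byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
    return Encoding.UTF8.GetString(utf8Bytes);
}

var plain = "Invoice 42";
var fancy = "Price: 100 € (Ω)";

// Plain ASCII text survives, so the line looks harmless in quick tests
Console.WriteLine(RoundTrip(plain, Encoding.ASCII) == plain); // True
// Anything outside the legacy encoding comes back changed
Console.WriteLine(RoundTrip(fancy, Encoding.ASCII) == fancy); // False
```

On runtimes where Encoding.Default is already UTF-8 the original line does nothing at all, so in the best case it is dead code and in the worst case it damages the extracted text.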
Fourth, you are using RegEx, which is an overly complicated way to perform a plain search. Maybe the code that you posted was just a sample, but if it wasn't I would recommend just using currentText.Contains("") or, if you need to ignore case, currentText.IndexOf("", StringComparison.InvariantCultureIgnoreCase), with your search term between the quotes. Giving you the benefit of the doubt, the code below assumes you have a more complex RegEx.
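As a rough sketch of those options side by side, using made-up sample text and a made-up search term: note that IndexOf returns a position rather than a bool, so you compare it against 0.

```csharp
using System;
using System.Text.RegularExpressions;

var currentText = "Order ref: abcdefg, page 1"; // made-up sample text
var term = "ABCDEFG";                            // made-up search term

// Case-sensitive: plain substring check
bool exact = currentText.Contains(term);

// Case-insensitive: IndexOf returns -1 when the term is absent
bool ignoreCase = currentText.IndexOf(term, StringComparison.InvariantCultureIgnoreCase) >= 0;

// The Regex equivalent, shown for comparison only; it pays off
// once the pattern is more than a literal string
bool viaRegex = Regex.IsMatch(currentText, Regex.Escape(term), RegexOptions.IgnoreCase);

Console.WriteLine($"{exact} {ignoreCase} {viaRegex}"); // False True True
```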
With all that, below is a full working example that should walk you through everything. Since we don't have access to your PDFs, the second section actually creates 100 sample PDFs with our search terms occasionally added to them. Your real code obviously wouldn't do this but we need common ground to work with you on. The third section is the search and merge feature that you are trying to do. Hopefully the comments in the code explain everything.
/**
* Step 1 - Variable Setup
*/
//This is the folder that we'll be basing all other directory paths on
var workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
//This folder will hold our PDFs with text that we're searching for
var folderPathContainingPdfsToSearch = Path.Combine(workingFolder, "Pdfs");
var folderPathContainingPdfsCombined = Path.Combine(workingFolder, "Pdfs Combined");
//Create our directories if they don't already exist
System.IO.Directory.CreateDirectory(folderPathContainingPdfsToSearch);
System.IO.Directory.CreateDirectory(folderPathContainingPdfsCombined);
var searchText1 = "ABC";
var searchText2 = "DEF";
/**
* Step 2 - Create sample PDFs
*/
//Create 100 sample PDFs
for (var i = 0; i < 100; i++) {
    using (var fs = new FileStream(Path.Combine(folderPathContainingPdfsToSearch, i.ToString() + ".pdf"), FileMode.Create, FileAccess.Write, FileShare.None)) {
        using (var doc = new Document()) {
            using (var writer = PdfWriter.GetInstance(doc, fs)) {
                doc.Open();
                //Add a title so we know what page we're on when we combine
                doc.Add(new Paragraph(String.Format("This is page {0}", i)));
                //Add various strings every once in a while.
                //(Yes, I know this isn't evenly distributed but I haven't
                // had enough coffee yet.)
                if (i % 10 == 3) {
                    doc.Add(new Paragraph(searchText1));
                } else if (i % 10 == 6) {
                    doc.Add(new Paragraph(searchText2));
                } else if (i % 10 == 9) {
                    doc.Add(new Paragraph(searchText1 + searchText2));
                } else {
                    doc.Add(new Paragraph("Blah blah blah"));
                }
                doc.Close();
            }
        }
    }
}
/**
* Step 3 - Search and merge
*/
//We'll search for two different strings just to add some spice
var reg = new Regex("(" + searchText1 + "|" + searchText2 + ")");
//Loop through each file in the directory
foreach (var filePath in Directory.EnumerateFiles(folderPathContainingPdfsToSearch, "*.pdf")) {
    using (var pdfReader = new PdfReader(filePath)) {
        for (var page = 1; page <= pdfReader.NumberOfPages; page++) {
            //Get the text from the page
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, new SimpleTextExtractionStrategy());
            //DO NOT DO THIS EVER!! See this for why https://stackoverflow.com/a/10191879/231316
            //currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            //Match our pattern against the extracted text
            var matches = reg.Matches(currentText);
            //Bail early if we can
            if (matches.Count == 0) {
                continue;
            }
            //Loop through each match
            foreach (Match m in matches) {
                //This is the file path that we want to target
                var destFile = Path.Combine(folderPathContainingPdfsCombined, m.ToString() + ".pdf");
                //If the file doesn't already exist then just copy the file and move on
                if (!File.Exists(destFile)) {
                    System.IO.File.Copy(filePath, destFile);
                    continue;
                }
                //The file exists so we're going to "append" the page.
                //However, writing to the end of a file in Append mode doesn't work,
                //that would be like trying to "add a file to a zip" by concatenating
                //two files. In this case, we're actually creating a brand new file
                //that "happens" to contain the original file and the matched file.
                //Instead of writing to disk for this new file we're going to keep it
                //in memory, delete the original file and write our new file
                //back onto the old file
                using (var ms = new MemoryStream()) {
                    //Use a wrapper helper provided by iText
                    var cc = new PdfConcatenate(ms);
                    //Open for writing
                    cc.Open();
                    //Import the existing file
                    using (var subReader = new PdfReader(destFile)) {
                        cc.AddPages(subReader);
                    }
                    //Import the matched file
                    //The OP stated a guarantee of only 1 page so we don't
                    //have to mess around with specifying which page to import.
                    //Also, PdfConcatenate closes the supplied PdfReader, so
                    //open a fresh reader instead of reusing the variable pdfReader.
                    using (var subReader = new PdfReader(filePath)) {
                        cc.AddPages(subReader);
                    }
                    //Close for writing
                    cc.Close();
                    //Erase our existing file
                    File.Delete(destFile);
                    //Write our new file
                    File.WriteAllBytes(destFile, ms.ToArray());
                }
            }
        }
    }
}
Answer 2:
I'll write this in pseudo code.
You do something like this:
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        // create single-page PDF
        new Document();
        new PdfCopy();
        document.Open();
        copy.addPage(singlePage);
        document.Close();
    }
}
This means that you are creating a single-page PDF every time the condition is met. Incidentally, you're overwriting existing files many times.
What you should do, is something like this:
// Create a document with as many pages as times a condition is met
new Document();
new PdfCopy();
document.Open();
// loop over different single-page documents
for () {
    // introduce a condition
    if (condition == met) {
        copy.addPage(singlePage);
    }
}
document.Close();
Now you are possibly adding more than one page to the new document you are creating with PdfCopy. Be careful: an exception can be thrown if the condition is never met, because the document is then closed without any pages in it.
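One way to get that shape in practice is to split the work into two passes: first decide which source files belong to which output document, then create each output once and add all of its pages between a single Open and Close. The sketch below covers only the first pass and uses plain .NET collections. The dictionary of extracted text stands in for the PdfTextExtractor results, the file names and pattern are made up, and the name GroupByMatch is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: maps each distinct match value to the source files containing it.
static Dictionary<string, List<string>> GroupByMatch(
    IDictionary<string, string> textByFile, Regex pattern)
{
    var groups = new Dictionary<string, List<string>>();
    foreach (var entry in textByFile.OrderBy(e => e.Key, StringComparer.Ordinal))
    {
        // Distinct() so a page that mentions the same value twice is only listed once
        var values = pattern.Matches(entry.Value).Cast<Match>().Select(m => m.Value).Distinct();
        foreach (var value in values)
        {
            if (!groups.TryGetValue(value, out var files))
            {
                files = new List<string>();
                groups[value] = files;
            }
            files.Add(entry.Key);
        }
    }
    return groups;
}

// Made-up stand-in for text extracted from three single-page PDFs
var textByFile = new Dictionary<string, string>
{
    ["1.pdf"] = "page with ABC",
    ["2.pdf"] = "nothing of interest",
    ["3.pdf"] = "ABC and DEF, then ABC again",
};

var groups = GroupByMatch(textByFile, new Regex("(ABC|DEF)"));
foreach (var group in groups)
{
    // Second pass would go here: one Document/PdfCopy per key, one AddPage per listed file
    Console.WriteLine($"{group.Key}.pdf <- {string.Join(", ", group.Value)}");
}
```

Because a key only exists when at least one file matched, every output document created in the second pass gets at least one page, which sidesteps the empty-document exception mentioned above.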
Source: https://stackoverflow.com/questions/27906454/c-sharp-itextsharp-code-overwriting-instead-of-appending-pages