When I extract text from a PDF file using iText I am getting values from previous pages

Question

I am trying to extract a block of text from a specific location from each page in a multiple page PDF file.

I have the location of the text, and I am able to extract it correctly on the first page. However, on every page after the first, the extracted text seems to accumulate.

For example, if the text value on page 1 is "A", on page 2 is "B" and on page 3 is "C", then I receive the following values in my output string on each iteration of my for loop:

Loop1 : output = A

Loop2 : output = B A

Loop3 : output = C B A

I am using iTextSharp in my project, written in C#.

Any help would be appreciated.

var reader = new PdfReader(foregroundFile);

RectangleJ customerIdRectangle = new RectangleJ(0, 495, 108, 27);
RenderFilter[] filters = new RenderFilter[1];
LocationTextExtractionStrategy regionFilter = new LocationTextExtractionStrategy();
filters[0] = new RegionTextRenderFilter(customerIdRectangle);
FilteredTextRenderListener strategy = new FilteredTextRenderListener(regionFilter, filters);

for (int i = 1; i <= reader.NumberOfPages; i++)
{
    string output = "";
    output = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    Console.WriteLine(output);
}

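Answer 1:

The LocationTextExtractionStrategy collects the text chunks it is given and returns all of them when asked for its result. Because your code creates the strategy once and reuses it for every page, the chunks from earlier pages are still in it when later pages are processed. Create a new strategy for each page instead. Please adapt your code like this: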

var reader = new PdfReader(foregroundFile);

RectangleJ customerIdRectangle = new RectangleJ(0, 495, 108, 27);

for (int i = 1; i <= reader.NumberOfPages; i++)
{
    // Build a fresh strategy for every page so nothing is carried over from earlier pages.
    RenderFilter[] filters = new RenderFilter[1];
    LocationTextExtractionStrategy regionFilter = new LocationTextExtractionStrategy();
    filters[0] = new RegionTextRenderFilter(customerIdRectangle);
    FilteredTextRenderListener strategy = new FilteredTextRenderListener(regionFilter, filters);
    string output = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    Console.WriteLine(output);
}
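
To illustrate why creating the strategy inside the loop makes a difference, here is a toy sketch with no iText dependency. AccumulatingStrategy is a hypothetical stand-in, not an iText class; it only mimics the fact that a strategy object keeps every chunk it has been fed. It also ignores the position-based ordering that the real strategy applies, so the order of the accumulated values may differ from the output shown in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a text extraction strategy that buffers chunks.
// It is NOT part of iText; it only models "state survives between calls".
class AccumulatingStrategy
{
    private readonly List<string> chunks = new List<string>();

    // Called once per piece of text found on a page.
    public void RenderText(string text)
    {
        chunks.Add(text);
    }

    // Returns everything collected so far; nothing is ever cleared.
    public string GetResultantText()
    {
        return string.Join(" ", chunks);
    }
}

static class Demo
{
    static void Main()
    {
        string[] pages = { "A", "B", "C" };

        // One shared instance: results grow with every page.
        var shared = new AccumulatingStrategy();
        foreach (string page in pages)
        {
            shared.RenderText(page);
            Console.WriteLine("shared: " + shared.GetResultantText());
        }

        // One instance per page: each result contains only that page's text.
        foreach (string page in pages)
        {
            var fresh = new AccumulatingStrategy();
            fresh.RenderText(page);
            Console.WriteLine("fresh:  " + fresh.GetResultantText());
        }
    }
}
```

The shared instance prints a longer string on each iteration, while the per-page instances print a single value each, which is the behaviour the adapted loop above relies on.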

Source: https://stackoverflow.com/questions/20959292/when-i-extract-text-from-a-pdf-file-using-itext-i-am-getting-values-from-previou

Tags

pdf

itextsharp

itext