Read columns of PDF in C# using ItextSharp

问题

In my progam I extracted text from a PDF file and it works well. ItextSharp extracts text from PDF line by line. However, when a PDF file contains 2 columns, the extracted text is not ok as in each line joins two columns.

My problem is: How can I extract text column by column?

Below is my code. PDF files are Arabic. I'm sorry my English is not so good.

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string[] words;
string line;

for (int i = 1; i <= intPageNum; i++)
{
    text = PdfTextExtractor.GetTextFromPage(reader, i, 
               new LocationTextExtractionStrategy());

    words = text.Split('\n');
    for (int j = 0, len = words.Length; j < len; j++)
    {
        line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
        // other things here
    }

    // other things here
}

回答1:

You may want to use RegionTextRenderFilter to restrict a column region then use LocationTextExtractionStrategy to extract the text. However this requires prior knowledge to the PDF file your are parsing, i.e. you need information about the column's position and size.

In more details, you need to pass in the coordinates of your column to define a rectangle, then extract the text from that rectangle. A sample will be like this:

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;    

private string GetColumnText(float llx, float lly, float urx, float ury)
{
    // reminder, parameters are in points, and 1 in = 2.54 cm = 72 points
    var rect = new iTextSharp.text.Rectangle(llx, lly, urx, ury);

    var renderFilter = new RenderFilter[1];
    renderFilter[0] = new RegionTextRenderFilter(rect);

    var textExtractionStrategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
                                           renderFilter);

    var text = PdfTextExtractor.GetTextFromPage(reader, intPageNum,
                                                textExtractionStrategy);

    return text;
}

Here is another post discussing what you want, you may want to check as well: iTextSharp - Reading PDF with 2 columns. But they didn't hit the solution either :(

来源：https://stackoverflow.com/questions/25498598/read-columns-of-pdf-in-c-sharp-using-itextsharp

标签

itextsharp