Read columns of PDF in C# using ItextSharp

北慕城南 提交于 2019-12-13 05:19:05

问题


In my progam I extracted text from a PDF file and it works well. ItextSharp extracts text from PDF line by line. However, when a PDF file contains 2 columns, the extracted text is not ok as in each line joins two columns.

My problem is: How can I extract text column by column?

Below is my code. PDF files are Arabic. I'm sorry my English is not so good.

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;
string[] words;
string line;

for (int i = 1; i <= intPageNum; i++)
{
    text = PdfTextExtractor.GetTextFromPage(reader, i, 
               new LocationTextExtractionStrategy());

    words = text.Split('\n');
    for (int j = 0, len = words.Length; j < len; j++)
    {
        line = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(words[j]));
        // other things here
    }

    // other things here
}

回答1:


You may want to use RegionTextRenderFilter to restrict a column region then use LocationTextExtractionStrategy to extract the text. However this requires prior knowledge to the PDF file your are parsing, i.e. you need information about the column's position and size.

In more details, you need to pass in the coordinates of your column to define a rectangle, then extract the text from that rectangle. A sample will be like this:

PdfReader reader = new PdfReader(@"D:\test pdf\Blood Journal.pdf");
int intPageNum = reader.NumberOfPages;    

private string GetColumnText(float llx, float lly, float urx, float ury)
{
    // reminder, parameters are in points, and 1 in = 2.54 cm = 72 points
    var rect = new iTextSharp.text.Rectangle(llx, lly, urx, ury);

    var renderFilter = new RenderFilter[1];
    renderFilter[0] = new RegionTextRenderFilter(rect);

    var textExtractionStrategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
                                           renderFilter);

    var text = PdfTextExtractor.GetTextFromPage(reader, intPageNum,
                                                textExtractionStrategy);

    return text;
}

Here is another post discussing what you want, you may want to check as well: iTextSharp - Reading PDF with 2 columns. But they didn't hit the solution either :(



来源:https://stackoverflow.com/questions/25498598/read-columns-of-pdf-in-c-sharp-using-itextsharp

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!