PDF text extraction from given coordinates

情话喂你 2020-11-27 10:43

I would like to extract text from a region of a PDF page, specified by coordinates, using Ghostscript.

Can anyone help me out?

3 Answers
  •  粉色の甜心
    2020-11-27 11:18

    Debenu Quick PDF Library can extract text from a defined area on a page. The SetTextExtractionArea function takes the coordinates of the area's left and top edges, followed by its width and height:

    • Left = The horizontal coordinate of the left edge of the area
    • Top = The vertical coordinate of the top edge of the area
    • Width = The width of the area
    • Height = The height of the area

    Call the GetPageText function immediately afterwards to extract the text from that defined area.

    Here's an example using C# (though the library is multi-platform and can be used with many different programming languages):

    // DPL is assumed to be an already initialized Debenu Quick PDF Library instance.
    DPL.LoadFromFile(@"Sample.pdf", "");
    DPL.SetOrigin(1); // Moves the 0,0 coordinate to the top left of the page (the default is bottom left)
    DPL.SetTextExtractionArea(35, 35, 229, 30); // Left, Top, Width, Height
    string ExtractedContent = DPL.GetPageText(8);
    Console.WriteLine(ExtractedContent);
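
If the coordinates you have are measured from the top of the page but you prefer to leave the origin at its default (bottom left), the top edge can be converted by hand. The following is only a sketch of that arithmetic: TopEdgeFromBottom is a hypothetical helper, not a library function, and the 792 point page height (US Letter) is an assumption you would replace with the real page height. It uses C# 9 top-level statements.

```csharp
using System;

// Hypothetical helper (not part of Debenu Quick PDF Library): converts the
// vertical position of an edge measured down from the top of the page into
// the same position measured up from the bottom of the page.
static double TopEdgeFromBottom(double distanceFromTop, double pageHeightPts)
{
    if (distanceFromTop < 0 || distanceFromTop > pageHeightPts)
        throw new ArgumentOutOfRangeException(nameof(distanceFromTop));
    return pageHeightPts - distanceFromTop;
}

// Assumption: a US Letter page, 612 x 792 points.
const double letterHeight = 792;

// The area from the example above: Left = 35, Top = 35, Width = 229, Height = 30.
double left = 35, top = 35, width = 229, height = 30;

double topEdge = TopEdgeFromBottom(top, letterHeight); // 792 - 35 = 757
double bottomEdge = topEdge - height;                  // 757 - 30 = 727

Console.WriteLine($"Left={left}, Top={topEdge}, Width={width}, Height={height}");
Console.WriteLine($"Bottom edge at y={bottomEdge}");
```

Left, Width and Height are unchanged by the conversion; only the vertical coordinate of the top edge depends on where the origin sits.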
    

    Depending on the option passed to it, GetPageText can return either just the text located in that area, or the text together with information about its font, such as name, color and size.
