Get text occurrences contained in a specified area with iTextSharp

Submitted by …衆ロ難τιáo~ on 2019-11-29 15:12:03

Question


Is it possible, using iTextSharp, to get all text occurrences contained in a specified area of a PDF document?

Thanks.


Answer 1:


First you need the actual coordinates of the rectangle you marked in red. At a glance, I'd say the x value of 144 (2 inches) is probably about right, but I would be surprised if the y value were 76, so you'll have to double-check.

Once you have the exact coordinates of the rectangle, you can use iText's text extraction functionality with a LocationTextExtractionStrategy, as is done in the ExtractPageContentArea example.

For the iTextSharp version of this example, see the C# port of the examples of chapter 15.

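// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
// RectangleJ takes (x, y, width, height) in points; reader is an open PdfReader.
System.util.RectangleJ rect = new System.util.RectangleJ(70, 80, 420, 500);
RenderFilter[] filter = {new RegionTextRenderFilter(rect)};
ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
string text = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);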
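
A note on the numbers: RectangleJ expects x, y, width and height, while a region is often read off a page as two corners, and often in inches rather than points. The following is only a minimal sketch of that arithmetic; the class RegionMath and its method names are hypothetical helpers, not part of iTextSharp.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: plain arithmetic only.
public static class RegionMath
{
    // PDF user space uses 72 points per inch by default.
    public const float PointsPerInch = 72f;

    public static float InchesToPoints(float inches) => inches * PointsPerInch;

    // Converts lower-left / upper-right corner coordinates (in points)
    // into the { x, y, width, height } order that RectangleJ expects.
    public static float[] CornersToXYWH(float llx, float lly, float urx, float ury)
    {
        if (urx < llx || ury < lly)
            throw new ArgumentException("Upper-right corner must not be below or left of lower-left corner.");
        return new[] { llx, lly, urx - llx, ury - lly };
    }
}
```

For instance, RegionMath.InchesToPoints(2f) gives the 144 points mentioned above, and the four values returned by CornersToXYWH can be passed in order to the RectangleJ constructor.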



Answer 2:


@BrunoLowagie gives an excellent answer, but something I really struggled with was getting the actual coordinates to use. I started out using Cursor Coordinates in Adobe Acrobat Pro.

From there I could read a coordinate in inches and convert it to DTP points (PostScript points) by multiplying the value by 72.

However, something was still not right: the Y value seemed way off. I then noticed that Adobe Acrobat measures coordinates in this view from the top left instead of the bottom left. This means that Y has to be converted by subtracting it from the page height.

I solved this in code like this:

// RectangleJ(x, y, width, height), all in points; only y needs to be flipped.
var rect = new RectangleJ(GetPostScriptPoints(4.19f),
    GetPostScriptPoints(GetInverseCoordinateInInches(pdfReader, 1, 1.42f)),
    GetPostScriptPoints(3.5f), GetPostScriptPoints(0.39f));

RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
var output = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);

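// Converts inches to PostScript points (1 inch = 72 points).
private float GetPostScriptPoints(float inch)
{
    return inch * 72;
}

// Converts a distance measured from the top of the page (in inches) into
// a distance measured from the bottom of the page (in inches).
// pageIndex is the 1-based page number.
private float GetInverseCoordinateInInches(PdfReader pdfReader, int pageIndex, float coordinateInInches)
{
    Rectangle mediabox = pdfReader.GetPageSize(pageIndex);
    return mediabox.Height / 72 - coordinateInInches;
}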
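
One thing worth checking with this approach is which edge of the box the Acrobat reading refers to. The y passed to RectangleJ is the lower edge of the region, so the code above fits a reading taken at the bottom of the box; if the reading is the top edge, the box height has to be subtracted as well. The sketch below shows that variant as plain arithmetic. AcrobatToPdf and its methods are hypothetical names, and the page height is passed in as a number rather than read from a PdfReader.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp: plain arithmetic only.
public static class AcrobatToPdf
{
    public const float PointsPerInch = 72f;

    // Distance from the bottom of the page, given a distance from the top (both in points).
    public static float FlipY(float pageHeightPoints, float yFromTopPoints)
        => pageHeightPoints - yFromTopPoints;

    // Returns { x, y, width, height } in points for a box whose left and TOP
    // edges were read in inches from the top-left corner of the page.
    // Assumption: topInches refers to the top edge of the box.
    public static float[] BoxFromTopLeftInches(
        float pageHeightPoints, float leftInches, float topInches,
        float widthInches, float heightInches)
    {
        float heightPoints = heightInches * PointsPerInch;
        float topFromBottom = FlipY(pageHeightPoints, topInches * PointsPerInch);
        return new[]
        {
            leftInches * PointsPerInch,
            topFromBottom - heightPoints, // lower edge of the box
            widthInches * PointsPerInch,
            heightPoints
        };
    }
}
```

As an illustration with assumed values: on a page 792 points high, a box 2 inches wide and 0.5 inches tall whose top-left corner sits 1 inch from the left and 1 inch from the top comes out as x = 72, y = 684, width = 144, height = 36.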

This worked, but I think it looks a little messy. I then used the Prepare Form tool in Adobe Acrobat Pro, and there the Y coordinate showed up correctly when looking at Text Field Properties. It could also express the box in points right away.

This means I could write code like this instead:

var rect = new RectangleJ(301.68f, 738f, 252f, 28.08f);

RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
ITextExtractionStrategy strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
var output = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);

This was a lot cleaner and faster, so this is the way I chose to do it in the end.

See this answer if you would like to get a value from a specific location for every page in the document:

https://stackoverflow.com/a/20959388/3850405



Source: https://stackoverflow.com/questions/20606467/get-text-occurrences-contained-in-a-specified-area-with-itextsharp
