Java: Apache PDFbox Extract highlighted text

↘锁芯ラ 提交于 2019-11-28 07:52:59
mkl

The code in the question Not able to read the exact text highlighted across the lines already illustrates most concepts to use for extracting text from limited content regions on a page with PDFBox.

Having studied this code, the OP still wondered in a comment:

But one thing I am confused about is QuadPoints instead of Rect. as you mentioned there in comment. What are this, can you explain it with some code lines or in simple words, as I am also facing the same problem of multi lines highlghts?

In general the area an annotation refers to is a rectangle:

Rect rectangle (Required) The annotation rectangle, defining the location of the annotation on the page in default user space units.

(from Table 164 – Entries common to all annotation dictionaries - in ISO 32000-1)

For some annotations types (e.g. text markups), this location value does not suffice because:

  • text to markup may be written at some odd angle but the rectangle type mentioned in the specification refers to rectangles with edges parallel to the page edges; and
  • text to markup may start anywhere in a line and end anywhere in another one, so the markup area is not rectangular at all but it is the union of multiple rectangular parts.

To cope with such annotation types, therefore, the PDF specification provides a more generic way to define areas:

QuadPoints array (Required) An array of 8 × n numbers specifying the coordinates of n quadrilaterals in default user space. Each quadrilateral shall encompasses a word or group of contiguous words in the text underlying the annotation. The coordinates for each quadrilateral shall be given in the order

x1 y1 x2 y2 x3 y3 x4 y4

specifying the quadrilateral’s four vertices in counterclockwise order (see Figure 64). The text shall be oriented with respect to the edge connecting points (x1, y1) and (x2, y2).

(from Table 179 – Additional entries specific to text markup annotations - in ISO 32000-1)

Thus, instead of the rectangle given by

PDRectangle rect = pdfAnnot.getRectangle();

in the code in the referenced question, you have to consider the quadrilaterals given by

COSArray quadsArray = (COSArray) pdfAnnot.getDictionary().getDictionaryObject(COSName getPDFName("QuadPoints"));

and define regions for the PDFTextStripperByArea stripper accordingly. Unfortunately PDFTextStripperByArea.addRegion expects a rectangle as parameter, not some generic quadrilateral. As text usually is printed horizontally or vertically, that should not pose too big a problem.

PS One warning concerning the specification of the QuadPoints, the order may differ in real-life PDFs, cf. the question PDF Spec vs Acrobat creation (QuadPoints).

I Hope this answer help everyone who is facing the same problem.

// PDF32000-2008
// 12.5.2 Annotation Dictionaries
// 12.5.6 Annotation Types
// 12.5.6.10 Text Markup Annotations
@SuppressWarnings({ "unchecked", "unused" })
public ArrayList<String> getHighlightedText(String filePath, int pageNumber) throws IOException {
    ArrayList<String> highlightedTexts = new ArrayList<>();
    // this is the in-memory representation of the PDF document.
    // this will load a document from a file.
    PDDocument document = PDDocument.load(filePath);
    // this represents all pages in a PDF document.
    List<PDPage> allPages =  document.getDocumentCatalog().getAllPages();
    // this represents a single page in a PDF document.
    PDPage page = allPages.get(pageNumber);
    // get  annotation dictionaries
    List<PDAnnotation> annotations = page.getAnnotations();

    for(int i=0; i<annotations.size(); i++) {
        // check subType 
        if(annotations.get(i).getSubtype().equals("Highlight")) {
            // extract highlighted text
            PDFTextStripperByArea stripperByArea = new PDFTextStripperByArea();

            COSArray quadsArray = (COSArray) annotations.get(i).getDictionary().getDictionaryObject(COSName.getPDFName("QuadPoints"));
            String str = null;

            for(int j=1, k=0; j<=(quadsArray.size()/8); j++) {

                COSFloat ULX = (COSFloat) quadsArray.get(0+k);
                COSFloat ULY = (COSFloat) quadsArray.get(1+k);
                COSFloat URX = (COSFloat) quadsArray.get(2+k);
                COSFloat URY = (COSFloat) quadsArray.get(3+k);
                COSFloat LLX = (COSFloat) quadsArray.get(4+k);
                COSFloat LLY = (COSFloat) quadsArray.get(5+k);
                COSFloat LRX = (COSFloat) quadsArray.get(6+k);
                COSFloat LRY = (COSFloat) quadsArray.get(7+k);

                k+=8;

                float ulx = ULX.floatValue() - 1;                           // upper left x.
                float uly = ULY.floatValue();                               // upper left y.
                float width = URX.floatValue() - LLX.floatValue();          // calculated by upperRightX - lowerLeftX.
                float height = URY.floatValue() - LLY.floatValue();         // calculated by upperRightY - lowerLeftY.

                PDRectangle pageSize = page.getMediaBox();
                uly = pageSize.getHeight() - uly;

                Rectangle2D.Float rectangle_2 = new Rectangle2D.Float(ulx, uly, width, height);
                stripperByArea.addRegion("highlightedRegion", rectangle_2);
                stripperByArea.extractRegions(page);
                String highlightedText = stripperByArea.getTextForRegion("highlightedRegion");

                if(j > 1) {
                    str = str.concat(highlightedText);
                } else {
                    str = highlightedText;
                }
            }
            highlightedTexts.add(str);
        }
    }
    document.close();

    return highlightedTexts;
}
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!