PDFBox - Line / Rectangle extraction

人盡茶涼 提交于 2019-12-06 09:34:34

As far as I understand the requirements here, the OP works in a coordinate system with the origin in the upper left corner of the visible page (taking the page rotation into account), x coordinates increasing to the right, y coordinates increasing downwards, and the units being the PDF default user space units (usually 1/72 inch).

In this coordinate system he needs to extract (horizontal or vertical) lines in the form of

  • coordinates of the left / top end point and
  • the width / height.

Transforming LineCatcher results

The helper class LineCatcher he got from Tilman, on the other hand, does not take page rotation into account. Furthermore, it returns the bottom end point for vertical lines, not the top end point. Thus, a coordinate transformation has to be applied to of the LineCatcher results.

For this simply replace

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x = Double.toString(rect.getX());
    String y = Double.toString(page_height - rect.getY()) ;
    String w = Double.toString(rect.getWidth());
    String h = Double.toString(rect.getHeight());
    writeToFile(pageNum, x, y, w, h, osw);
}

by

int pageRotation = page.getRotation();
PDRectangle pageCropBox = page.getCropBox();

for(Rectangle2D rect:rectList) {
    String pageNum = Integer.toString(n + 1);
    String x, y, w, h;
    switch(pageRotation) {
    case 0:
        x = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        y = Double.toString(pageCropBox.getUpperRightY() - rect.getY() + rect.getHeight());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 90:
        x = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        y = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    case 180:
        x = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        y = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
        w = Double.toString(rect.getWidth());
        h = Double.toString(rect.getHeight());
        break;
    case 270:
        x = Double.toString(pageCropBox.getUpperRightY() - rect.getY() + rect.getHeight());
        y = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
        w = Double.toString(rect.getHeight());
        h = Double.toString(rect.getWidth());
        break;
    default:
        throw new IOException(String.format("Unsupported page rotation %d on page %d.", pageRotation, page));
    }
    writeToFile(pageNum, x, y, w, h, osw);
}

(ExtractLinesWithDir test testExtractLineRotationTestWithDir)

Relation to TextPosition.get?DirAdj() coordinates

The OP describes the coordinates by referring to the TextPosition class methods getXDirAdj() and getYDirAdj(). Indeed, these methods return coordinates in a coordinate system with the origin in the upper left page corner and y coordinates increasing downwards after rotating the page so that the text is drawn upright.

In case of the example document all the text is drawn so that it is upright after applying the page rotation. From this my understanding of the requirement written at the top has been derived.

The problem with using the TextPosition.get?DirAdj() values as coordinates globally, though, is that in documents with pages with text drawn in different directions, the collected text coordinates suddenly are relative to different coordinate systems. Thus, a general solution should not collect coordinates wildly like that. Instead it should determine a page orientation at first (e.g. the orientation given by the page rotation or the orientation shared by most of the text) and use coordinates in the fixed coordinate system given by that orientation plus an indication of the writing direction of the text piece in question.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!