Question
I am trying to extract text coordinates and line (or rectangle) coordinates from a PDF.
The TextPosition class has getXDirAdj() and getYDirAdj() methods, which transform coordinates according to the direction of the text piece that the respective TextPosition object represents (corrected based on a comment from @mkl).
The final output is consistent, irrespective of the page rotation.
The output coordinates must be relative to an origin (X0, Y0) at the top-left corner of the page.
This is a slight modification of the solution by @Tilman Hausherr: the y coordinates are inverted (height - y) to keep them consistent with the coordinates from the text extraction process, and the output is written to a CSV file.
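// Imports for PDFBox 2.x (PDDocument.load is the 2.x loading API)
import java.awt.geom.GeneralPath;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;

import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

public class LineCatcher extends PDFGraphicsStreamEngine
{
    // Path currently being built by the content stream operators
    private static final GeneralPath linePath = new GeneralPath();
    // Bounding boxes of all stroked paths on the page being processed
    private static ArrayList<Rectangle2D> rectList = new ArrayList<Rectangle2D>();
    private int clipWindingRule = -1;
    private static String headerRecord = "Text|Page|x|y|width|height|space|font";

    public LineCatcher(PDPage page)
    {
        super(page);
    }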
    public static void main(String[] args) throws IOException
    {
        // Expected arguments: input directory, input file, output directory, output file
        if (args.length != 4)
        {
            usage();
        }
        else
        {
            PDDocument document = null;
            FileOutputStream fop = null;
            File file;
            Writer osw = null;
            int numPages;
            double page_height;
            try
            {
                document = PDDocument.load(new File(args[0], args[1]));
                numPages = document.getNumberOfPages();
                file = new File(args[2], args[3]);
                // FileOutputStream creates the file if it does not exist yet
                fop = new FileOutputStream(file);
                osw = new OutputStreamWriter(fop, "UTF8");
                osw.write(headerRecord + System.lineSeparator());
                System.out.println("Line Processing numPages:" + numPages);
                for (int n = 0; n < numPages; n++) {
                    System.out.println("Line Processing page:" + n);
                    rectList = new ArrayList<Rectangle2D>();
                    PDPage page = document.getPage(n);
                    page_height = page.getCropBox().getUpperRightY();
                    LineCatcher lineCatcher = new LineCatcher(page);
                    lineCatcher.processPage(page);
                    try {
                        for (Rectangle2D rect : rectList) {
                            String pageNum = Integer.toString(n + 1);
                            String x = Double.toString(rect.getX());
                            // invert y so that it increases downwards from the top of the page
                            String y = Double.toString(page_height - rect.getY());
                            String w = Double.toString(rect.getWidth());
                            String h = Double.toString(rect.getHeight());
                            writeToFile(pageNum, x, y, w, h, osw);
                        }
                        rectList = null;
                        page = null;
                        lineCatcher = null;
                    }
                    catch (IOException io) {
                        // keep the original exception as the cause
                        throw new IOException("Failed to Parse document for line processing. Incorrect document format. Page:" + n, io);
                    }
                }
            }
            catch (IOException io) {
                // keep the original exception as the cause
                throw new IOException("Failed to Parse document for line processing. Incorrect document format.", io);
            }
            finally
            {
                if (osw != null) {
                    osw.close();
                }
                if (document != null)
                {
                    document.close();
                }
            }
        }
    }

    // Writes one pipe-delimited record; "^" and "marker-only" flag the row as a line, not text
    private static void writeToFile(String pageNum, String x, String y, String w, String h, Writer osw) throws IOException {
        String c = "^" + "|" +
                pageNum + "|" +
                x + "|" +
                y + "|" +
                w + "|" +
                h + "|" +
                "999" + "|" +
                "marker-only";
        osw.write(c + System.lineSeparator());
    }

    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException
    {
        // to ensure that the path is created in the right direction, we have to create
        // it by combining single lines instead of creating a simple rectangle
        linePath.moveTo((float) p0.getX(), (float) p0.getY());
        linePath.lineTo((float) p1.getX(), (float) p1.getY());
        linePath.lineTo((float) p2.getX(), (float) p2.getY());
        linePath.lineTo((float) p3.getX(), (float) p3.getY());
        // close the subpath instead of adding the last line so that a possible set line
        // cap style isn't taken into account at the "beginning" of the rectangle
        linePath.closePath();
    }

    @Override
    public void drawImage(PDImage pdi) throws IOException
    {
    }

    @Override
    public void clip(int windingRule) throws IOException
    {
        // the clipping path will not be updated until the succeeding painting operator is called
        clipWindingRule = windingRule;
    }

    @Override
    public void moveTo(float x, float y) throws IOException
    {
        linePath.moveTo(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException
    {
        linePath.lineTo(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException
    {
        linePath.curveTo(x1, y1, x2, y2, x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException
    {
        return linePath.getCurrentPoint();
    }

    @Override
    public void closePath() throws IOException
    {
        linePath.closePath();
    }

    @Override
    public void endPath() throws IOException
    {
        if (clipWindingRule != -1)
        {
            linePath.setWindingRule(clipWindingRule);
            getGraphicsState().intersectClippingPath(linePath);
            clipWindingRule = -1;
        }
        linePath.reset();
    }

    @Override
    public void strokePath() throws IOException
    {
        // only stroked paths are recorded, as their bounding boxes
        rectList.add(linePath.getBounds2D());
        linePath.reset();
    }

    @Override
    public void fillPath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException
    {
        linePath.reset();
    }

    @Override
    public void shadingFill(COSName cosn) throws IOException
    {
    }
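
    /**
     * This will print the usage for this program.
     * main() expects four arguments: a directory and file name for the input PDF,
     * and a directory and file name for the output file.
     */
    private static void usage()
    {
        System.err.println( "Usage: java " + LineCatcher.class.getName() + " <input-dir> <input-pdf> <output-dir> <output-file>");
    }
}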
I was using the PDFGraphicsStreamEngine class to extract line and rectangle coordinates. The coordinates of the lines and rectangles do not align with the coordinates of the text.
[Figure legend: green = text; red = line coordinates obtained as is; black = expected coordinates (obtained after applying a transformation to the output).]
I tried the setRotation() method to correct for the rotation before running the line extraction. However, the results are not consistent.
What are the possible options to get the rotation and get a consistent output of the Line / Rectangle coordinates using PDFBox?
Answer 1:
As far as I understand the requirements here, the OP works in a coordinate system with the origin in the upper left corner of the visible page (taking the page rotation into account), x coordinates increasing to the right, y coordinates increasing downwards, and the units being the PDF default user space units (usually 1/72 inch).
In this coordinate system he needs to extract (horizontal or vertical) lines in the form of
- coordinates of the left / top end point and
- the width / height.
Transforming LineCatcher results
The helper class LineCatcher he got from Tilman, on the other hand, does not take page rotation into account. Furthermore, it returns the bottom end point for vertical lines, not the top end point. Thus, a coordinate transformation has to be applied to the LineCatcher results.
For this simply replace
    for(Rectangle2D rect:rectList) {
        String pageNum = Integer.toString(n + 1);
        String x = Double.toString(rect.getX());
        String y = Double.toString(page_height - rect.getY()) ;
        String w = Double.toString(rect.getWidth());
        String h = Double.toString(rect.getHeight());
        writeToFile(pageNum, x, y, w, h, osw);
    }
by
    // requires: import org.apache.pdfbox.pdmodel.common.PDRectangle;
    int pageRotation = page.getRotation();
    PDRectangle pageCropBox = page.getCropBox();
    for(Rectangle2D rect:rectList) {
        String pageNum = Integer.toString(n + 1);
        String x, y, w, h;
        switch(pageRotation) {
        case 0:
            x = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
            // top edge: crop box top minus the upper edge (y + height) of the rectangle
            y = Double.toString(pageCropBox.getUpperRightY() - rect.getY() - rect.getHeight());
            w = Double.toString(rect.getWidth());
            h = Double.toString(rect.getHeight());
            break;
        case 90:
            x = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
            y = Double.toString(rect.getX() - pageCropBox.getLowerLeftX());
            w = Double.toString(rect.getHeight());
            h = Double.toString(rect.getWidth());
            break;
        case 180:
            x = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
            y = Double.toString(rect.getY() - pageCropBox.getLowerLeftY());
            w = Double.toString(rect.getWidth());
            h = Double.toString(rect.getHeight());
            break;
        case 270:
            // left edge: crop box top minus the upper edge (y + height) of the rectangle
            x = Double.toString(pageCropBox.getUpperRightY() - rect.getY() - rect.getHeight());
            y = Double.toString(pageCropBox.getUpperRightX() - rect.getX() - rect.getWidth());
            w = Double.toString(rect.getHeight());
            h = Double.toString(rect.getWidth());
            break;
        default:
            // n + 1 is the 1-based page number; a PDPage object cannot be formatted with %d
            throw new IOException(String.format("Unsupported page rotation %d on page %d.", pageRotation, n + 1));
        }
        writeToFile(pageNum, x, y, w, h, osw);
    }
(ExtractLinesWithDir test testExtractLineRotationTestWithDir)
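To check the mapping without a PDF at hand, the arithmetic can be isolated into a pure function on java.awt.geom rectangles. The following is only a sketch under assumptions: the class TopLeftMapper and its method toTopLeft are hypothetical names (not part of PDFBox or of LineCatcher), the crop box is passed as a plain Rectangle2D, and the upper edge of a bounding box is taken as rect.getMaxY(), i.e. y + height in PDF user space.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox.
final class TopLeftMapper {

    /**
     * Maps a bounding box in PDF user space (origin bottom-left, y up)
     * to {left, top, width, height} on the displayed, rotated page
     * (origin top-left, y down).
     */
    static double[] toTopLeft(Rectangle2D rect, Rectangle2D cropBox, int rotation) {
        switch (rotation) {
            case 0:
                return new double[] {
                    rect.getMinX() - cropBox.getMinX(),
                    cropBox.getMaxY() - rect.getMaxY(),
                    rect.getWidth(), rect.getHeight() };
            case 90:
                // width and height swap because the page is turned sideways
                return new double[] {
                    rect.getMinY() - cropBox.getMinY(),
                    rect.getMinX() - cropBox.getMinX(),
                    rect.getHeight(), rect.getWidth() };
            case 180:
                return new double[] {
                    cropBox.getMaxX() - rect.getMaxX(),
                    rect.getMinY() - cropBox.getMinY(),
                    rect.getWidth(), rect.getHeight() };
            case 270:
                return new double[] {
                    cropBox.getMaxY() - rect.getMaxY(),
                    cropBox.getMaxX() - rect.getMaxX(),
                    rect.getHeight(), rect.getWidth() };
            default:
                throw new IllegalArgumentException("Unsupported page rotation " + rotation);
        }
    }

    public static void main(String[] args) {
        // Assumed sample data: a 600 x 800 crop box and a vertical line
        // at x = 100 running from y = 200 up to y = 500.
        Rectangle2D cropBox = new Rectangle2D.Double(0, 0, 600, 800);
        Rectangle2D line = new Rectangle2D.Double(100, 200, 0, 300);
        for (int rotation : new int[] { 0, 90, 180, 270 }) {
            System.out.println(rotation + ": "
                    + Arrays.toString(toTopLeft(line, cropBox, rotation)));
        }
    }
}
```

With the sample data, the unrotated page gives a top end point 300 units below the top edge (800 - 500), and at 90 and 270 degrees the vertical line becomes a horizontal one, so width and height trade places.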
Relation to TextPosition.get?DirAdj() coordinates
The OP describes the coordinates by referring to the TextPosition class methods getXDirAdj() and getYDirAdj(). Indeed, these methods return coordinates in a coordinate system with the origin in the upper left page corner and y coordinates increasing downwards after rotating the page so that the text is drawn upright.
In case of the example document all the text is drawn so that it is upright after applying the page rotation. From this my understanding of the requirement written at the top has been derived.
The problem with using the TextPosition.get?DirAdj() values as coordinates globally, though, is that in documents with pages with text drawn in different directions, the collected text coordinates suddenly are relative to different coordinate systems. Thus, a general solution should not collect coordinates wildly like that. Instead, it should first determine a page orientation (e.g. the orientation given by the page rotation or the orientation shared by most of the text) and then use coordinates in the fixed coordinate system given by that orientation, plus an indication of the writing direction of the text piece in question.
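One way to pick "the orientation shared by most of the text" is a simple majority vote over the text directions seen on a page. The sketch below assumes the directions have already been collected as whole degrees (0, 90, 180 or 270); in PDFBox they could, to my understanding, be taken from TextPosition.getDir() and rounded. The class DominantDirection and its method mostCommon are hypothetical names used for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of PDFBox.
final class DominantDirection {

    /**
     * Returns the direction (in degrees) that occurs most often.
     * On a tie, the direction that reaches the highest count first wins;
     * an empty list yields 0.
     */
    static int mostCommon(List<Integer> directions) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = 0;
        int bestCount = 0;
        for (int direction : directions) {
            // increment the counter for this direction
            int count = counts.merge(direction, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = direction;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed sample: three text pieces at 90 degrees, two at 0 degrees
        List<Integer> directions = Arrays.asList(0, 90, 90, 0, 90);
        System.out.println("Dominant text direction: " + mostCommon(directions));
    }
}
```

The resulting direction would then fix the coordinate system for the whole page, and each text piece would carry its own direction as a separate attribute rather than silently switching coordinate systems.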
Source: https://stackoverflow.com/questions/55166990/pdfbox-line-rectangle-extraction