PDFBox 1.8 PrintTextLocations wrong TextPosition height for a multi page pdf

Question

I am running the PrintTextLocations example provided with PDFBox to get the width and height of each TextPosition. When I pass it a one-page PDF, it gives accurate results. But when I use a multi-page PDF, I get incorrect heights.

This is the experiment I did: I took a 5-page PDF and passed it in as the argument (and got the wrong height for each TextPosition). Next I split the same PDF into 5 single-page PDFs using Mac OS X Preview and passed each page in one by one (and got the correct height).

package printtextlocations;

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), contents.getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    /**
     * @param text The text to be processed
     */
    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println(" String [x: " + text.getXDirAdj() + ", y: "
            + text.getY() + ", height:" + text.getHeightDir()
            + ", space: " + text.getWidthOfSpace() + ", width: "
            + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
            + text.getCharacter());
    }
}

Output snippet - 5 page pdf

String [x: 58.500004, y: 692.2, height:33.480003, space: 2.64, width: 6.635998, yScale: 12.0]6

String [x: 58.6, y: 741.2, height:33.480003, space: 2.64, width: 6.6360016, yScale: 12.0]1

String [x: 58.6, y: 753.4, height:33.480003, space: 2.64, width: 6.6360016, yScale: 12.0]2

Output snippet - 1 page pdfs

String [x: 58.5, y: 692.2, height:5.55, space: 2.64, width: 6.6480026, yScale: 12.0]6

String [x: 58.6, y: 741.2, height:5.55, space: 2.64, width: 6.6480026, yScale: 12.0]1

String [x: 58.6, y: 753.4, height:5.55, space: 2.64, width: 6.6480026, yScale: 12.0]2

Does anyone know why we get inconsistent results in this case? Is there any setting I am missing?

Thanks for the help.

Here's another test file (wrong height pdf - 3 pages), and here is the output I get:

String [x: 90.0, y: 83.28003, height:33.480003, space: 5.8497605, width: 7.248001, yScale: 12.0]V

String [x: 97.242, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e

String [x: 103.095604, y: 83.28003, height:33.480003, space: 5.8497605, width:4.9680023,yScale:12.0]r

String [x: 108.0588, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.0479965, yScale:12.0]y

String [x: 116.748, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.9520035, yScale: 12.0]S

String [x: 122.7012, y: 83.28003, height:33.480003, space: 5.8497605, width: 3.3359985, yScale:12.0]i

String [x: 126.034805, y: 83.28003, height:33.480003, space: 5.8497605, width: 9.983994,yScale:12.0]m

String [x: 136.01881, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.671997, yScale:12.0]p

String [x: 142.6932, y: 83.28003, height:33.480003, space: 5.8497605, width: 3.251999, yScale: 12.0]l

String [x: 145.9512, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e

String [x: 154.4472, y: 83.28003, height:33.480003, space: 5.8497605, width: 7.9440002, yScale:12.0]D

String [x: 162.38641, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.371994, yScale:12.0]o

String [x: 168.75601, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.2920074, yScale: 12.0]c

String [x: 174.0468, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.624008, yScale: 12.0]u

String [x: 180.6732, y: 83.28003, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m

String [x: 190.6572, y: 83.28003, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e

String [x: 196.5108, y: 83.28003, height:33.480003, space: 5.8497605, width: 6.695999, yScale: 12.0]n

String [x: 203.20801, y: 83.28003, height:33.480003, space: 5.8497605, width: 4.0559998, yScale: 12.0]t

done processing page 0

done add page 0

String [x: 90.0, y: 139.44, height:33.480003, space: 5.8497605, width: 6.816002, yScale: 12.0]P

String [x: 96.8148, y: 139.44, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]a

String [x: 102.6696, y: 139.44, height:33.480003, space: 5.8497605, width: 5.9280014, yScale: 12.0]g

String [x: 108.5964, y: 139.44, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]e

String [x: 117.090004, y: 139.44, height:33.480003, space: 5.8497605, width: 6.6480026, yScale:12.0]2

String [x: 126.375595, y: 139.44, height:33.480003, space: 5.8497605, width: 6.371994, yScale: 12.0]o

String [x: 132.7464, y: 139.44, height:33.480003, space: 5.8497605, width: 3.6360016, yScale: 12.0]f

String [x: 139.0312, y: 139.44, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m

String [x: 149.0152, y: 139.44, height:33.480003, space: 5.8497605, width: 3.3359985, yScale: 12.0]i

String [x: 152.3488, y: 139.44, height:33.480003, space: 5.8497605, width: 6.695999, yScale: 12.0]n

String [x: 159.046, y: 139.44, height:33.480003, space: 5.8497605, width: 3.3359985, yScale: 12.0]i

String [x: 162.37961, y: 139.44, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m

String [x: 172.3636, y: 139.44, height:33.480003, space: 5.8497605, width: 5.856003, yScale: 12.0]a

String [x: 178.2232, y: 139.44, height:33.480003, space: 5.8497605, width: 3.251999, yScale: 12.0]l

String [x: 181.4812, y: 139.44, height:33.480003, space: 5.8497605, width: 3.3359985, yScale: 12.0]i

String [x: 184.8148, y: 139.44, height:33.480003, space: 5.8497605, width: 5.1600037, yScale: 12.0]s

String [x: 189.9712, y: 139.44, height:33.480003, space: 5.8497605, width: 9.983994, yScale: 12.0]m

done processing page 1

done add page 1

String [x: 90.0, y: 266.15997, height:33.480003, space: 5.8497605, width: 6.816002, yScale: 12.0]P

String [x: 96.8148, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.856003, yScale:12.0]a

String [x: 102.6696, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.9280014,yScale:12.0]g

String [x: 108.5964, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.856003, yScale:12.0]e

String [x: 117.090004, y: 266.15997, height:33.480003, space: 5.8497605,width:6.6480026,yScale:12.0]3

String [x: 126.375595, y: 266.15997, height:33.480003, space: 5.8497605, width:6.371994,yScale:12.0]o

String [x: 132.7464, y: 266.15997, height:33.480003, space: 5.8497605, width: 7.548004,yScale:12.0]K

String [x: 140.3052, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.856003,yScale:12.0]a

String [x: 146.16, y: 266.15997, height:33.480003, space: 5.8497605, width: 6.048004, yScale: 12.0]y

String [x: 152.2068, y: 266.15997, height:33.480003, space: 5.8497605, width: 5.0639954,yScale:12.0]?

done processing page 2

done add page 2

Answer 1:

When determining the height of a parsed glyph (using the getFontHeight method of the font object in question), PDFBox first checks whether it has font metrics for individual glyphs at hand. It only knows AFM Type 1 font metrics here; as your font is a TrueType font, PDFBox does not have such metrics.

In such a case it continues by trying to retrieve general font metrics from the font descriptor. The font descriptor of the font in your document looks like this:

21 0 obj <<
    /Type /FontDescriptor
    /FontName /GLDXOZ+Cambria
    /Flags 4
    /FontBBox [-1475 -2463 2867 3117]
    /ItalicAngle 0
    /Ascent 950
    /Descent -222
    /CapHeight 667
    /StemV 0
    /XHeight 467
    /AvgWidth 615
    /MaxWidth 2919
    /FontFile2 24 0 R
>>
endobj

The first descriptor entry it inspects is the font bounding box (/FontBBox entry), and if it is present, it takes half its height as the average font height.

In your case the font bounding box is very large compared to the glyphs in the font; vertically it goes from -2463 up to 3117!
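As a rough check, the reported height can be reproduced from these numbers. The sketch below is plain arithmetic, not PDFBox code; the class and method names are hypothetical, and it assumes that glyph space units are 1/1000 of a text space unit and that the result is scaled by the yScale of 12.0 shown in the output.

```java
final class BBoxHeightCheck {
    // Half the vertical extent of a /FontBBox, in glyph space units.
    static double halfBBoxHeight(double lowerY, double upperY) {
        return (upperY - lowerY) / 2.0;
    }

    // Assumption: glyph space units are 1/1000 of a text space unit,
    // and the value is then multiplied by the vertical text scale.
    static double scaledHeight(double glyphUnits, double yScale) {
        return glyphUnits / 1000.0 * yScale;
    }

    public static void main(String[] args) {
        // /FontBBox [-1475 -2463 2867 3117]: (3117 + 2463) / 2 = 2790
        double half = halfBBoxHeight(-2463, 3117);
        // 2790 / 1000 * 12 is roughly 33.48
        System.out.println(scaledHeight(half, 12.0));
    }
}
```

Under those assumptions the result is about 33.48, which agrees with the height:33.480003 in the multi-page output.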

On the other hand, the capital letter height (/CapHeight entry, the vertical coordinate of the top of flat capital letters, measured from the baseline) is merely 667, and the ascent (/Ascent, the maximum height above the baseline reached by glyphs in this font; the height of glyphs for accented characters is excluded) only 950. This really makes me wonder why that font has such a font bounding box...

If there wasn't a font bounding box, PDFBox would next have tried using the capital letter height, then the ascent, and eventually /XHeight - /Descent. Each of these would have resulted in a reasonable value, but as there is that bounding box, PDFBox assumes a much too large value.
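The fallback order just described can be modeled roughly as follows. This is a simplified, self-contained sketch and not PDFBox's actual source: FontHeightFallback and estimate are hypothetical names, a value of 0 stands for a missing entry, and any further scaling PDFBox may apply to the fallback values is not modeled.

```java
final class FontHeightFallback {
    // Returns an average glyph height in glyph space units, taking the
    // first non-zero candidate in the order described in the answer.
    // A value of 0 is treated as "entry missing" (an assumption).
    static double estimate(double bboxLowerY, double bboxUpperY,
                           double capHeight, double ascent,
                           double xHeight, double descent) {
        // 1. Half the height of the font bounding box, if there is one.
        double height = (bboxUpperY - bboxLowerY) / 2.0;
        // 2. Otherwise the capital letter height.
        if (height == 0) {
            height = capHeight;
        }
        // 3. Otherwise the ascent.
        if (height == 0) {
            height = ascent;
        }
        // 4. Otherwise /XHeight - /Descent.
        if (height == 0 && xHeight > 0) {
            height = xHeight - descent;
        }
        return height;
    }

    public static void main(String[] args) {
        // Descriptor values from the document: the bounding box wins.
        System.out.println(estimate(-2463, 3117, 667, 950, 467, -222)); // 2790.0
        // Same descriptor without a bounding box: the cap height is used.
        System.out.println(estimate(0, 0, 667, 950, 467, -222)); // 667.0
    }
}
```

With the bounding box present, the much smaller cap height and ascent are never consulted, which is why the oversized /FontBBox dominates the result.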

The code in question is commented as

// the following values are all more or less accurate
// at least all are average values. Maybe we'll find
// another way to get those value for every single glyph
// in the future if needed

While I don't know why PDFBox prefers to guess the average height from the bounding box instead of e.g. the ascent, it is not the only software assuming text in that font of yours to be gigantic. If, for example, you use the text touchup tool of Adobe Acrobat, the cursor appears as a vertical bar far taller than the text itself. So Acrobat, too, thinks the font is huge.

Unfortunately you have not provided the single-page PDFs created from your sample by splitting with Mac OS X Preview, so I don't know why you get more realistic information thereafter. Obviously, though, Preview somehow changes the font information, as the cause of the giant height values has nothing to do with whether the document has multiple pages or only a single one.

Source: https://stackoverflow.com/questions/16579146/pdfbox-1-8-printtextlocations-wrong-textposition-height-for-a-multi-page-pdf

Tags

pdf

pdfbox