Accessing “alternate text” for an image via PDFBox

Question

Is there some way to extract "alternate text" for a specific image using PDFBox?

I have a PDF file in which, as described at http://www.w3.org/WAI/GL/2011/WD-WCAG20-TECHS-20110621/pdf.html#PDF1, alternate text has been added to an image. Using PDFBox I can navigate the object model to the image itself (a PDXObjectImage) via PDDocument.getDocumentCatalog().getAllPages(), then for each page getResources().getImages(), but I cannot see any way to get from the image itself to its alternate text.

A small sample PDF (with a single image which has some alternate text specified) can be found at http://dl.dropbox.com/u/12253279/image_test_pass.pdf (It should say "This is the alternate text for the image.").

Answer 1:

I do not know whether or how this can be done with PDFBox, but I can tell you that this feature belongs to the sections of the PDF Spec called Logical Structure and Tagged PDF, which not every PDF tool out there fully supports.

Assuming the tool you are using supports it and gives you access to the internal structure of the PDF file, you will have to follow 4 main steps to retrieve this information (I will use the sample PDF file you posted for the following explanation):

1- Parse the page content and find the MCID number of the Tag element that wraps the image you are interested in.

Page content:

    BT
    /P <</MCID 0 >>BDC
    /GS0 gs
    /TT0 1 Tf
    0.0004 Tc -0.0028 Tw 10.02 0 0 10.02 90 711 Tm
    (This is an image test )Tj
    EMC
    ET
    /Figure <</MCID 1 >>BDC
    q
    106.5 0 0 106.5 90 591.0599976 cm
    /Im0 Do
    Q
    EMC

Your image is the /Im0 XObject painted by "/Im0 Do". It is wrapped by the /Figure marked-content sequence, so its MCID is 1.
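
Step 1 can be sketched in plain Java as a token scan over the content stream text. This is only a rough illustration, not how PDFBox or any other library does it: the class name McidFinder and the method findMcid are hypothetical, the stream is assumed to be already decoded into a String, and the regular expression only recognizes property lists of the exact form "<</MCID n >>". A real implementation should use a proper content stream parser, since a naive scan like this can be fooled by operator names that appear inside string literals.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: finds the MCID of the
// marked-content sequence that encloses a given "/Name Do" operator.
class McidFinder {
    // Alternatives, tried in this order at each position:
    //   1. BDC with an inline "<</MCID n >>" property list
    //   2. any other BDC or BMC (opens a sequence without a visible MCID)
    //   3. EMC (closes the innermost open sequence)
    //   4. "/Name Do" (paints the XObject called Name)
    private static final Pattern TOKEN = Pattern.compile(
        "<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC|\\b(BDC|BMC)\\b|\\b(EMC)\\b|/(\\S+)\\s+Do\\b");

    /** Returns the enclosing MCID, or -1 if the XObject is not tagged. */
    static int findMcid(String content, String xObjectName) {
        Deque<Integer> open = new ArrayDeque<>(); // stack of open sequences
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                open.push(Integer.parseInt(m.group(1)));
            } else if (m.group(2) != null) {
                open.push(-1); // sequence without an inline MCID
            } else if (m.group(3) != null) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (m.group(4).equals(xObjectName)) {
                // Iteration starts at the top of the stack (innermost first).
                for (int mcid : open) {
                    if (mcid >= 0) {
                        return mcid;
                    }
                }
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the page content shown above.
        String content =
            "BT\n/P <</MCID 0 >>BDC\n(This is an image test )Tj\nEMC\nET\n"
          + "/Figure <</MCID 1 >>BDC\nq\n106.5 0 0 106.5 90 591.0599976 cm\n"
          + "/Im0 Do\nQ\nEMC\n";
        System.out.println(findMcid(content, "Im0"));
    }
}
```

The stack is there because marked-content sequences can nest; for the sample file a single level is enough.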

2- In the page object, retrieve the value of the key StructParents.

3- Now retrieve the Structure Tree (key StructTreeRoot of the Catalog object, which is the root object in every PDF file), and inside it, the ParentTree.

4- The ParentTree starts with an array where you can find pairs of elements (see Number Trees in the PDF Spec for more details). In this specific tree, the first element of each pair is a numeric value that corresponds to the StructParents key retrieved in step 2, and the second element is an array of objects whose indexes correspond to the MCID values retrieved in step 1. So, look in that array for the element at the index equal to the MCID value of your image, and you will find a PDF object. Inside this object, you will find the alternate text.
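
The lookup in steps 2 to 4 can be modeled in plain Java as a map from the StructParents value to a list indexed by MCID. This is a simplified sketch under stated assumptions: ParentTreeLookup, StructElem and altText are hypothetical names, the map stands in for the number tree, and the StructParents value 0 used in main is assumed for illustration rather than read from the sample file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the ParentTree lookup; no PDF library used.
class ParentTreeLookup {
    /** Stand-in for a structure element dictionary. */
    static final class StructElem {
        final String type; // the /S entry, e.g. "Figure"
        final String alt;  // the /Alt entry, or null when absent

        StructElem(String type, String alt) {
            this.type = type;
            this.alt = alt;
        }
    }

    /**
     * Returns the alternate text for the marked content identified by
     * (structParents, mcid), or null when nothing is found.
     */
    static String altText(Map<Integer, List<StructElem>> parentTree,
                          int structParents, int mcid) {
        // First half of each pair: the page's StructParents value.
        List<StructElem> byMcid = parentTree.get(structParents);
        if (byMcid == null || mcid < 0 || mcid >= byMcid.size()) {
            return null;
        }
        // Second half of the pair: an array indexed by MCID.
        StructElem elem = byMcid.get(mcid);
        return elem == null ? null : elem.alt;
    }

    public static void main(String[] args) {
        Map<Integer, List<StructElem>> parentTree = new HashMap<>();
        // Assumed StructParents value 0; MCID 0 is the paragraph, MCID 1 the figure.
        parentTree.put(0, Arrays.asList(
            new StructElem("P", null),
            new StructElem("Figure", "This is the alternate text for the image.")));
        System.out.println(altText(parentTree, 0, 1));
    }
}
```

In a real file the pairs may be spread over several levels of the number tree, so the flat map here only mirrors the simple layout described above.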

Looks easy, doesn't it?

Tools used in this answer:
PDF Vole (based on iText)
Amyuni PDF Analyzer

Answer 2:

Eric from the PDFBox mailing list sent me the following, though I've not tested it out yet...

Hi,

For your test file, here is a way to access the "/Alt" entry:

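    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
    import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

    // Written against the PDFBox 1.x API; load() throws IOException.
    PDDocument document = PDDocument.load("image_test_pass.pdf");
    PDStructureTreeRoot treeRoot =
        document.getDocumentCatalog().getStructureTreeRoot();

    // Only the direct children of the root are visited here;
    // nested structure elements would need a recursive walk.
    for (Object o : treeRoot.getKids()) {
        if (o instanceof PDStructureElement) {
            PDStructureElement structElement = (PDStructureElement) o;
            // Prints the "/Alt" entry, or null if the element has none.
            System.out.println(structElement.getAlternateDescription());
            // Get the page for each StructElement.
            PDPage page = structElement.getPage();
            if (page != null) {
                // Images of that page (the result is not used in this snippet).
                page.getResources().getImages();
            }
        }
    }
    document.close();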

Please refer to the PDF specification http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf and in particular §14.6, §14.7, §14.9.3 and §14.9.4 for all the rules on how to find the "/Alt" entry. There seem to be several ways to define this information.

BR, Eric

Source: https://stackoverflow.com/questions/12525883/accessing-alternate-text-for-an-image-via-pdfbox

Tags

java

pdf

accessibility

pdfbox