PDFBox 2.0.18 - How to iterate through the pages of a PDF and retrieve specific fields

Question

I'm using PDFBox to read specific fields in a PDF document. So far, I'm able to get all the information I want from a PDF containing only one page. The PDF has fields with specific names, and I can read all of them and insert their values into a database.

I use this code with the AcroForm to access the fields:

// Load the uploaded PDF and get its document-wide AcroForm.
InputStream document = item.getInputStream();
pdf = PDDocument.load(new RandomAccessBufferedFileInputStream(document));
pdCatalog = pdf.getDocumentCatalog();
pdAcroForm = pdCatalog.getAcroForm();

String dateRapport = pdAcroForm.getField("import_Date01").getValueAsString();
String radioRaison = pdAcroForm.getField("NoFlight").getValueAsString();
boolean hasdata = false;

// Map the selected radio button value to the corresponding reason code.
if (radioRaison.length() > 0 && !radioRaison.equals("Off")) {
    if (radioRaison.equals("NR")) {
        rvhi.setRaison(obtenirRaison(raisons, "NR"));
    } else if (radioRaison.equals("WX")) {
        rvhi.setRaison(obtenirRaison(raisons, "ME"));
    } else if (radioRaison.equals("US")) {
        rvhi.setRaison(obtenirRaison(raisons, "BR"));
    }
}
// indexEnString is not shown in this snippet.
if (pdAcroForm.getField("import_Hmn0" + indexEnString).getValueAsString().length() > 0) {
    hasdata = true;
}

pdf.close();

return hasdata;

Now, my problem is doing the same thing with a PDF that contains multiple identical pages with the same field names but different data in the fields. I would like to iterate through each page, call the same method, and retrieve the field data on each page.

I use the code below to iterate through the pages of the PDF, but I don't know how to get the fields on the current page. How do I get the AcroForm fields from the PDPage object?

PDPageTree nbPages = pdf.getPages();

if(nbPages.getCount() > 1) {
    for(PDPage page : nbPages) {
        // ???? how do I get the AcroForm fields from the PDPage page ????
    }
}

Thanks in advance for your responses!

Answer 1:

There is no such thing as a list of PDField objects for the current page; an AcroForm is document-wide. So the code in the first part of your question already has access to the full list of fields in the document. (See 12.7.1 in the PDF specification from Adobe.)

Fields can have the same fully qualified name, but then their values also have to be the same. (12.7.3.2 in the PDF Specification)

What probably happens in your document is that the partial name of the field is the same, but the fully qualified name isn't. The fully qualified name is formed by concatenating the partial names of the field's ancestors and of the field itself, separated by periods, as in "parent partial name"."child partial name".
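To make the naming rule concrete, here is a minimal plain-Java sketch with no PDFBox involved. The class name, the method name, and the sample names are hypothetical and only illustrate the concatenation; on a real document, PDField.getFullyQualifiedName() returns the actual value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
class QualifiedNameSketch {

    // Joins the partial names, ordered from the root ancestor down to the
    // field itself, with a period between each level.
    static String fullyQualifiedName(List<String> partialNames) {
        return String.join(".", partialNames);
    }

    public static void main(String[] args) {
        // Made-up hierarchy: a parent named "1" with a child named "import_Date01".
        System.out.println(fullyQualifiedName(Arrays.asList("1", "import_Date01")));
        // prints "1.import_Date01"
    }
}
```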

So basically you'll have to use the fully qualified name to find the field, or you need to iterate over the list of fields to find all fields you have in the document.

You can find the page on which a particular field is displayed, because a field uses annotations (widget annotations) to show itself on a page. These annotations live in an Annots array at the page level. Whether there is a convenience function in PDFBox to do this easily, I don't know.
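As a rough sketch of the two ideas above, the following walks every field in the document and, for each terminal field, looks for its widget annotations in each page's Annots array. It assumes PDFBox 2.0.x on the classpath; the class name FieldPageLister and the method name pagesByFieldName are hypothetical, and the page lookup is a plain linear search rather than a built-in PDFBox convenience.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationWidget;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField;

// Hypothetical helper class; the name is not part of PDFBox.
class FieldPageLister {

    // Maps each terminal field's fully qualified name to the zero-based
    // indexes of the pages whose Annots array contains one of its widgets.
    static Map<String, List<Integer>> pagesByFieldName(PDDocument pdf) throws IOException {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        PDAcroForm acroForm = pdf.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return result; // the document has no form
        }
        // getFieldTree() walks the whole hierarchy, not only the root fields.
        for (PDField field : acroForm.getFieldTree()) {
            if (!(field instanceof PDTerminalField)) {
                continue; // non-terminal fields only group their children
            }
            List<Integer> pages = new ArrayList<>();
            for (PDAnnotationWidget widget : field.getWidgets()) {
                int pageIndex = 0;
                for (PDPage page : pdf.getPages()) {
                    for (PDAnnotation annotation : page.getAnnotations()) {
                        // Compare the underlying COS dictionaries, since PDFBox
                        // may wrap the same dictionary in different objects.
                        if (annotation.getCOSObject() == widget.getCOSObject()) {
                            pages.add(pageIndex);
                        }
                    }
                    pageIndex++;
                }
            }
            result.put(field.getFullyQualifiedName(), pages);
        }
        return result;
    }
}
```

Printing the returned map for the multi-page document is one way to see which fully qualified names exist and which page each one belongs to.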

Answer 2:

Sorry for the late response... Thank you @DavidvanDriessche. To find out how the fully qualified names are composed, I used a small function to list all fields and their child nodes, if they have any. It turns out that for the second page of the document, the page number was used as the parent partial name. For example, on the first page the fully qualified name is "fieldNameExample.fieldNameExample", and on the second page it is "1.fieldNameExample". So I can assume that for every subsequent page, the fully qualified name will be the page number followed by ".fieldNameExample".
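The naming pattern described above could be captured in a small helper like the one below. This is only a sketch of what was observed in this particular document: the class and method names are hypothetical, the page index is assumed to be zero-based, and the pattern for pages after the second is the same assumption made above, so it may not hold for other PDFs.

```java
// Hypothetical helper reflecting the naming observed in this one document.
class PageFieldNames {

    // pageIndex is zero-based: 0 is the first page, 1 is the second, and so on.
    static String qualifiedName(int pageIndex, String partialName) {
        if (pageIndex == 0) {
            // Observed on the first page: the partial name appears twice.
            return partialName + "." + partialName;
        }
        // Observed on the second page and assumed for later pages.
        return pageIndex + "." + partialName;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(qualifiedName(i, "fieldNameExample"));
        }
    }
}
```

The resulting string can then be passed to pdAcroForm.getField(...). That call returns null when no field has that name, so a null check before getValueAsString() is a reasonable precaution.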

Thanks everyone for your help!

Source: https://stackoverflow.com/questions/62583973/pdfbox-2-0-18-how-to-iterates-through-pages-of-a-pdf-and-retrieve-specific-fie

Tags

java

pdf

pdfbox