How to know if a field is on a particular page?

梦谈多话 2020-12-06 13:34

The PDFBox content stream is done per page, but the fields come from the form, which comes from the catalog, which comes from the PDF document itself. So I'm not sure which fields are on which pages.

4 answers
  • 2020-12-06 14:13

    The PDFbox content stream is done per page, but the fields come from the form which comes from the catalog, which comes from the pdf doc itself. So I'm not sure which fields are on which pages

    The reason for this is that PDFs contain a global object structure defining the form. A form field in this structure may have 0, 1, or more visualizations on 0, 1, or more actual PDF pages. Furthermore, in case of only 1 visualization, a merge of field object and visualization object is allowed.

    PDFBox 1.8.x

    Unfortunately PDFBox in its PDAcroForm and PDField objects represents only this object structure and does not provide easy access to the associated pages. By accessing the underlying structures, though, you can build the connection.

    The following code should make clear how to do that:

    @SuppressWarnings("unchecked")
    public void printFormFields(PDDocument pdfDoc) throws IOException {
        PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog();
    
        List<PDPage> pages = docCatalog.getAllPages();
        Map<COSDictionary, Integer> pageNrByAnnotDict = new HashMap<COSDictionary, Integer>();
        for (int i = 0; i < pages.size(); i++) {
            PDPage page = pages.get(i);
            for (PDAnnotation annotation : page.getAnnotations())
                pageNrByAnnotDict.put(annotation.getDictionary(), i + 1);
        }
    
        PDAcroForm acroForm = docCatalog.getAcroForm();
        if (acroForm == null)
            return; // the document contains no form
    
        for (PDField field : (List<PDField>)acroForm.getFields()) {
            COSDictionary fieldDict = field.getDictionary();
    
            List<Integer> annotationPages = new ArrayList<Integer>();
            List<COSObjectable> kids = field.getKids();
            if (kids != null) {
                for (COSObjectable kid : kids) {
                    COSBase kidObject = kid.getCOSObject();
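                    // only kids that are annotations of some page have an entry in the map
                    if (kidObject instanceof COSDictionary && pageNrByAnnotDict.containsKey(kidObject))
                        annotationPages.add(pageNrByAnnotDict.get(kidObject));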
                }
            }
    
            Integer mergedPage = pageNrByAnnotDict.get(fieldDict);
    
            if (mergedPage == null)
                if (annotationPages.isEmpty())
                    System.out.printf("i Field '%s' not referenced (invisible).\n", field.getFullyQualifiedName());
                else
                    System.out.printf("a Field '%s' referenced by separate annotation on %s.\n", field.getFullyQualifiedName(), annotationPages);
            else
                if (annotationPages.isEmpty())
                    System.out.printf("m Field '%s' referenced as merged on %s.\n", field.getFullyQualifiedName(), mergedPage);
                else
                    System.out.printf("x Field '%s' referenced as merged on %s and by separate annotation on %s. (Not allowed!)\n", field.getFullyQualifiedName(), mergedPage, annotationPages);
        }
    }
    

    Beware, there are two shortcomings in the PDFBox PDAcroForm form field handling:

    1. The PDF specification allows the global object structure defining the form to be a deep tree, i.e. the actual fields do not have to be direct children of the root but may be organized by means of inner tree nodes. PDFBox ignores this and expects the fields to be direct children of the root.

    2. Some PDFs in the wild, foremost older ones, do not contain the field tree but only reference the field objects from the pages via the visualizing widget annotations. PDFBox does not see these fields in its PDAcroForm.getFields list.
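    The first shortcoming comes down to a missing recursion. As a rough, library-free sketch of what walking a deep field tree looks like, the following models the tree with a hypothetical FieldNode class (it is not a PDFBox type) and collects every terminal field at any depth. It assumes that kids are always fields; in a real PDF the kids of a field may also be widget annotations, which this model leaves out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Library-free model of a form field tree; FieldNode is hypothetical, not a PDFBox class. */
final class FieldTreeWalk {

    /** A node is either an inner node (has kids) or a terminal field (no kids). */
    static final class FieldNode {
        final String partialName;
        final List<FieldNode> kids;

        FieldNode(String partialName, FieldNode... kids) {
            this.partialName = partialName;
            this.kids = Arrays.asList(kids);
        }
    }

    /** Collects the fully qualified names of all terminal fields, at any depth. */
    static List<String> terminalFieldNames(List<FieldNode> roots) {
        List<String> names = new ArrayList<String>();
        for (FieldNode root : roots) {
            collect(root, "", names);
        }
        return names;
    }

    private static void collect(FieldNode node, String prefix, List<String> names) {
        // Qualified names are the partial names along the path, joined by periods.
        String qualified = prefix.isEmpty() ? node.partialName : prefix + "." + node.partialName;
        if (node.kids.isEmpty()) {
            names.add(qualified); // terminal field
            return;
        }
        for (FieldNode kid : node.kids) {
            collect(kid, qualified, names); // descend into the inner node
        }
    }

    public static void main(String[] args) {
        FieldNode address = new FieldNode("address",
                new FieldNode("street"), new FieldNode("city"));
        FieldNode name = new FieldNode("name");
        // prints [address.street, address.city, name]
        System.out.println(terminalFieldNames(Arrays.asList(address, name)));
    }
}
```

    Looking only at the direct children of the root would report "address" and "name" here and miss the two fields that actually carry values.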

    PS: @mikhailvs in his answer correctly shows that you can retrieve a page object from a field widget using PDField.getWidget().getPage() and determine its page number using catalog.getAllPages().indexOf. While fast, this getPage() approach has a drawback: it retrieves the page reference from an optional entry (P) of the widget annotation dictionary. If the PDF you process was created by software that fills that entry, all is well; if the PDF creator did not fill it, all you get is a null page.
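    The trade-off can be illustrated without PDFBox. The sketch below is only a model: Page and Widget are hypothetical stand-ins, and backReference plays the role of the optional page entry. It tries the cheap back-reference first and falls back to scanning the pages only when the entry is missing. Whether a back-reference that is present should always be trusted is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

/** Library-free model; Page and Widget are hypothetical stand-ins, not PDFBox classes. */
final class WidgetPageLookup {

    static final class Page {
        final List<Widget> annotations;

        Page(Widget... annotations) {
            this.annotations = Arrays.asList(annotations);
        }
    }

    static final class Widget {
        Page backReference; // models the optional page entry; stays null if the creator omitted it
    }

    /** Fast: follow the optional back-reference. Returns -1 if it is missing. */
    static int pageViaBackReference(List<Page> pages, Widget widget) {
        return widget.backReference != null ? pages.indexOf(widget.backReference) : -1;
    }

    /** Safe: scan every page's annotation list for this very widget object. */
    static int pageViaScan(List<Page> pages, Widget widget) {
        for (int i = 0; i < pages.size(); i++) {
            for (Widget candidate : pages.get(i).annotations) {
                if (candidate == widget) {
                    return i;
                }
            }
        }
        return -1;
    }

    /** Tries the fast path first and falls back to the scan. */
    static int pageOf(List<Page> pages, Widget widget) {
        int fast = pageViaBackReference(pages, widget);
        return fast >= 0 ? fast : pageViaScan(pages, widget);
    }

    public static void main(String[] args) {
        Widget withEntry = new Widget();
        Widget withoutEntry = new Widget();
        Page first = new Page(withEntry);
        Page second = new Page(withoutEntry);
        withEntry.backReference = first; // the creator filled the optional entry
        List<Page> pages = Arrays.asList(first, second);

        System.out.println(pageOf(pages, withEntry));                  // 0
        System.out.println(pageViaBackReference(pages, withoutEntry)); // -1
        System.out.println(pageOf(pages, withoutEntry));               // 1
    }
}
```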

    PDFBox 2.0.x

    In 2.0.x some methods for accessing the elements in question have changed, but the situation as a whole has not: to safely retrieve the page of a widget, you still have to iterate through the pages and find one that references the annotation.

    The safe method:

    int determineSafe(PDDocument document, PDAnnotationWidget widget) throws IOException
    {
        COSDictionary widgetObject = widget.getCOSObject();
        PDPageTree pages = document.getPages();
        for (int i = 0; i < pages.getCount(); i++)
        {
            for (PDAnnotation annotation : pages.get(i).getAnnotations())
            {
                COSDictionary annotationObject = annotation.getCOSObject();
                if (annotationObject.equals(widgetObject))
                    return i;
            }
        }
        return -1;
    }
    

    The fast method:

    int determineFast(PDDocument document, PDAnnotationWidget widget)
    {
        PDPage page = widget.getPage();
        return page != null ? document.getPages().indexOf(page) : -1;
    }
    

    Usage:

    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
    if (acroForm != null)
    {
        for (PDField field : acroForm.getFieldTree())
        {
            System.out.println(field.getFullyQualifiedName());
            for (PDAnnotationWidget widget : field.getWidgets())
            {
                System.out.print(widget.getAnnotationName() != null ? widget.getAnnotationName() : "(NN)");
                System.out.printf(" - fast: %s", determineFast(document, widget));
                System.out.printf(" - safe: %s\n", determineSafe(document, widget));
            }
        }
    }
    

    (DetermineWidgetPage.java)

    (In contrast to the 1.8.x code, the safe method here simply searches for the page of a single widget. If your code has to determine the pages of many widgets, you should create a lookup Map as in the 1.8.x case.)
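    The lookup Map idea can be sketched independently of the PDFBox version. The following is a library-free model, not PDFBox code: each page is represented simply as a list of annotation objects, and the class and method names are made up for illustration. The map is built once in a single pass over all pages, after which every widget lookup is a single map access. Keying by object identity (IdentityHashMap) is a choice of this sketch; with PDFBox the key would presumably be the annotation's COSDictionary, as in the 1.8.x code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

/** Library-free model of a widget-to-page index; pages are plain lists of annotation objects. */
final class WidgetPageIndex {

    /** Builds the index once: annotation object -> zero-based page index. */
    static Map<Object, Integer> indexByAnnotation(List<List<Object>> annotationsPerPage) {
        Map<Object, Integer> index = new IdentityHashMap<Object, Integer>();
        for (int i = 0; i < annotationsPerPage.size(); i++) {
            for (Object annotation : annotationsPerPage.get(i)) {
                if (!index.containsKey(annotation)) {
                    index.put(annotation, i); // keep the first page that references it
                }
            }
        }
        return index;
    }

    /** Returns the page index of the widget, or -1 if no page references it. */
    static int pageOf(Map<Object, Integer> index, Object widget) {
        Integer page = index.get(widget);
        return page != null ? page : -1;
    }

    public static void main(String[] args) {
        Object w1 = new Object();
        Object w2 = new Object();
        Object w3 = new Object();
        Object orphan = new Object();

        List<List<Object>> pages = new ArrayList<List<Object>>();
        pages.add(Arrays.asList(w1, w2));     // page 0
        pages.add(new ArrayList<Object>());   // page 1 has no annotations
        pages.add(Arrays.asList(w3));         // page 2

        Map<Object, Integer> index = indexByAnnotation(pages);
        System.out.println(pageOf(index, w2));     // 0
        System.out.println(pageOf(index, w3));     // 2
        System.out.println(pageOf(index, orphan)); // -1
    }
}
```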

    Example documents

    A document for which the fast method fails: aFieldTwice.pdf

    A document for which the fast method works: test_duplicate_field2.pdf

  • 2020-12-06 14:20

    Granted this answer may not help the OP (a year later), but if someone else runs into it, here is the solution:

    PDDocumentCatalog catalog = doc.getDocumentCatalog();
    
    int pageNumber = catalog.getAllPages().indexOf(yourField.getWidget().getPage());
    
  • 2020-12-06 14:22

    A general solution for fields with a single widget or with multiple widgets (i.e. the same qualified name appearing more than once):

    List<PDAnnotationWidget> widgets = field.getWidgets();
    PDDocumentCatalog catalog = doc.getDocumentCatalog();
    for (int i = 0; i < widgets.size(); i++) {
        // getPage() reads an optional entry of the widget and may return null
        PDPage page = widgets.get(i).getPage();
        int pageNumber = page != null ? 1 + catalog.getPages().indexOf(page) : -1;

        // The widget coordinates can be read here as well; this works for single and multiple widgets.
        // PDRectangle r = widgets.get(i).getRectangle();
    }
    
  • 2020-12-06 14:26

    This example uses Lucee (cfml) https://lucee.org/

    A big thank you to mkl as his answer above is invaluable and I couldn't have built this function without his help.

    Call the function pageForSignature(doc, fieldName) and it will return the number of the page on which the field named fieldName resides (page numbers start at 0). It returns -1 if fieldName is not found.

      <cfscript>
    
      /*
      java is used by using CreateObject()
      */
    
      variables.File = CreateObject("java", "java.io.File");
    
      //references lucee bundle directory - typically on tomcat: /usr/local/tomcat/lucee-server/bundles
      variables.PDDocument = CreateObject("java", "org.apache.pdfbox.pdmodel.PDDocument", "org.apache.pdfbox.app", "2.0.18")
    
      function determineSafe(doc, widget){
    
        var i = '';
        var widgetObject = widget.getCOSObject();
        var pages = doc.getPages();
        var annotation = '';
        var annotationObject = '';
    
        for (i = 0; i < pages.getCount(); i=i+1){
          for (annotation in pages.get(i).getAnnotations()){
            if(annotation.getSubtype() eq 'widget'){
              annotationObject = annotation.getCOSObject();
              if (annotationObject.equals(widgetObject)){
                return i;
              }
            }
          }
        }
        return -1;
      }
    
      function pageForSignature(doc, fieldName){
        try{
          var acroForm = doc.getDocumentCatalog().getAcroForm();
          var field = '';
          var widget = '';
          var pageNo = '';

          for(field in acroForm.getFields()){
            if(field.getPartialName() == fieldName){
              for(widget in field.getWidgets()){
                // determineSafe() scans the pages itself, so it does not depend on widget.getPage()
                pageNo = determineSafe(doc, widget);
                doc.close();
                return pageNo;
              }
            }
          }
          return -1;
        }catch(e){
          doc.close();
          writeDump(label="catch error", var='#e#');
        }
      }
    
    doc = PDDocument.init().load(File.init('/**********/myfile.pdf'));
    
    // returns the page number; page numbers start at 0
    pageNo = pageForSignature(doc, 'twtzceuxvx');
    
    writeDump(label="pageForSignature(doc, fieldName)", var="#pageNo#");
    </cfscript>
    