Extract PDF form field names from a PDF form

旧巷少年郎 2020-12-29 05:24

I'm using pdftk to fill in a PDF form with an XFDF file. However, for this project I do not know in advance which fields will be present, so I need to analyse the PDF itself to extract its field names.

6 Answers
  • 2020-12-29 05:42

    This worked for me:

     pdftk 1.pdf dump_data_fields output test2.txt
    

    If the file is encrypted with a password, this is how you can read from it:

     pdftk 1.pdf input_pw YOUR_PASSWORD_GOES_HERE dump_data_fields output test2.txt
    

    This took me 2 hours to get right, so hopefully I save you some time :)
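
    The two invocations above differ only in the input_pw argument, so they can be wrapped in one helper. The following is a minimal Ruby sketch, not part of the original answer: the names dump_fields_command and dump_fields are made up for illustration, and it assumes pdftk is on the PATH and prints the field dump to standard output when no output file is given.

```ruby
require 'open3'

# Build the argument list for `pdftk <pdf> [input_pw <password>] dump_data_fields`.
# `password` is optional; when present it is passed through input_pw,
# mirroring the encrypted-file command shown above.
# (Hypothetical helper name, chosen for this sketch.)
def dump_fields_command(pdf_path, password: nil)
  cmd = ['pdftk', pdf_path]
  cmd += ['input_pw', password] if password
  cmd + ['dump_data_fields']
end

# Run pdftk without going through a shell and return its combined
# stdout/stderr text. Raises if pdftk exits with a non-zero status.
def dump_fields(pdf_path, password: nil)
  output, status = Open3.capture2e(*dump_fields_command(pdf_path, password: password))
  raise "pdftk failed: #{output}" unless status.success?
  output
end
```

    Passing the arguments as an array avoids shell quoting problems with file names or passwords that contain spaces or special characters.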

  • 2020-12-29 05:45

    I used the following code, using ABCpdf from WebSupergoo, but I imagine most libraries have comparable classes:

    protected void Button1_Click(object sender, EventArgs e)
        {
            Doc thedoc = new Doc();
            string saveFile = "~/docs/f1_filled.pdf";
            System.Text.StringBuilder sb = new System.Text.StringBuilder();
            thedoc.Read(Server.MapPath("~/docs/F1_2010.pdf"));
            foreach (Field fld in thedoc.Form.Fields)
            {
                if (fld.Page != null)
                {
                    sb.AppendFormat("Field: {0}, Type: {1},page: {4},x: {2},y: {3}\n", fld.Name, fld.FieldType.ToString(), fld.Rect.Left, fld.Rect.Top, fld.Page.PageNumber);
                }
                else
                {
                    sb.AppendFormat("Field: {0}, Type: {1},page: {4},x: {2},y: {3}\n", fld.Name, fld.FieldType.ToString(), fld.Rect.Left, fld.Rect.Top, "None");
                }
                if (fld.FieldType == FieldType.Text)
                {
                    fld.Value = fld.Name;
                }
    
            }
    
            this.TextBox1.Text = sb.ToString();
            this.TextBox1.Visible = true;
            thedoc.Save(Server.MapPath(saveFile));
            Response.Redirect(saveFile);
        }
    

    This does two things: 1) It populates a text box with an inventory of all form fields, showing each field's name, type, page number, and position on the page ((0,0) is the lower-left corner, by the way). 2) It fills every text field with its own field name in an output file; print that file and all of your text fields will be labelled.

  • 2020-12-29 05:52

    C# / iTextSharp

        public static void TracePdfFields(string pdfFilePath)
        {
            PdfReader pdfReader = new PdfReader(pdfFilePath);
            MemoryStream pdfStream = new MemoryStream();
            PdfStamper pdfStamper = new PdfStamper(pdfReader, pdfStream, '\0', true);
    
            int i = 1;
            foreach (var f in pdfStamper.AcroFields.Fields)
            {
                pdfStamper.AcroFields.SetField(f.Key, string.Format("{0} : {1}", i, f.Key));
                i++;
                //DoTrace("Field = [{0}] | Value = [{1}]", f.Key, f.Value.ToString());
            }
            pdfStamper.FormFlattening = false;
            pdfStamper.Writer.CloseStream = false;
            pdfStamper.Close();
    
            FileStream fs = File.OpenWrite(string.Format(@"{0}/{1}-TracePdfFields_{2}.pdf", 
                ConfigManager.GetInstance().LogConfig.Dir, 
                new FileInfo(pdfFilePath).Name, 
                DateTime.Now.Ticks));
    
            fs.Write(pdfStream.ToArray(), 0, (int)pdfStream.Length);
            fs.Flush();
            fs.Close();
        }
    
  • 2020-12-29 06:04

    Easy! You are using pdftk already

    # pdftk input.pdf dump_data_fields
    

    It will output the field name, the field type, some of its properties (such as the options of a dropdown list or the text alignment) and even the tooltip text (which I found to be extremely useful).

    The only thing I'm missing is field coordinates...
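
    To work with that output programmatically, it can be parsed into one record per field. Below is a minimal Ruby sketch, not from the original answer. It assumes that records are separated by lines containing only "---", that every line has the form "Key: value", and that repeated keys (for example FieldStateOption for dropdown options) should be gathered into an array. The function name parse_dump_data_fields is made up for illustration.

```ruby
# Parse the text printed by `pdftk input.pdf dump_data_fields`
# into an array of hashes, one hash per field.
# Assumptions (see the lead-in): "---" separates records, lines look like
# "Key: value", and a key that repeats within a record becomes an array.
def parse_dump_data_fields(text)
  records = text.split(/^---\s*$/).map do |block|
    record = {}
    block.each_line do |line|
      match = line.chomp.match(/\A(\w+):\s?(.*)\z/)
      next unless match # skip blank or unexpected lines

      key = match[1]
      value = match[2]
      if record.key?(key)
        # Second or later occurrence: collect all values in an array
        record[key] = Array(record[key]) << value
      else
        record[key] = value
      end
    end
    record
  end
  records.reject(&:empty?)
end
```

    With records in this shape, the tooltip of each field would be available under the FieldNameAlt key and the dropdown options under FieldStateOption, provided the installed pdftk version emits those keys.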

  • 2020-12-29 06:05

    A very late answer from me, and my solution is not PHP, but I hope it comes in handy for anyone looking for a Ruby solution.

    First, use pdftk to extract all the field names, then clean up the dump text to get a readable list of names:

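    require 'shellwords'

    def extract_fields(filename)
      # Escape the file name so paths with spaces or shell metacharacters are safe
      field_output = `pdftk #{Shellwords.escape(filename)} dump_data_fields 2>&1`
      # Records are separated by "---" lines; keep one name per record
      @fields = field_output.split(/^---\n/).map do |field_text|
        # Field names can contain dots, brackets and spaces, so capture the rest of the line
        if field_text =~ /^FieldName: (.+)$/
          $1
        end
      end.compact.uniq
    end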
    

    Second, we can now use any XML library to construct our XFDF:

    # code borrowed from `nguyen` gem [https://github.com/joneslee85/nguyen]
    # generate XFDF content
    def to_xfdf(fields = {}, options = {})
      builder = Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
        xml.xfdf('xmlns' => 'http://ns.adobe.com/xfdf/', 'xml:space' => 'preserve') {
          xml.f(:href => options[:file]) if options[:file]
          xml.ids(:original => options[:id], :modified => options[:id]) if options[:id]
          xml.fields {
            fields.each do |field, value|
              xml.field(:name => field) {
                if value.is_a? Array
                  value.each { |item| xml.value(item.to_s) }
                else
                  xml.value(value.to_s)
                end
              }
            end
          }
        }
      end
      builder.to_xml
    end
    
    # write XFDF content to path
    def save_to(path)
      (File.open(path, 'w') << to_xfdf).close
    end
    

    Voilà, that's the main logic. I highly recommend giving the nguyen gem (https://github.com/joneslee85/nguyen) a try if you are looking for a lightweight Ruby library.

  • 2020-12-29 06:06

    I can get my client to export the XFDF file (which contains field names) using Acrobat along with the PDF, which avoids this problem completely.
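
    If you go this route, the field names can be read straight out of the exported XFDF. Here is a minimal Ruby sketch, not part of the original answer: it uses REXML (a bundled gem in recent Ruby versions), the function names xfdf_field_names and collect_field_names are made up for illustration, and it assumes nested field elements should be joined with dots to form the fully qualified name.

```ruby
require 'rexml/document'

# Return the fully qualified field names found in XFDF text.
# (Hypothetical helper name, chosen for this sketch.)
def xfdf_field_names(xfdf_text)
  root = REXML::Document.new(xfdf_text).root
  return [] unless root

  fields_root = root.elements.find { |child| child.name == 'fields' }
  return [] unless fields_root

  collect_field_names(fields_root, nil)
end

# Walk <field> children recursively. A field that contains further <field>
# elements is treated as a parent, and only its descendants are reported,
# with names joined by "." (an assumption about hierarchical naming).
def collect_field_names(parent, prefix)
  parent.elements.select { |child| child.name == 'field' }.flat_map do |field|
    full_name = [prefix, field.attributes['name']].compact.join('.')
    nested = collect_field_names(field, full_name)
    nested.empty? ? [full_name] : nested
  end
end
```

    Comparing element names directly, rather than using XPath expressions, sidesteps any question of how the XFDF default namespace is matched.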
