PDF document manipulation

拈花ヽ惹草 submitted on 2019-12-11 12:19:41

Question


I have several PDFs with the following properties:

Each PDF contains a variable number of "documents", each with a different number of pages.

Each page in a "document" has text such as "Page 3 of 26".

I want to automatically identify the first and last page of each "document" within a PDF and extract them into a new PDF for later printing and archival. (Note: this is not the same as the first and last page of the PDF, since each PDF may contain several "documents".)

I'm not sure what tools or libraries are available to tackle this problem.

Any recommendations? Preferably something free that can be used to build a tool that runs on Windows.


Answer 1:


Java has a nice free PDF library. Check out iText.

From iText's site:

You can use iText to:

  • Serve PDF to a browser
  • Generate dynamic documents from XML files or databases
  • Use PDF's many interactive features
  • Add bookmarks, page numbers, watermarks, etc.
  • Split, concatenate, and manipulate PDF pages
  • Automate filling out of PDF forms
  • Add digital signatures to a PDF file
  • And much more...

Since it's Java, there should be no issues running on Windows, or anywhere else for that matter.




Answer 2:


You can try using pdftk to decompress the PDF, parse the data, split it, and then recompress it.
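As a rough sketch of the splitting step, pdftk's `cat` operation can copy selected pages into a new file. The helper names below (`build_pdftk_cat_command`, `extract_pages`) and the file names are made up for illustration, and the sketch assumes `pdftk` is installed and on the PATH:

```python
import subprocess


def build_pdftk_cat_command(input_pdf, pages, output_pdf):
    """Build the argument list for: pdftk IN cat P1 P2 ... output OUT.

    `pages` holds 1-based page numbers. This helper is hypothetical,
    not part of pdftk itself.
    """
    if not pages:
        raise ValueError("no pages selected")
    return ["pdftk", input_pdf, "cat", *[str(p) for p in pages],
            "output", output_pdf]


def extract_pages(input_pdf, pages, output_pdf):
    """Run pdftk to copy the selected pages (assumes pdftk is on the PATH)."""
    subprocess.run(build_pdftk_cat_command(input_pdf, pages, output_pdf),
                   check=True)
```

Something like `extract_pages("scan.pdf", [1, 26, 27, 30], "first_last.pdf")` would then write only those pages to the new file. Passing the arguments as a list avoids shell quoting problems, which also matters on Windows.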




Answer 3:


I managed to come up with a horrible Unix hack that will work:

  • use pdftk to decompress and explode into separate pages
  • use pdftotext to convert each page into text
  • write a script to identify the appropriate string in each text file and copy the corresponding PDF page into a sub-directory [in progress]
  • find some tool to recombine the pages [to be investigated, pdftk can probably do this]
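The "identify the appropriate string" step could look roughly like the sketch below. It assumes every page carries a marker of the form "Page N of M" in its extracted text, and that pdftotext separates pages with form-feed characters when it converts a whole file at once (its usual behaviour, but worth checking on your files). The function names are made up for illustration:

```python
import re

# Matches markers such as "Page 3 of 26"; the exact wording is an assumption
# based on the example in the question.
PAGE_MARKER = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)


def split_pdftotext_output(text):
    """Split pdftotext output into per-page strings on form feeds."""
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()  # drop the empty chunk after the final form feed
    return pages


def first_and_last_pages(page_texts):
    """Return 1-based PDF page numbers that start or end a "document"."""
    selected = []
    for pdf_page, text in enumerate(page_texts, start=1):
        match = PAGE_MARKER.search(text)
        if match is None:
            continue  # no marker on this page, so it is skipped
        current, total = int(match.group(1)), int(match.group(2))
        if current == 1 or current == total:
            selected.append(pdf_page)
    return selected
```

A one-page "document" ("Page 1 of 1") is selected once rather than twice. Pages where the text extraction misses the marker are silently skipped, so a real tool would probably want to log them.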

This should work on my Unix platform, but I'm not sure whether it is acceptable to bring all these tools into the Windows environment.

One possibility is to set up an email gateway that receives PDFs and returns the processed PDF, which makes it even uglier.

Anyone with a native Win32 solution?



Source: https://stackoverflow.com/questions/730613/pdf-document-manipulation
