pdfbox

PdfBox text extraction not working properly

最后都变了- submitted on 2020-01-05 12:34:07
Question: PDFTextStripper stripper = new PDFTextStripper(); PDDocument document = PDDocument.load(inputStream); String text = stripper.getText(document); Extracted text: http://pastebin.com/BXFfMy0z Problem PDF: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf What can I do to extract the correct text from this PDF file? Answer 1: In addition to @karthik27's answer: Adobe Reader is fairly good at text extraction and can therefore generally be used as an indicator of whether text extraction

Add a watermark on a pdf that contains images using pdfbox (1.7)

白昼怎懂夜的黑 submitted on 2020-01-05 08:41:58
Question: I have used the code suggested in: PDFBox Overlay fails to add a watermark to an existing pdf. Unfortunately, the PDF produced is corrupted. The PDF reader complains when I open the document: "An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem". The document opens, but it does not show the images. It seems to happen with all PDFs. It may be worth saying that it also happens with a

Need advice on checking signature/certificate of a signed pdf using java

陌路散爱 submitted on 2020-01-05 05:41:08
Question: I have several questions about the code below. I have googled and read the Javadoc. import org.apache.pdfbox.io.IOUtils; import org.apache.pdfbox.pdmodel.PDDocument; import org.apache.pdfbox.pdmodel.PDDocumentCatalog; import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException; import org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature; import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm; import org.apache.pdfbox.pdmodel.interactive.form.PDField; import org.apache.pdfbox.pdmodel

Parse PDF table and display it as CSV(Java)

霸气de小男生 submitted on 2020-01-05 05:26:25
Question: I am trying to parse a table in a PDF file and display it as CSV. I have attached sample data from the PDF below (only a few columns) and sample output for the same. Each column width is fixed, let's say Company Name (18 chars), Amount (8 chars), Type (5 chars), etc. I tried using the iText and PDFBox jars to get each page's data and parsed it line by line, but that does not seem to be a great solution because the line breaks and page breaks in the PDF are not proper. Please let me know if there is any other appropriate
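A minimal sketch of the fixed-width idea from the question above, assuming each line of extracted page text really does keep the stated column widths (18, 8, and 5 characters). The class name FixedWidthToCsv, its helper methods, and the sample row are hypothetical and only illustrate slicing and CSV quoting; they do not address the broken line and page breaks the question mentions.

```java
import java.util.ArrayList;
import java.util.List;

final class FixedWidthToCsv {

    /** Cuts one text line into trimmed cells using the given column widths. */
    static List<String> splitFixedWidth(String line, int[] widths) {
        List<String> cells = new ArrayList<>();
        int start = 0;
        for (int width : widths) {
            // Clamp both ends so lines shorter than the full row yield empty cells.
            int begin = Math.min(start, line.length());
            int end = Math.min(start + width, line.length());
            cells.add(line.substring(begin, end).trim());
            start += width;
        }
        return cells;
    }

    /** Quotes a cell if it contains a comma, a double quote, or a newline. */
    static String escapeCsv(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    /** Converts one fixed-width line into one CSV line. */
    static String toCsvLine(String line, int[] widths) {
        List<String> escaped = new ArrayList<>();
        for (String cell : splitFixedWidth(line, widths)) {
            escaped.add(escapeCsv(cell));
        }
        return String.join(",", escaped);
    }

    public static void main(String[] args) {
        // Assumed widths: Company Name (18), Amount (8), Type (5).
        int[] widths = {18, 8, 5};
        // Hypothetical sample row, padded to the assumed widths.
        String row = String.format("%-18s%-8s%-5s", "Acme, Inc.", "1200.50", "DEBIT");
        System.out.println(toCsvLine(row, widths)); // prints "Acme, Inc.",1200.50,DEBIT
    }
}
```

In practice the input lines would come from the text extracted for each page, and the sketch only works as long as that text keeps the column padding intact.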

PDFBox Avoid Do you want to save changes before closing

痴心易碎 submitted on 2020-01-05 05:10:21
Question: I am currently trying to add a button to an existing PDF page that closes the current tab when clicked. I have achieved that with the following code using PDFBox 2.0.15... try { InputStream pdfInput = new FileInputStream(new File("C:\\Users\\justi\\Desktop\\test\\real.pdf")); PDDocument doc = PDDocument.load(pdfInput); PDPage page = doc.getPage(0); // PDDocument doc = new PDDocument(); // PDPage page = new PDPage(); // doc.addPage(page); COSDictionary acroFormDict = new

Using PDFBox, how to set an appearance stream on annotation?

眉间皱痕 submitted on 2020-01-05 05:01:52
Question: I am trying to highlight some text and convert the page to an image. I tried a few things, but the annotation did not come out on the image. While looking for help I found this issue, http://issues.apache.org/jira/browse/PDFBOX-2162, which says that I must set an appearance stream on the annotation. Acrobat Reader does this automatically, but it is needed when converting to an image. I could not figure out how to set the appearance stream on the annotation. I looked for some examples on annotations and

Confusion about current transformation matrix in a PDF

被刻印的时光 ゝ submitted on 2020-01-05 03:49:23
Question: I am confused about the current transformation matrix (CTM) in PDFs. For page 5 in this PDF, I have examined the token stream (http://pastebin.com/k6g4BGih), and it shows that the last cm operation before the curve (c) commands sets the transformation matrix to COSInt{10},COSInt{0},COSInt{0},COSInt{10},COSInt{0},COSInt{0}. The full output is at http://pastebin.com/9XaPQQm9. Next I used the following code to extract the line and curve commands from the same page following a
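For reference, a small sketch of how the six operands of a cm operator act on a point. In the PDF imaging model the operands a b c d e f map (x, y) to (a*x + c*y + e, b*x + d*y + f), and each cm is multiplied in front of the existing CTM rather than replacing it. The class name CtmDemo and the translation used in main are hypothetical; only the 10 0 0 10 0 0 operands come from the question.

```java
final class CtmDemo {

    /** Applies the PDF matrix [a b c d e f] to the point (x, y). */
    static double[] apply(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    /** Returns the matrix that applies first, then second (row-vector product first x second). */
    static double[] concat(double[] first, double[] second) {
        return new double[] {
            first[0] * second[0] + first[1] * second[2],
            first[0] * second[1] + first[1] * second[3],
            first[2] * second[0] + first[3] * second[2],
            first[2] * second[1] + first[3] * second[3],
            first[4] * second[0] + first[5] * second[2] + second[4],
            first[4] * second[1] + first[5] * second[3] + second[5]
        };
    }

    public static void main(String[] args) {
        double[] ctm = {1, 0, 0, 1, 0, 0};      // identity at the start of the page
        double[] scale = {10, 0, 0, 10, 0, 0};  // operands of the cm from the question
        ctm = concat(scale, ctm);               // cm: new CTM = operand matrix x old CTM

        double[] p = apply(ctm, 3, 4);
        System.out.println(p[0] + ", " + p[1]); // prints 30.0, 40.0

        // Hypothetical second cm: a translation by (5, 5), expressed in the scaled space.
        double[] translate = {1, 0, 0, 1, 5, 5};
        ctm = concat(translate, ctm);
        double[] q = apply(ctm, 3, 4);
        System.out.println(q[0] + ", " + q[1]); // prints 80.0, 90.0
    }
}
```

The second result shows why the order matters: the later translation is applied to the point first and is then scaled by the earlier matrix.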

How to extract page number from PDF file [closed]

╄→гoц情女王★ submitted on 2020-01-04 17:34:27
Question: We explored many APIs, such as Tika, PDFBox, and itextpdf, to extract the page number from a PDF file, but we were not able to do this. In itextpdf we found PdfPageLabels.getPageLabels(reader), but the behaviour of this method is not uniform. Answer 1: The reason why you don't find any software that

Filling landscape PDF with PDFBox

早过忘川 submitted on 2020-01-04 15:29:58
Question: I am trying to fill a PDF form with PDFBox, and I managed to do it well with a portrait-oriented document. But I have a problem when filling a document in landscape mode. The fields are filled in, but the text orientation is wrong. It appears vertically, as if the page were still in portrait but rotated by 90 degrees. Here is my simplified code: PDDocument pdfDoc = PDDocument.load(MY_FILE); PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog(); PDAcroForm acroForm = docCatalog.getAcroForm();