pdfbox | 易学教程

PdfBox text extraction not working properly

Question: PDFTextStripper stripper = new PDFTextStripper(); PDDocument document = PDDocument.load(inputStream); String text = stripper.getText(document); Extracted text: http://pastebin.com/BXFfMy0z Problem PDF: http://www.iwb.ch/media/Unternehmen/Dokumente/inserat_leiter_pm.pdf What can I do to extract the correct text from this PDF file?

Answer 1: In addition to @karthik27's answer: Adobe Reader is fairly good at text extraction and, therefore, generally can be used as an indicator whether text extraction

Add a watermark on a pdf that contains images using pdfbox (1.7)

Question: I have used the code suggested in "PDFBox Overlay fails to add a watermark to an existing pdf". Unfortunately, the PDF produced is corrupted. The PDF reader complains when I open the document: "An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem." The document opens, but it does not show the images. It seems to happen with all PDFs. It could be worth saying that it happens also with a

PDFPage setBounds is blurry and deform

Question: The page I get from page.setBounds is blurry and deformed, but the actual page (even after zooming) is very clear. I want to maintain the PDF quality after setBounds. myPage.setBounds(box.bounds, for: .cropBox)

Source: https://stackoverflow.com/questions/58410267/pdfpage-setbounds-is-blurry-and-deform

Need advice on checking signature/certificate of a signed pdf using java

Question: I have several questions about the code below. I have googled and read the Javadoc. import org.apache.pdfbox.io.IOUtils; import org.apache.pdfbox.pdmodel.PDDocument; import org.apache.pdfbox.pdmodel.PDDocumentCatalog; import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException; import org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature; import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm; import org.apache.pdfbox.pdmodel.interactive.form.PDField; import org.apache.pdfbox.pdmodel

Parse PDF table and display it as CSV(Java)

Question: I am trying to parse a table in a PDF file and display it as CSV. I have attached sample data from the PDF below (only a few columns) and sample output for the same. Each column width is fixed, let's say Company Name (18 chars), Amount (8 chars), Type (5 chars), etc. I tried using the iText and PDFBox jars to get each page's data and parse it line by line, but that does not seem to be a good solution, as the line breaks and page breaks in the PDF are not reliable. Please let me know if there is any other appropriate
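
The fixed column widths mentioned in the question (18, 8 and 5 characters) suggest one possible slicing step, sketched below in plain Java. This is only an illustration under assumptions: it presumes each table row has already been extracted as one clean text line with its padding intact, which the question says is not always the case. The class name `FixedWidthToCsv`, its method names, and the sample row are hypothetical and do not come from the original post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: slices one fixed-width text line into CSV cells.
class FixedWidthToCsv {
    // Column widths taken from the question: Company Name (18), Amount (8), Type (5).
    static final int[] WIDTHS = {18, 8, 5};

    // Cuts the line at the given widths, trims each cell, and joins the cells with commas.
    // Lines shorter than the total width yield empty trailing cells instead of an exception.
    static String toCsvLine(String line, int[] widths) {
        List<String> cells = new ArrayList<>();
        int start = 0;
        for (int width : widths) {
            int from = Math.min(start, line.length());
            int to = Math.min(start + width, line.length());
            cells.add(quote(line.substring(from, to).trim()));
            start += width;
        }
        return String.join(",", cells);
    }

    // Quotes a cell only when it contains a comma, a double quote, or a line break.
    static String quote(String cell) {
        if (cell.contains(",") || cell.contains("\"") || cell.contains("\n")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    public static void main(String[] args) {
        // Made-up sample row, padded to the assumed column widths.
        String line = String.format("%-18s%-8s%-5s", "Acme, Inc.", "1250.00", "CR");
        System.out.println(toCsvLine(line, WIDTHS)); // prints "Acme, Inc.",1250.00,CR
    }
}
```

Slicing by character offset only works when the extracted text preserves the column padding; if the extraction collapses whitespace, a position-based approach would be needed instead.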

PDFBox Avoid Do you want to save changes before closing

Question: I am currently trying to add a button to an existing PDF page that closes the current tab when clicked. I have achieved that with the following code using PDFBox 2.0.15... try { InputStream pdfInput = new FileInputStream(new File("C:\\Users\\justi\\Desktop\\test\\real.pdf")); PDDocument doc = PDDocument.load(pdfInput); PDPage page = doc.getPage(0); // PDDocument doc = new PDDocument(); // PDPage page = new PDPage(); // doc.addPage(page); COSDictionary acroFormDict = new

Using PDFBox, how to set an appearance stream on annotation?

Question: I am trying to highlight some text and convert it to an image. I tried a few things, but the annotation did not come out on the image. Looking for help, I found this issue, http://issues.apache.org/jira/browse/PDFBOX-2162, which says that I must set an appearance stream on the annotation, something that Acrobat Reader does automatically, but which is needed when converting to an image. I could not figure out how to set the appearance stream on the annotation. I looked for some examples on annotations and

Confusion about current transformation matrix in a PDF

Question: I am confused about the current transformation matrix (CTM) in PDFs. For page 5 in this PDF, I have examined the token stream (http://pastebin.com/k6g4BGih), and it shows that the last cm operation before the curve (c) commands sets the transformation matrix to COSInt{10},COSInt{0},COSInt{0},COSInt{10},COSInt{0},COSInt{0}. The full output is at http://pastebin.com/9XaPQQm9. Next I used the following code to extract the line and curve commands from the same page following a
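
As background for this question: in the PDF specification, the `cm` operator does not replace the CTM; it concatenates its operand matrix with the CTM already in effect, so the operand is applied to coordinates first. Whether that is the source of the asker's confusion cannot be told from the truncated excerpt. The plain-Java sketch below shows the arithmetic only; the class name `CtmSketch`, the earlier translation matrix, and the sample point are hypothetical values chosen for illustration, and only the `10 0 0 10 0 0` operand comes from the question.

```java
// Hypothetical sketch of how "a b c d e f cm" updates the current transformation matrix.
class CtmSketch {
    // A PDF matrix [a b c d e f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f).

    // Returns m x ctm, the new CTM after "m cm" is executed while ctm is in effect.
    static double[] concat(double[] m, double[] ctm) {
        return new double[] {
            m[0] * ctm[0] + m[1] * ctm[2],
            m[0] * ctm[1] + m[1] * ctm[3],
            m[2] * ctm[0] + m[3] * ctm[2],
            m[2] * ctm[1] + m[3] * ctm[3],
            m[4] * ctm[0] + m[5] * ctm[2] + ctm[4],
            m[4] * ctm[1] + m[5] * ctm[3] + ctm[5]
        };
    }

    // Transforms a point from user space through the given matrix.
    static double[] apply(double[] matrix, double x, double y) {
        return new double[] {
            matrix[0] * x + matrix[2] * y + matrix[4],
            matrix[1] * x + matrix[3] * y + matrix[5]
        };
    }

    public static void main(String[] args) {
        double[] earlier = {1, 0, 0, 1, 50, 100}; // assumed CTM from an earlier cm (a translation)
        double[] operand = {10, 0, 0, 10, 0, 0};  // the cm operand quoted in the question
        double[] ctm = concat(operand, earlier);  // [10 0 0 10 50 100]
        double[] p = apply(ctm, 2, 3);
        System.out.println(p[0] + ", " + p[1]);   // prints 70.0, 130.0
    }
}
```

If no other `cm` precedes it within the same graphics state, the earlier matrix is the page's initial CTM, so the operand alone rarely gives the final device coordinates.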

How to extract page number from PDF file [closed]

Question: We explored many APIs, such as Tika, PDFBox and itextpdf, to extract page numbers from a PDF file, but we were not able to do this. In itextpdf we found PdfPageLabels.getPageLabels(reader), but the behaviour of this method is not uniform.

Answer 1: The reason why you don't find any software that

Filling landscape PDF with PDFBox

Question: I am trying to fill a PDF form with PDFBox, and I managed to do it well with a portrait-oriented document. But I have a problem when filling a document in landscape mode. The fields are filled, but the text orientation is wrong: it appears vertically, as if the document were still in portrait but rotated by 90 degrees. Here is my simplified code: PDDocument pdfDoc = PDDocument.load(MY_FILE); PDDocumentCatalog docCatalog = pdfDoc.getDocumentCatalog(); PDAcroForm acroForm = docCatalog.getAcroForm();