pdfbox

Create new custom COSBase objects with PDFBox?

落爺英雄遲暮 submitted on 2019-12-24 18:50:16
Question: Can we create a new custom PDFOperator (like PDFOperator{BDC}) and COSBase objects (like COSName{P} and COSName{Prop1}, where Prop1 in turn references one more object), and add these to the root structure of a PDF? I have read a list of parser tokens from an existing PDF document, and I want to tag the PDF. In that process I will first manipulate the list of tokens with newly created COSBase objects, and finally add them to the root tree structure. So how can I create COSBase objects here? I am using

Placing an image over text, by using the text position in a PDF using PDFBox.

风格不统一 submitted on 2019-12-24 18:14:05
Question: The result is that the image is not placed correctly over the text. Am I getting the text positions wrong? This is an example of how to get the x/y coordinates and size of each character in a PDF: public class MyClass extends PDFTextStripper { pdocument = PDDocument.load(new File(fileName)); stripper = new GetCharLocationAndSize(); stripper.setSortByPosition(true); stripper.setStartPage(0); stripper.setEndPage(pdocument.getNumberOfPages()); Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());

Print byte[] to pdf using pdfbox

风格不统一 submitted on 2019-12-24 15:27:40
Question: I have a question about writing an image to a PDF using PDFBox. My requirement is very simple: I get an image from a web service using Spring's RestTemplate and store it in a byte[] variable, and I need to draw that image into a PDF document: final byte[] image = this.restTemplate.getForObject(this.imagesUrl + cableReference + this.format, byte[].class); I know that the following are provided: JPEGFactory.createFromStream for JPEG format, CCITTFactory.createFromFile for TIFF images, LosslessFactory

Excluding super script when extracting text from pdf

杀马特。学长 韩版系。学妹 submitted on 2019-12-24 14:37:43
Question: I have extracted text from a PDF line by line using PDFBox, in order to process it sentence by sentence with my algorithm. I recognize a sentence boundary as a period (.) followed by a word whose first letter is capitalized. The issue is that when a sentence ends with a word that has a superscript, the extractor treats the superscript as a normal character and places it next to the period. For example, the expression "2 power 22", when it appears as the last word in a sentence (i.e. followed by a period), is extracted as 2.22, which makes it

How to add multiple pages in PDFBox

回眸只為那壹抹淺笑 submitted on 2019-12-24 13:19:31
Question: I want to write some content to my PDF using PDFBox. Once the remaining page height is less than the margin, I need to create another page. I want to retain the cursor information. Is there a way to get the cursor information, such as where the cursor currently is, so that I can subtract the margin from the cursor position and add another page? Right now I have done something like this: PDRectangle rect = page.getMediaBox(); float positionY = rect.getWidth(); positionY = positionY - pdfWriter
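
The cursor tracking described in this question can be sketched in plain Java. As far as I know, the PDFBox content stream does not report a current text position, so this sketch assumes the caller keeps its own Y coordinate. PageCursor and all of its members are hypothetical names, not PDFBox API, and the starting position is assumed to come from the page height (for example page.getMediaBox().getHeight()) rather than the width.

```java
/**
 * Hypothetical helper (not part of PDFBox): tracks a vertical text cursor
 * measured from the bottom of the page, as PDF coordinates are.
 */
public final class PageCursor {
    private final float pageHeight; // assumed to be the media box height
    private final float margin;     // same margin at top and bottom
    private float y;                // current baseline position
    private int pageCount = 1;

    public PageCursor(float pageHeight, float margin) {
        this.pageHeight = pageHeight;
        this.margin = margin;
        this.y = pageHeight - margin; // start just below the top margin
    }

    /**
     * Moves the cursor down by one line of the given leading.
     * Returns true if the line did not fit and a new page was started first,
     * which is the caller's cue to create a new page and content stream.
     */
    public boolean advance(float leading) {
        boolean startedNewPage = false;
        if (y - leading < margin) {
            pageCount++;
            y = pageHeight - margin; // reset to the top of the new page
            startedNewPage = true;
        }
        y -= leading;
        return startedNewPage;
    }

    public float getY() {
        return y;
    }

    public int getPageCount() {
        return pageCount;
    }
}
```

In use, the caller would call advance(leading) before drawing each line, open a new page whenever it returns true, and then draw the text at getY().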

Pdfbox - adding pdf embedded File and save the PDDocument to OutputStream does not keep the embedded Files

牧云@^-^@ submitted on 2019-12-24 10:55:18
Question: I'm using PDFBox (1.8.8) to add attachments to a PDF. My problem is that when one of the attachments is of type .pdf and I save the PDDocument to an OutputStream, the final PDF document does not include the attachments. If I save the PDDocument to a file instead of an OutputStream, everything works fine, and if the attachments do not include any PDF, saving to either a file or an OutputStream works fine. I would like to know if there is any way to add embedded PDF files and save the PDDocument to

PDFBox TextPosition width and height in mm

走远了吗. submitted on 2019-12-24 10:37:42
Question: I am using PDFTextStripper to extract text from a PDF. I want to get the width and height, in millimeters, of each TextPosition. These can be obtained from a given TextPosition tp using tp.getWidth() and tp.getHeight(). My problem is that the values returned are in display units. I tried to find the right conversion factor but got confused. I know that PDFs use different coordinate systems, as explained in the PDF documentation (picture below). I also found this post, but it may be
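
The unit conversion asked about here can be sketched as follows, under one loudly stated assumption: that the values returned by tp.getWidth() and tp.getHeight() are in default PDF user-space units of 1/72 inch. If the page defines a different user unit or the values are scaled in some other way, the factor has to be adjusted. PointsToMillimeters and toMillimeters are hypothetical names, not PDFBox API.

```java
/** Hypothetical helper (not part of PDFBox) for converting PDF points to millimeters. */
public final class PointsToMillimeters {
    // Assumption: 1 unit = 1/72 inch (default PDF user space); 1 inch = 25.4 mm.
    private static final double MM_PER_POINT = 25.4 / 72.0;

    private PointsToMillimeters() {
    }

    /** Converts a length in points (1/72 inch) to millimeters. */
    public static double toMillimeters(double points) {
        return points * MM_PER_POINT;
    }

    public static void main(String[] args) {
        double widthInPoints = 72.0; // stand-in for a value such as tp.getWidth()
        System.out.println(toMillimeters(widthInPoints) + " mm");
    }
}
```

Under that assumption, a width of 72 units corresponds to one inch, i.e. about 25.4 mm.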

java- rotated file extraction?

匆匆过客 submitted on 2019-12-24 10:17:43
Question: I am using PDFBox to do a simple extraction of words from a PDF file and then insert those words into a database table. From what I have tested, text rotated 90 degrees clockwise in a PDF gives gibberish results when I try to extract the words. For example, the word database in the file yields both atabase and database as two different words. Obviously, atabase does not exist in the PDF file. I tried rotating the original file upright and doing the extraction, and it

Identify and extract table from pdf using java

依然范特西╮ submitted on 2019-12-24 07:49:37
Question: I have different types of PDFs that contain multiple things such as text and tables. A table may appear anywhere in the PDF (top, middle, bottom). I want to extract only the table data (number of columns, number of rows, and the data in the table) from the PDF using Java, without passing a location. What I have done so far: 1. I have used the iText Java API to read and extract, using the following code: PdfTextExtractor.getTextFromPage, but it only returns the data in the form of text. I didn't get any clue to identify where

No glyph for U+000D in font Helvetica

*爱你&永不变心* submitted on 2019-12-24 07:47:39
Question: How do I solve this for PDFBox with Boxable? In table.draw I get: No glyph for U+000D in font Helvetica. What should I do? I am building a table with Boxable. Answer 1: That error tells you that the strings you use to fill the tables contain CR (carriage return) characters. Do not use control characters (like CR, LF, TAB, ...) in those strings, as your software stack does not interpret them to mean something like a line break; instead it tries to interpret them as glyphs in the font, which it fails doing
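
One way to follow the advice in this answer is to clean the strings before they reach the table. The following is a minimal plain-Java sketch, not Boxable or PDFBox API: CellTextSanitizer and toPrintableLines are hypothetical names. It assumes that line breaks in the input are meant as real line breaks, and that any other control character (such as TAB) can be replaced by a space.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox or Boxable) that removes control characters. */
public final class CellTextSanitizer {
    private CellTextSanitizer() {
    }

    /**
     * Splits the text on CR LF, lone CR, or lone LF, then replaces every
     * remaining control character in each line with a single space.
     */
    public static List<String> toPrintableLines(String raw) {
        List<String> lines = new ArrayList<>();
        // limit -1 keeps trailing empty lines instead of silently dropping them
        for (String line : raw.split("\\r\\n|\\r|\\n", -1)) {
            lines.add(line.replaceAll("\\p{Cntrl}", " "));
        }
        return lines;
    }
}
```

Each returned line contains no CR, LF, or TAB, so it can be handed to the drawing code line by line, or joined again in whatever way the table layout expects.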