pdfbox

PDFBOX: Convert a pdf to text or html, including images from the pdf

a 夏天 submitted on 2019-12-12 02:34:21
Question: I am developing a mobile application that converts PDF to HTML. I found PDFBox, which works very well. I can obtain the PDF text or HTML on one side and the images on the other. But I want to go a little further: I need the generated HTML to contain the images from the PDF. Can this be done with PDFBox? How? If you know of another free library that can do this, please tell me. Thanks in advance. Answer 1: Take a look at ExtractImages.java - this will guide you on how to extract images from a PDF file. Next investigate
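One possible way to make generated HTML carry its images is to inline them as base64 data URIs. The sketch below assumes the image bytes have already been extracted from the PDF by some other step; the class and method names (`InlineImageHtml`, `imgTag`) are hypothetical and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class InlineImageHtml {

    // Builds an <img> tag whose src is a base64 data URI, so the HTML
    // needs no separate image file. How imageBytes were obtained from
    // the PDF is outside this sketch (assumption).
    public static String imgTag(byte[] imageBytes, String mimeType) {
        String b64 = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src=\"data:" + mimeType + ";base64," + b64 + "\"/>";
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for real image data.
        byte[] placeholder = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(imgTag(placeholder, "image/png"));
    }
}
```

Data URIs increase the HTML size by roughly a third compared with the raw image bytes, so this may suit small images better than large scans.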

PDFBox 2 unusual memory consumption

浪子不回头ぞ submitted on 2019-12-12 02:29:05
Question: We are trying to render images from different PDF files using PDFRenderer's method renderImageWithDPI. On a particular PDF, for some pages, the renderer behaves differently. The rendering itself takes far longer than for other similar pages, and memory consumption reaches unusually high values: the memory consumed by the process grows by about 50 MB every 1-2 seconds, until it reaches values like 5 GB of RAM consumed by the application process while in renderImageWithDPI

How to get resource names for optional content group in a pdf?

可紊 submitted on 2019-12-12 02:26:12
Question: I am trying to implement functionality that allows a user to add markups to existing layers in a PDF. Here is the code that I am using to draw lines onto a layer in a PDF: PDResources resources = page.findResources(); PDPropertyList props = resources.getProperties(); COSName resourceName = getLayerResourceName("Superimposed3", resources, props); PDPageContentStream cs1 = new PDPageContentStream(document, page, true, false); cs1.beginMarkedContentSequence(COSName.OC, resourceName); cs1

PDFBox draw black image from BufferedImage

怎甘沉沦 submitted on 2019-12-12 02:06:03
Question: I am trying to draw an image from a BufferedImage into a PDF using PDFBox, but it fails: I get black images, and Acrobat Reader warns with errors like "Out of memory" (although the PDF is displayed). I use a BufferedImage because I need to draw a JavaFX Image object (which comes from a call to Funciones.crearImagenDesdeTexto(), a function that converts text into an Image) into the PDF. The rest of the images work well without using a BufferedImage. PDPixelMap img = null; BufferedImage bi; try { //If item has id, I try

How to get page content height using pdfbox

血红的双手。 submitted on 2019-12-12 01:59:31
Question: Is it possible to get the height of the page content using PDFBox? I think I have tried everything, but each PDRectangle returns the full height of the page: 842. At first I thought this was because of the page number placed at the bottom of the page, but when I opened the PDF in Illustrator, the whole content was inside a compound element that did not extend to the full page height. So if Illustrator can see it as a separate element and calculate its height, I guess this should also be possible in PDFBox.
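If the bounding boxes of the individual content elements can be collected somehow (that collection step is an assumption here and is not shown), the content height is simply the height of their union. A minimal sketch using only `java.awt.geom`; the class name `ContentBounds` and the sample coordinates are made up for illustration.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

public class ContentBounds {

    // Returns the smallest rectangle enclosing all given boxes,
    // or null when the list is empty.
    public static Rectangle2D union(List<Rectangle2D> boxes) {
        Rectangle2D bounds = null;
        for (Rectangle2D box : boxes) {
            bounds = (bounds == null) ? (Rectangle2D) box.clone() : bounds.createUnion(box);
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Hypothetical element boxes in page units (x, y, width, height).
        List<Rectangle2D> boxes = Arrays.<Rectangle2D>asList(
                new Rectangle2D.Double(50, 100, 200, 20),
                new Rectangle2D.Double(50, 700, 300, 30));
        System.out.println("Content height: " + union(boxes).getHeight());
    }
}
```

The page's media box is not consulted at all, which is why the result can be smaller than the full page height of 842 mentioned in the question.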

PDFBox on Mac critical error when silent printing

你。 submitted on 2019-12-12 01:51:59
Question: I have been experimenting with bumping my application's dependency on PDFBox to the 2.0.0 snapshot. I'm having some major issues with it, though. My code receives a PDF as a Base64 String; I decode it and load the resulting byte array into a PDDocument. Before I bumped the version number, calling .silentPrint(); on the PDDocument worked like a charm. The implementation of silent printing changed in 2.0.0, and I now do it this way: private Status doPdfPrint(Document document, PrintService
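The decoding step described in the question can be sketched with the JDK's `java.util.Base64` alone. This is not the asker's code: the class name `Base64PdfDecode` and the `looksLikePdf` sanity check are illustrative additions, and handing the bytes to PDFBox is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64PdfDecode {

    // Decodes a Base64 string into the raw bytes of the document.
    public static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    // Cheap sanity check: PDF files begin with the marker "%PDF-".
    public static boolean looksLikePdf(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Stand-in payload; a real caller would receive this string from elsewhere.
        String payload = Base64.getEncoder()
                .encodeToString("%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        byte[] bytes = decode(payload);
        System.out.println("Looks like a PDF: " + looksLikePdf(bytes));
        // The byte array would then be loaded into a PDDocument (not shown here).
    }
}
```

Checking the header before loading can help separate a corrupted payload from a genuine printing problem.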

Get text layer of a PDF as is and pass it to another PDF

对着背影说爱祢 submitted on 2019-12-12 01:36:10
Question: Good afternoon, I have a problem in my project, which does PDF compression. The process is as follows: extract the images from a PDF; OCR; compression; OCR + merge image and convert to a PDF per page; combine all the generated PDFs with OCR into one as the final product. The size of my original file is 11 MB, and 4.2 MB compressed. The whole process works perfectly, but the problem I have is the speed of the OCR process. I was checking on the web, and I saw a way to circumvent

Extracting text from an area with PDFbox

て烟熏妆下的殇ゞ submitted on 2019-12-12 00:56:30
Question: Is it possible to extract text from an area with PDFBox using just the binaries, instead of having to write my own code? Answer 1: Compile and pack this simple program into a jar: import java.awt.geom.Rectangle2D; import java.io.File; import java.io.IOException; import org.apache.pdfbox.pdmodel.PDDocument; import org.apache.pdfbox.pdmodel.PDPage; import org.apache.pdfbox.text.PDFTextStripperByArea; public class ExtractText { // Usage: xxx.jar filepath page x y width height public static void main

Convert Tiff to Pdf in java using itext

痞子三分冷 submitted on 2019-12-12 00:37:45
Question: I am using the code below to convert TIFF to PDF. It works fine for TIFF images of dimensions 850*1100. But when I give it an input TIFF image of different dimensions (e.g. 1574*732, 684*353 or other 850*1100), I get the error below. Please help me convert TIFF images of different dimensions to PDF. The error occurs for the code below: Compression JPEG is only supported with a single strip. This image has 45 strips. RandomAccessFileOrArray myTifFile = null; com.itextpdf.text

PDF manipulation with placeholders

倖福魔咒の submitted on 2019-12-12 00:27:15
Question: I am looking for a Java tool that can manipulate an existing PDF containing placeholders like ${foo}. I want to generate mail merge documents from it. I found a lot of solutions based on forms, but these do not seem suitable for me. Currently I generate the PDF with iText, but converting existing Word files or similar this way is a really annoying task. I haven't found another solution with iText so far. I also used JODReports in conjunction with JODConverter, but it is necessary to run OpenOffice as a
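The substitution half of a mail merge can be sketched on plain strings with `java.util.regex`. This deliberately ignores the harder part of the question, namely locating and rewriting the placeholder text inside an existing PDF; the class name `PlaceholderMerge` and the rule that unknown placeholders are left untouched are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderMerge {

    // Matches placeholders of the form ${name}.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z0-9_]+)\\}");

    // Replaces each ${name} with values.get(name); placeholders with
    // no matching key are kept verbatim.
    public static String merge(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = values.getOrDefault(m.group(1), m.group(0));
            // quoteReplacement keeps '$' and '\' in values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("foo", "Ann");
        System.out.println(merge("Dear ${foo}, your id is ${bar}.", values));
    }
}
```

Leaving unknown placeholders in place makes missing data visible in the output instead of silently producing blanks.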