pdfbox

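Maven lifecycle and plugin introduction (Part 2)

独自空忆成欢 submitted on 2019-12-04 09:34:27
Apache POI API and usage tutorials. Apache POI is an open-source Java project for reading and writing Microsoft OLE2 compound documents such as Excel and Word files. A Ruby version of POI is now also available.
Components:
HSSF - reads and writes Microsoft Excel XLS files.
XSSF - reads and writes Microsoft Excel OOXML XLSX files.
HWPF - reads and writes Microsoft Word 97 DOC files.
XWPF - reads and writes Microsoft Word 2007 OOXML DOCX files.
HSLF - reads and writes Microsoft PowerPoint files.
HDGF - reads Microsoft Visio files.
HPBF - reads Microsoft Publisher files.
HSMF - reads Microsoft Outlook files.
Because there is too much material to cover, I only list learning resources for the API and its usage; they work both as tutorials and as a reference during development:
1. Official documentation (English): Apache POI official API documentation
2. Recommended tutorials: 易百网 (Yiibai) Chinese tutorials (highly recommended): Java POI Excel v3.17 tutorial, Java POI Word v3.17 tutorial, Java POI PPT v3.17 tutorial, Java PDF pdfbox tutorial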

Why some of the content is getting cropped off after resizing the page to 7.31 x 11 size?

狂风中的少年 submitted on 2019-12-04 06:22:30
Question: When I resize a page to 7.31 x 11, some of the content on that page gets cropped off. My output document is here: http://www.filedropper.com/mynewdocument My source code is below:
import java.awt.print.PrinterException;
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.encryption
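
A rough sketch of the unit arithmetic behind such a resize, in plain Java with no PDFBox calls. It assumes the 7.31 x 11 size is in inches and, purely for illustration, that the source page is US Letter (612 x 792 points); neither is stated in the question. PDF user space defaults to 72 units per inch, so if only the page box is narrowed, content laid out for the wider page can fall outside it unless the content is also scaled. The class and method names here are hypothetical.

```java
/** Unit arithmetic for resizing a PDF page; plain Java, no PDFBox calls. */
final class PageResizeMath {
    /** Default PDF user space: 72 units (points) per inch. */
    static final double POINTS_PER_INCH = 72.0;

    static double inchesToPoints(double inches) {
        return inches * POINTS_PER_INCH;
    }

    /** Largest uniform scale at which a source page fits inside a target page. */
    static double fitScale(double srcW, double srcH, double dstW, double dstH) {
        return Math.min(dstW / srcW, dstH / srcH);
    }

    public static void main(String[] args) {
        double targetW = inchesToPoints(7.31);
        double targetH = inchesToPoints(11.0);
        // Assumption: the source page is US Letter, 612 x 792 points.
        double scale = fitScale(612.0, 792.0, targetW, targetH);
        System.out.printf("target = %.2f x %.2f pt, scale = %.4f%n", targetW, targetH, scale);
    }
}
```

With PDFBox, the target width and height would typically go into a `PDRectangle` set as the page's media box, and the scale factor would still have to be applied to the page content separately.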

PDFTextStripper parsing with wrong encoding

混江龙づ霸主 submitted on 2019-12-04 06:18:20
Question:
PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
The result contains something like
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat SeLe EE rev</title> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> </head> <body> <div style="page-break-before:always; page-break-after:always"><div><p>&#0;&#1;&#2;&#3;&#4;&#5;&#6;&#7;&#...
instead of <

PDFBox Button execute javascript to close document

给你一囗甜甜゛ submitted on 2019-12-04 06:11:32
Question: My use case is to have a button like this on a PDF page (ultimately I want to add buttons to existing pages, but for now I just want to see it work on anything).
----------
-  Back  -
----------
All it does is close the current PDF. The idea is to have multiple tabs open, each showing a PDF; when you hit the "Back" button, it closes the current PDF, which returns focus to the previous one. This is what I have been trying to use so far.
// Create a new empty document
try {

open PDF File with parameters

偶尔善良 submitted on 2019-12-04 05:58:24
Question: I am working on a Java-based tool that searches for PDF files in selected directories and then searches those files for particular words or sentences. A JList then shows the matching files, and double-clicking an entry should open the file in the PDF reader (Adobe Reader) directly at the page where the word or sentence appears. I tried two different things. Runtime.exec:
try{ Runtime.getRuntime().exec("rundll32" + " " + "url.dll,FileProtocolHandler /A page=4
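
One alternative to going through a protocol handler is to start the reader executable directly and pass the open parameter as its own argument. The sketch below only builds the argument list, in the `/A "page=N"` form that Adobe describes for opening PDF files from the command line. The executable and file paths are hypothetical placeholders, and the launch itself is left commented out so the example runs on any machine.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds a command line that asks a PDF reader to open a file at a given page. */
final class ReaderCommand {
    /** Returns the arguments: readerExe, /A, page=N, pdfPath. */
    static List<String> openAtPage(String readerExe, String pdfPath, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page numbers start at 1");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add(readerExe);
        cmd.add("/A");
        cmd.add("page=" + page);
        cmd.add(pdfPath);
        return cmd;
    }

    public static void main(String[] args) {
        // Hypothetical paths; adjust to the local installation.
        List<String> cmd = openAtPage("C:\\path\\to\\AcroRd32.exe", "C:\\docs\\report.pdf", 4);
        System.out.println(String.join(" ", cmd));
        // To actually launch the reader: new ProcessBuilder(cmd).start();
    }
}
```

Passing each argument as a separate list element to `ProcessBuilder` avoids hand-written quoting of paths that contain spaces.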

Java PDFBox - Reading and modifying a pdf with special characters (diacritics)

纵饮孤独 submitted on 2019-12-04 05:35:15
I'm trying to modify a PDF using this method (first code block: using PDFStreamParser, iterating through the PDFOperator tokens, and updating each COSString when needed): http://www.coderanch.com/t/556009/open-source/PdfBox-Replace-String-double-pdf I'm having an issue with some UTF-8 characters (diacritics): when I print the text that I want to update, it shows up like "Societ? ?ii Na?ionale" (where '?' is a code like 0002 or 0004). The funny things are: when I write the updated PDF file, the characters are shown correctly (even though I couldn't detect and replace them); if I try to strip the text using

How to change the coordinates of a text in a pdf page from lower left to upper left

谁都会走 submitted on 2019-12-04 05:31:50
Question: I am using PDFBox and the itextsharp DLL to process a PDF so that I can get the coordinates of the text inside a rectangle. The rectangle coordinates are extracted using itextsharp.dll, which uses a coordinate system with its origin at the lower left. The page text comes from PDFBox, which uses a coordinate system with its origin at the upper left. I need help converting the coordinates from lower left to upper left. Updating
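
A minimal sketch of the conversion asked about here, in plain Java. It assumes an unrotated page whose box starts at (0, 0) and whose height is known in the same units as the coordinates; the 792-point height in `main` is only an example value. The class and method names are hypothetical and belong to neither library.

```java
/** Converts between lower-left-origin and upper-left-origin page coordinates. */
final class CoordinateFlip {
    /** Flips a single y value; x is unchanged by the conversion. */
    static double flipY(double y, double pageHeight) {
        return pageHeight - y;
    }

    /**
     * Converts a rectangle given by its lower-left corner (x, y), width and
     * height into {x, top, width, height}, where top is the distance from the
     * top edge of the page down to the top edge of the rectangle.
     */
    static double[] rectToTopLeft(double x, double y, double width, double height,
                                  double pageHeight) {
        return new double[] { x, pageHeight - (y + height), width, height };
    }

    public static void main(String[] args) {
        // Example values only: a 100 x 20 rectangle on a 792-point-high page.
        double[] r = rectToTopLeft(50, 700, 100, 20, 792);
        System.out.println(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3]);
    }
}
```

The rectangle's height has to be included in the flip because its reference corner moves from the bottom edge to the top edge; flipping only `y` would give the position of the bottom edge measured from the top of the page.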

How to replace a space with a word while extract the data from PDF using PDFBox

天涯浪子 submitted on 2019-12-04 05:14:50
Question: I want to replace any empty column with a word, for example the word BLK, while extracting PDF data. The tables below show the original table, the expected result, and the actual result.
Original Table
+--------------------------------------+
|#  |NAME   |TEL            |GENDER    |
|---------------------------|----------|
|1  |JOHN   |096587498      |M         |
|2  |VILLA  |               |F         |
+--------------------------------------+
Expected Result
# NAME TEL GENDER
1 JOHN 096587498 M
2 VILLA BLK F
Actual Result
# NAME TEL GENDER
1 JOHN 096587498 M
2

Put a Button on PDF with PDFBox 2.x

爷,独闯天下 submitted on 2019-12-04 04:17:32
Question: I hope somebody can help me with a problem with buttons and text fields on a PDF created with PDFBox 2.x. I put a button on my page that sets a date in a text field using a JavaScript function. That works fine. I then tried to put the text field and the button in a document with more than one page, so that both appear on every page, and so that each button writes the date only to the text field on the page where the clicked button is.

Copy+pasting text from PDF results in garbage

五迷三道 submitted on 2019-12-04 02:34:32
I am writing a Master's thesis on an NLP system. One of its components is an extractor, which extracts plain text from PDF files. A few PDF files cannot be extracted correctly. The extractor (the PDFBox library) returns a string like this: "┤xDn║if|d├gDF"Ti&cD╬lh d FÁhis~n ╗xd f«"d┤ffih »h" or "10a61a91a22a25a3a27a17a23a20a8a13a14a61a25a17" I checked each file that causes this extraction problem, and the text of all of them also cannot be copy-pasted from a PDF reader (Adobe Reader and Foxit Reader). Viewing them in these readers works, but after selecting the content and copying to