pdfbox

Printing Chinese characters in pdfbox

只谈情不闲聊 posted on 2021-01-27 07:30:58
Question: I'm using the following setup: Java 11.0.1 and pdfbox 2.0.15. Objective: rendering a PDF that contains Chinese characters. Problem: java.lang.IllegalArgumentException: U+674E is not available in this font's encoding: WinAnsiEncoding. I have already tried: using different fonts with Chinese character support (the latest one is NotoSansCJKtc-Regular.ttf); setting the font to Unicode as described in "Java: Write national characters to PDF using PDFBox" (however, the loadTTF method used there is deprecated); using Arial
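The exception says that the font in use at the failing call is still a WinAnsiEncoding font, which has no slot for U+674E (李). The following is a minimal sketch of that limitation using only the JDK. It treats the `windows-1252` charset as a stand-in for WinAnsiEncoding, which is an assumption: the two are close but not identical. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class WinAnsiCheck {
    // Assumption: windows-1252 approximates PDF's WinAnsiEncoding.
    private static final Charset WIN_1252 = Charset.forName("windows-1252");

    /** Returns true if every character of the text can be encoded in windows-1252. */
    static boolean fitsWinAnsi(String text) {
        CharsetEncoder encoder = WIN_1252.newEncoder();
        return encoder.canEncode(text);
    }

    /** Returns the first code point that windows-1252 cannot encode, or -1 if there is none. */
    static int firstUnencodable(String text) {
        CharsetEncoder encoder = WIN_1252.newEncoder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String single = new String(Character.toChars(codePoint));
            if (!encoder.canEncode(single)) {
                return codePoint;
            }
            i += Character.charCount(codePoint);
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Name: \u674E";
        System.out.println("Fits WinAnsi: " + fitsWinAnsi(text));
        System.out.printf("First unencodable: U+%04X%n", firstUnencodable(text));
    }
}
```

In PDFBox 2.0.x the non-deprecated way to get a Unicode-capable font is `PDType0Font.load(document, fontFile)`, with the returned font then passed to `setFont` on the content stream. Whether that resolves this particular case cannot be confirmed from the truncated question.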

PDFBox: extract image location (wrong x and y)

旧城冷巷雨未停 posted on 2021-01-27 05:27:34
Question: Hello again, fellow programmers. I can extract PDF text coordinates and formatting properly, but I can't do the same for images. I get the correct width and height, but the x and y are wrong. I'm using Photoshop to check whether I'm getting the correct x, y, width, and height, and only the width and height match. Here is my code:
@Override
public void processOperator(Operator operator, List<COSBase> arguments) throws IOException {
    if ("cm".equals(operator.getName())) {
        float width =
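One common cause of this symptom, offered here as an assumption because the truncated question does not confirm it, is the coordinate origin: PDF user space starts at the bottom-left of the page, while image editors measure y from the top. A PDF image is painted into the unit square and placed by the current transformation matrix, so its box can be derived from the six `cm` operands `[a b c d e f]`. The sketch below uses only `java.awt.geom`; the class and method names are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class ImageBounds {
    /**
     * Bounding box, in PDF user space (origin at bottom-left), of the unit
     * square mapped through the given transformation matrix.
     */
    static Rectangle2D pdfBounds(AffineTransform ctm) {
        Rectangle2D unitSquare = new Rectangle2D.Double(0, 0, 1, 1);
        return ctm.createTransformedShape(unitSquare).getBounds2D();
    }

    /** The same box with y measured downward from the top edge of the page. */
    static Rectangle2D topLeftBounds(AffineTransform ctm, double pageHeight) {
        Rectangle2D box = pdfBounds(ctm);
        return new Rectangle2D.Double(
                box.getX(), pageHeight - box.getMaxY(), box.getWidth(), box.getHeight());
    }

    public static void main(String[] args) {
        // Example operands [a b c d e f] = [200 0 0 100 50 600]; the
        // AffineTransform constructor takes them in the same order.
        AffineTransform ctm = new AffineTransform(200, 0, 0, 100, 50, 600);
        System.out.println(pdfBounds(ctm));
        System.out.println(topLeftBounds(ctm, 792)); // 792 pt: example page height
    }
}
```

The results are in points (1/72 inch by default). Comparing them with pixel positions in a rasterized page also requires scaling by the render DPI divided by 72. If several `cm` operators are nested, the matrices must be concatenated before the box is computed.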

PDFBox: extract image location (wrong x and y)

前提是你 posted on 2021-01-27 05:27:07
Question: Hello again, fellow programmers. I can extract PDF text coordinates and formatting properly, but I can't do the same for images. I get the correct width and height, but the x and y are wrong. I'm using Photoshop to check whether I'm getting the correct x, y, width, and height, and only the width and height match. Here is my code:
@Override
public void processOperator(Operator operator, List<COSBase> arguments) throws IOException {
    if ("cm".equals(operator.getName())) {
        float width =

How to read PDF sections (header, abstract, references) with PDFBox?

一笑奈何 posted on 2021-01-24 13:51:59
Question: I am trying to read a PDF file and its sections, but I can't find an algorithm or library that does it correctly. I want to separate the parts of a file (header, abstract, references) and return their contents. Is there a PDFBox reference that solves this problem? Answer 1: The file provided by the OP as a representative example is unfortunately not tagged. Thus, there is no direct information indicating whether a given piece of text belongs to the title, the abstract, the references, or which part
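Because an untagged file carries no structural markers, any split has to be heuristic. The sketch below is not taken from the answer, whose continuation is cut off. It assumes the text has already been extracted to a plain string (for example with PDFTextStripper) and that section headings sit alone on a line. The heading list and all names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class SectionSplitter {
    // Hypothetical heading names; real documents may use other wording.
    private static final Set<String> HEADINGS = Set.of("abstract", "introduction", "references");

    /**
     * Splits extracted text into sections keyed by heading. Everything before
     * the first recognized heading is stored under "header".
     */
    static Map<String, String> split(String text) {
        Map<String, StringBuilder> parts = new LinkedHashMap<>();
        String current = "header";
        parts.put(current, new StringBuilder());
        for (String line : text.split("\\R")) {
            String key = line.trim().toLowerCase(Locale.ROOT);
            if (HEADINGS.contains(key)) {
                // A heading line starts a new section and is not copied into it.
                current = key;
                parts.putIfAbsent(current, new StringBuilder());
                continue;
            }
            parts.get(current).append(line).append('\n');
        }
        Map<String, String> result = new LinkedHashMap<>();
        parts.forEach((name, body) -> result.put(name, body.toString().trim()));
        return result;
    }

    public static void main(String[] args) {
        String text = "My Title\nA. Author\nAbstract\nWe study X.\nReferences\n[1] Foo";
        split(text).forEach((name, body) -> System.out.println(name + " -> " + body));
    }
}
```

A more robust variant would also use font size and position from the text extraction callbacks, since headings in real papers are often numbered or share a line with other text.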

How to read PDF sections (header, abstract, references) with PDFBox?

江枫思渺然 posted on 2021-01-24 13:51:27
Question: I am trying to read a PDF file and its sections, but I can't find an algorithm or library that does it correctly. I want to separate the parts of a file (header, abstract, references) and return their contents. Is there a PDFBox reference that solves this problem? Answer 1: The file provided by the OP as a representative example is unfortunately not tagged. Thus, there is no direct information indicating whether a given piece of text belongs to the title, the abstract, the references, or which part

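Converting PDF to images with pdfBox and fixing Chinese characters rendered as boxes or garbled text

孤者浪人 posted on 2021-01-24 11:35:26
Reference articles: (1) 使用pdfBox实现pdf转图片,解决中文方块乱码等问题 (Converting PDF to images with pdfBox and fixing Chinese characters rendered as boxes or garbled text) (2) https://www.cnblogs.com/hujunzheng/p/10508044.html Noted here for future reference. Source: oschina Link: https://my.oschina.net/u/4384923/blog/4922110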

How to remove a specific image from a PDF with PDFBox

我是研究僧i posted on 2021-01-07 02:52:55
Question: I need to remove a specific image from a PDF file according to its metadata. Sadly, all the examples I can find on the Internet use deprecated methods. I wrote something like this:
try (PDDocument doc = PDDocument.load(new ByteArrayInputStream(pdf))) {
    doc.getPages().forEach(page -> {
        PDResources resources = page.getResources();
        List<COSName> itemsToRemove = new ArrayList<>();
        resources.getXObjectNames().forEach(propertyName -> {
            if (!resources.isImageXObject(propertyName)) {
                return;
            }
            PDXObject

Java: converting PDF to image

隐身守侯 posted on 2021-01-04 16:42:59
Maven dependencies:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.8</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox-tools</artifactId>
    <version>2.0.8</version>
</dependency>
Code example:
private static final int HOME_PAGE_INDEX = 0;

/**
 * Pdf -> Image (first page)
 *
 * @param pdf the PDF as a byte stream
 * @param format the image format
 * @return the image of the PDF as a byte stream
 */
public static byte[] getImageFromPdf(byte[] pdf, String format) {
    return pdfHomePageToImage

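java: PDF to Word

自古美人都是妖i posted on 2020-12-12 11:21:21
Note: originally from 《java-pdf转word》 (java: PDF to Word).
1. Java: PDF text to Word. Without further ado, here is the screenshot. Usage is simple: (1) create a PDFBox object; (2) call the pdfToDoc() method, passing one argument (the file path). Latest jar download: link: https://pan.baidu.com/s/1snqjpSx password: jujg, or join QQ group 464429490 (it is in the group files).
2. Java: PDF images and tables to Word. Source: 《java-pdf转图片》 (java: PDF to image). Many readers report that PDF-to-doc conversion loses images, tables, and styling, has encoding problems, and so on. That is correct: this code can only convert text into a doc file, because it relies on stripper.writeText(doc, writer); where doc refers to the doc file and writer refers to:
FileOutputStream fos = new FileOutputStream("path to the PDF file");
Writer writer = new OutputStreamWriter(fos, "UTF-8");
So we came up with generating an image with JavaScript, or converting the PDF to an image first. Full-page screenshot in JS:
function takeScreenshot() {
    html2canvas(document.body, {
        onrendered: function (canvas) {
            document.body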

[PdfBox] Parsing PDFs with pdfbox

巧了我就是萌 posted on 2020-12-08 00:57:32
Preface: Sometimes you need to extract the text from a PDF and store it in a database. After looking through the pdfbox documentation, there are roughly two approaches.
1. Full-text extraction. When a PDF consists entirely of text in a regular layout, extracting the full text directly is enough. The full-text extraction code follows:
public String getTextFromPdf() throws Exception {
    String pdfPath = "path to the PDF file";
    // first page to extract
    int startPage = 1;
    // last page to extract
    int endPage = Integer.MAX_VALUE;
    String content = null;
    File pdfFile = new File(pdfPath);
    PDDocument document = null;
    try {
        // load the PDF document
        document = PDDocument.load(pdfFile);
        // extract the content
        PDFTextStripper pts = new PDFTextStripper();
        pts.setSortByPosition(true);
        endPage = document.getNumberOfPages();
        System.out.println("Total Page: " + endPage);
        pts.setStartPage(startPage);
        pts