PDFBox: working with very large PDFs.

Submitted by 陌路散爱 on 2020-01-02 01:48:06

Question


I am working with some very large PDFs, some over 7GB in size. The PDFs have up to 20,000 pages and many full-page color images. I'd like to use PDFBox to work with the PDFs, but due to their size I get OutOfMemoryErrors when I attempt to open them.

I'm working with version pdfbox-app-1.6.0, on Windows 7 using IntelliJ, Java 6.

First I tried writing a simple program that just opened the PDF in a PDDocument and copied each page over to another PDDocument: http://ideone.com/arKhB

Next I tried using the PDFBox CopyDoc example.

Both examples run out of memory.

I'm assuming this is because PDFBox is trying to read the whole document into memory. Is there a way to have it open only one page at a time? I know processing would be slower, but at the moment I can't process anything.


Answer 1:


In the 2.0.* versions, open the PDF like this:

PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This sets up buffering so that only temporary files are used (no main memory), with no restriction on their size.
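As a minimal sketch of the call above in a complete program, the following opens a PDF with temp-file-only buffering and prints its page count. The class name `LargePdfPageCount`, the helper `countPages`, and the path `large.pdf` are placeholders, not part of the original answer. It assumes PDFBox 2.0.x on the classpath and Java 7 or later for try-with-resources.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;

class LargePdfPageCount {

    // Hypothetical helper: opens the PDF with temp-file-only buffering
    // and returns its page count. The document is closed on exit,
    // which also releases the temporary files.
    static int countPages(File pdfFile) throws IOException {
        try (PDDocument doc = PDDocument.load(pdfFile, MemoryUsageSetting.setupTempFileOnly())) {
            return doc.getNumberOfPages();
        }
    }

    public static void main(String[] args) throws IOException {
        // "large.pdf" is a placeholder; pass a real path as the first argument.
        File pdfFile = new File(args.length > 0 ? args[0] : "large.pdf");
        System.out.println("Pages: " + countPages(pdfFile));
    }
}
```

If temp-file-only buffering turns out to be too slow, MemoryUsageSetting.setupMixed(long maxMainMemoryBytes) is another option in the same class: it keeps up to the given number of bytes in main memory and spills the rest to temporary files.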

Update 17.4.2018: More tricks to save memory are described in the FAQ. Not yet described there, but available since 2.0.9, is subsampling (skipping pixel lines/rows) during rendering, enabled with PDFRenderer.setSubsamplingAllowed(true). This saves memory for PDF files that contain huge images.
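A rough sketch of how the subsampling option might be combined with temp-file buffering when rendering a single page. The class name `SubsampledRender`, the helper `renderPage`, and the file names are hypothetical. It assumes PDFBox 2.0.9 or later and Java 7 or later.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

class SubsampledRender {

    // Hypothetical helper: renders one page (zero-based index) at the given DPI,
    // allowing the renderer to subsample large embedded images.
    static BufferedImage renderPage(File pdfFile, int pageIndex, float dpi) throws IOException {
        try (PDDocument doc = PDDocument.load(pdfFile, MemoryUsageSetting.setupTempFileOnly())) {
            PDFRenderer renderer = new PDFRenderer(doc);
            renderer.setSubsamplingAllowed(true); // available since 2.0.9
            return renderer.renderImageWithDPI(pageIndex, dpi);
        }
    }

    public static void main(String[] args) throws IOException {
        // "large.pdf" and "page-1.png" are placeholder file names.
        File pdfFile = new File(args.length > 0 ? args[0] : "large.pdf");
        BufferedImage image = renderPage(pdfFile, 0, 72f);
        ImageIO.write(image, "png", new File("page-1.png"));
    }
}
```

A low DPI such as 72 keeps the output image small, which is where subsampling is most likely to help: the renderer can skip source pixels that would not be visible at that resolution.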



Source: https://stackoverflow.com/questions/11301818/pdfbox-working-with-very-large-pdfs
