Read a PDF upload stream one page at a time with Java

◇◆丶佛笑我妖孽 submitted on 2019-12-04 22:41:16
Steen

For a generic PDF document, you have no way of knowing where one page ends and the next one starts, at least when using PDFBox.

If your concern is resource usage, I suggest you parse the PDF document into a COSDocument and extract the parsed objects from it using getObjects(), which returns a java.util.List. This should be easy to fit into whatever scarce resources you have.

Note that you can easily convert your parsed PDF documents into Lucene indexes through the PDFBox API.

Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox is able to make an in-memory representation of quite large PDF documents without much effort.

For parsing the PDF document from an InputStream, look at the COSDocument class.

For writing Lucene indexes, look at the LucenePDFDocument class.

For in-memory representations of PDF documents, look at the PDDocument class, which wraps a COSDocument (FDFDocument is the counterpart for FDF form data).

In the 2.0.* versions, open the PDF like this:

PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This sets up buffering so that only temporary files are used (no main memory), with no size restriction.
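If the PDF arrives as an upload stream rather than a file, the same idea can be applied one step earlier by spooling the stream to disk before parsing. The following is only a minimal sketch using the standard library; the class name UploadSpooler and the method spoolToTempFile are made up for illustration and are not part of PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not a PDFBox class): spools an upload to a temp file.
class UploadSpooler {

    /**
     * Copies the upload stream to a temporary file and returns its path.
     * Files.copy streams the data through a small buffer, so heap usage
     * does not grow with the size of the upload.
     */
    static Path spoolToTempFile(InputStream upload) throws IOException {
        Path tmp = Files.createTempFile("upload-", ".pdf");
        try {
            // The temp file already exists (and is empty), so overwrite it.
            Files.copy(upload, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // Do not leave a partial file behind if the upload fails midway.
            Files.deleteIfExists(tmp);
            throw e;
        }
        return tmp;
    }
}
```

The returned path can then be turned into a File with tmp.toFile() and passed to the PDDocument.load call shown above. The caller is responsible for deleting the temp file once the document has been closed.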

This was answered here.

Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.

Here is an example copied from the link above which shows how to draw a PDF page into an image:

    import com.sun.pdfview.PDFFile;
    import com.sun.pdfview.PDFPage;
    import java.awt.Image;
    import java.awt.Rectangle;
    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // map the file into memory and hand the buffer to PDF Renderer
    File file = new File("test.pdf");
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel channel = raf.getChannel();
    ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    PDFFile pdffile = new PDFFile(buf);

    // draw the first page to an image
    PDFPage page = pdffile.getPage(0);

    // get the width and height for the doc at the default zoom
    Rectangle rect = new Rectangle(0, 0,
            (int) page.getBBox().getWidth(),
            (int) page.getBBox().getHeight());

    // generate the image
    Image img = page.getImage(
            rect.width, rect.height, // width & height
            rect, // clip rect
            null, // null for the ImageObserver
            true, // fill background with white
            true  // block until drawing is done
            );

I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.
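To make the byte-by-byte idea concrete, here is a rough sketch that streams through the raw bytes and counts "/Type /Page" dictionary entries while skipping "/Type /Pages" tree nodes. The class name PageMarkerScanner is made up for illustration. This is a heuristic, not a parser: it assumes the page dictionaries are stored uncompressed, which often does not hold for PDF 1.5+ files that use object streams, and the order of objects in the file need not match the page order.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical heuristic scanner; assumes uncompressed page dictionaries.
class PageMarkerScanner {
    private static final String TYPE = "/Type";
    private static final String PAGE = "/Page";

    private static boolean isWhitespace(int b) {
        return b == ' ' || b == '\t' || b == '\r' || b == '\n' || b == '\f' || b == 0;
    }

    // A PDF name ends at end of input, whitespace, or a delimiter character.
    private static boolean endsName(int b) {
        return b == -1 || isWhitespace(b) || "()<>[]{}/%".indexOf(b) >= 0;
    }

    /** Counts "/Type /Page" entries, reading one byte at a time. */
    static int countPageMarkers(InputStream in) throws IOException {
        int count = 0;
        // state 0..4: chars of "/Type" matched; 5: skipping whitespace;
        // 6..9: chars of "/Page" matched (state - 5); 10: "/Page" complete.
        int state = 0;
        while (true) {
            int b = in.read();
            if (state == 10) {
                // Only count if the name really ends here ("/Pages" does not).
                if (endsName(b)) count++;
                state = 0; // then re-examine b as a possible new start
            }
            if (b == -1) break;
            if (state < 5) {
                if (b == TYPE.charAt(state)) state++;
                else state = (b == '/') ? 1 : 0;
            } else if (state == 5) {
                if (b == '/') state = 6;
                else if (!isWhitespace(b)) state = 0;
            } else {
                if (b == PAGE.charAt(state - 5)) state++;
                else state = (b == '/') ? 1 : 0;
            }
        }
        return count;
    }
}
```

Wrap the upload in a BufferedInputStream before passing it in, since the scanner calls read() once per byte. If the count comes back as zero for a file that clearly has pages, the page objects are most likely compressed, and a real parser such as PDFBox is needed.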
