Read pdf uploadstream one page at a time with java

问题

I am trying to read a pdf document in a j2ee application.

For a webapplication I have to store pdf documents on disk. To make searching easy I want to make a reverse index of the text inside the document; if it is OCR.

With the PDFbox library its possible to create a pdfDocument object wich contains an entire pdf file. However to preserve memory and improve overall performance I'd rather handle the document as a stream and read one page at a time into a buffer.

I wonder if it is possible to read a filestream containing pdf page by page or even one line at a time.

回答1:

For a given generic pdf document you have no way of knowing where one page end and another one starts, using PDFBox at least.

If your concern is the use of resources, I suggest you parse the pdf document into a COSDocument, extract the parsed objects from the COSDocument using the .getObjects(), which will give you a java.util.List. This should be easy to fit into whatever scarce resources you have.

Note that you can easily convert your parsed pdf documents into Lucene indexes through the PDFBox API.

Also, before venturing into the land of optimisations, be sure that you really need them. PDFBox is able to make an in-memory representation of quite large PDF documents without much effort.

For parsing the PDF document from an InputStream, look at the COSDocument class

For writing lucene indexes, look at LucenePDFDocument class

For in-memory representations of COSDocuments, look at FDFDocument

回答2:

In the 2.0.* versions, open the PDF like this:

PDDocument doc = PDDocument.load(file, MemoryUsageSetting.setupTempFileOnly());

This will setup buffering memory usage to only use temporary file(s) (no main-memory) with no restricted size.

This was answered here.

回答3:

Take a look at the PDF Renderer Java library. I have tried it myself and it seems much faster than PDFBox. I haven't tried getting the OCR text, however.

Here is an example copied from the link above which shows how to draw a PDF page into an image:

    File file = new File("test.pdf");
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel channel = raf.getChannel();
    ByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    PDFFile pdffile = new PDFFile(buf);

    // draw the first page to an image
    PDFPage page = pdffile.getPage(0);

    //get the width and height for the doc at the default zoom 
    Rectangle rect = new Rectangle(0,0,
            (int)page.getBBox().getWidth(),
            (int)page.getBBox().getHeight());

    //generate the image
    Image img = page.getImage(
            rect.width, rect.height, //width & height
            rect, // clip rect
            null, // null for the ImageObserver
            true, // fill background with white
            true  // block until drawing is done
            );

回答4:

I'd imagine you can read through the file byte by byte looking for page breaks. Line by line is more difficult because of possible PDF formatting issues.

来源：https://stackoverflow.com/questions/586257/read-pdf-uploadstream-one-page-at-a-time-with-java

标签

java

pdf

inputstream

pdfbox