Extracting text from an area with PDFbox

Submitted by て烟熏妆下的殇ゞ on 2019-12-12 00:56:30

Question


Is it possible to extract text from an area with PDFBox using just the binaries, without having to write my own code?


Answer 1:


Compile this simple program and pack it into a jar:

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class ExtractText {

    // Usage: ExtractText filepath page x y width height
    // Written against the PDFBox 2.x API (PDDocument.load).
    public static void main(String[] args) throws IOException {

        if (args.length != 6) {
            System.err.println("Usage: ExtractText filepath page x y width height");
            return;
        }

        // Parameters
        String filepath = args[0];

        // Zero-based page index, as expected by PDDocument.getPage
        int page = Integer.parseInt(args[1]);
        // Rectangle describing the region to extract
        int x = Integer.parseInt(args[2]);
        int y = Integer.parseInt(args[3]);
        int width = Integer.parseInt(args[4]);
        int height = Integer.parseInt(args[5]);

        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File(filepath))) {

            PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
            Rectangle2D rect = new Rectangle2D.Float(x, y, width, height);
            textStripper.addRegion("region", rect);

            PDPage docPage = document.getPage(page);

            // Collect the text that falls inside the registered region
            textStripper.extractRegions(docPage);

            String textForRegion = textStripper.getTextForRegion("region");

            System.out.println(textForRegion);
        }
    }
}

Run it from the command line, e.g.:

java -jar xxx.jar  filepathToPdf pageToExtract  x   y   width height

You should also add validation code for the parameters and a proper usage message.
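A minimal sketch of what that validation could look like, kept separate from the PDFBox code so it can be tried on its own. The class name `RegionArgs` and the specific rules (non-negative page index and coordinates, positive width and height) are assumptions for illustration, not part of the original answer.

```java
/** Hypothetical helper that parses and validates the six command-line arguments. */
final class RegionArgs {
    static final String USAGE = "Usage: ExtractText filepath page x y width height";

    final String filepath;
    final int page;   // zero-based page index
    final int x;
    final int y;
    final int width;
    final int height;

    private RegionArgs(String filepath, int page, int x, int y, int width, int height) {
        this.filepath = filepath;
        this.page = page;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Returns the parsed arguments or throws IllegalArgumentException with a readable message. */
    static RegionArgs parse(String[] args) {
        if (args == null || args.length != 6) {
            throw new IllegalArgumentException(USAGE);
        }
        String[] names = {"page", "x", "y", "width", "height"};
        int[] values = new int[names.length];
        for (int i = 0; i < names.length; i++) {
            try {
                values[i] = Integer.parseInt(args[i + 1].trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                        names[i] + " must be an integer, got: " + args[i + 1]);
            }
        }
        // Assumed rules: indices and coordinates may be zero, sizes may not.
        if (values[0] < 0 || values[1] < 0 || values[2] < 0) {
            throw new IllegalArgumentException("page, x and y must not be negative");
        }
        if (values[3] <= 0 || values[4] <= 0) {
            throw new IllegalArgumentException("width and height must be positive");
        }
        return new RegionArgs(args[0], values[0], values[1], values[2], values[3], values[4]);
    }
}
```

In `main`, one could call `RegionArgs.parse(args)` inside a try block, print the exception message together with `RegionArgs.USAGE` on failure, and additionally check that the page index is below `document.getNumberOfPages()` once the document is loaded.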

Edit

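Also put the PDFBox libraries on the classpath. Note that java ignores -cp when -jar is used, so add xxx.jar to the classpath and name the main class instead (on Windows, use ; rather than : as the separator):

java -cp "...:xxx.jar" ExtractText  filepathToPdf  pageToExtract  x  y  width   height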


Source: https://stackoverflow.com/questions/40101748/extracting-text-from-an-area-with-pdfbox
