Extract footer data of PDF in Java

Question

I am able to extract the text of the PDF pages into a string, but the footer text is extracted along with it. I want to remove the footer text from every page of the PDF. How can I do that? I tried using Rectangle2D, but the coordinates I used do not return any text.

Answer 1:

In a comment the OP indicated that he used this code:

PDDocument doc = PDDocument.load("xyz.pdf");
PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get( 1 );
Rectangle2D region = new Rectangle2D.Double(10, 10, 10, 10);
String regionName = "region";
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);
System.out.println("Region is "+ stripper.getTextForRegion("region"));

For most documents this code extracts no text because it only looks at a small (10 x 10 pt) region near the upper left corner of the second page of the document. Thus, the values in new Rectangle2D.Double(10, 10, 10, 10) have to change.
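To see why such a small region usually comes back empty, it can help to test a plausible text position against it. The following is a minimal sketch that uses only java.awt.geom. The page size (US Letter, 612 x 792 pt), the sample text position, and the class and method names are assumptions made for illustration; they are not measured from the OP's file. It also assumes that a character is kept when its position lies inside the region rectangle, with y counted downward from the top of the page.

```java
import java.awt.geom.Rectangle2D;

public class RegionCheck {
    // Assumed US Letter page size in points; real pages can have any size.
    static final double PAGE_WIDTH = 612;
    static final double PAGE_HEIGHT = 792;

    /** Fraction of the (assumed) page area that a region covers. */
    static double coverage(Rectangle2D region) {
        return (region.getWidth() * region.getHeight()) / (PAGE_WIDTH * PAGE_HEIGHT);
    }

    public static void main(String[] args) {
        // The region from the OP's code.
        Rectangle2D tiny = new Rectangle2D.Double(10, 10, 10, 10);

        // Hypothetical start of a first body line, one inch in from the top left corner.
        double textX = 72;
        double textY = 72;

        System.out.println("tiny contains text start: " + tiny.contains(textX, textY));
        System.out.println("fraction of page covered: " + coverage(tiny));
    }
}
```

With one-inch margins, which many documents use, the first line of text already starts outside this region.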

The OP then commented: "I tried with various regions, yet I am not getting any text. If you have an idea for a normal PDF page, you should share it."

There is no such thing as a normal PDF page. The goal of PDF is to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created or the environment in which they are viewed or printed. There is no serious restriction on page dimensions or on the location of content on pages.

E.g. for this form

you need values like these

PDPage page = (PDPage)doc.getDocumentCatalog().getAllPages().get(0);
Rectangle2D region = new Rectangle2D.Float(0f, 230f, 612f, 300f);

to extract the body "I authorize any health plan ... I have received a copy of this authorization." without headers, footers, or form lines.
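One way to arrive at such values is to measure how much of the page the header and the footer occupy and to derive the body rectangle from that. The sketch below uses only java.awt.geom; the helper name bodyRegion is hypothetical, and the page height of 792 pt and the header and footer heights are assumptions chosen so that the result matches the rectangle used for the form above.

```java
import java.awt.geom.Rectangle2D;

public class BodyRegion {
    /**
     * Builds a full-width region that skips headerHeight points at the top
     * and footerHeight points at the bottom of a page (origin at the top left).
     */
    static Rectangle2D bodyRegion(double pageWidth, double pageHeight,
                                  double headerHeight, double footerHeight) {
        double height = pageHeight - headerHeight - footerHeight;
        if (height <= 0) {
            throw new IllegalArgumentException("header and footer leave no room for a body");
        }
        return new Rectangle2D.Double(0, headerHeight, pageWidth, height);
    }

    public static void main(String[] args) {
        // Assumed: a 612 x 792 pt page with 230 pt of header and 262 pt of footer.
        Rectangle2D region = bodyRegion(612, 792, 230, 262);
        System.out.println(region);
    }
}
```

The resulting rectangle starts at y = 230 and is 300 pt high, so it can be passed to addRegion in place of hard-coded numbers.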

If you have many similar pages (e.g. one large document in which many pages share a similar layout), you only have to measure once and can reuse the same region for every page you extract.
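Reusing one measured region for all pages could look like the following sketch. It assumes the PDFBox 1.8.x API that the code above uses (PDDocument.load(String), getAllPages(), and PDFTextStripperByArea in org.apache.pdfbox.util); later PDFBox versions name these differently. The file name, the class name, and the region values are placeholders taken from the snippets above and have to be adapted to the actual document.

```java
import java.awt.geom.Rectangle2D;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.util.PDFTextStripperByArea;

public class BodyTextExtractor {
    // Region measured once on a representative page. These are the values from
    // the form example and will differ for other documents.
    static final Rectangle2D BODY = new Rectangle2D.Float(0f, 230f, 612f, 300f);

    public static void main(String[] args) throws IOException {
        PDDocument doc = PDDocument.load("xyz.pdf");
        try {
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.addRegion("body", BODY);

            StringBuilder text = new StringBuilder();
            List<?> pages = doc.getDocumentCatalog().getAllPages();
            for (Object page : pages) {
                // Each call extracts the registered region from the given page.
                stripper.extractRegions((PDPage) page);
                text.append(stripper.getTextForRegion("body"));
            }
            // Body text of all pages, without whatever lies outside the region.
            System.out.println(text);
        } finally {
            doc.close();
        }
    }
}
```

If some pages have a different layout (for example a title page), they need their own region or have to be skipped.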

Source: https://stackoverflow.com/questions/26143942/extract-footer-data-of-pdf-in-java

Tags

java

pdfbox