I am using PDFBox to validate a PDF document. There are requirements to check for certain types of text present in a PDF.
My solution to this problem was to create a new class that extends PDFTextStripper and overrides the method getCharactersByArticle(), making it public so that calling code can read the collected characters.
Note: this uses PDFBox version 1.8.5.
CustomPDFTextStripper class
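import java.io.IOException;
import java.util.List;
import java.util.Vector;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class CustomPDFTextStripper extends PDFTextStripper
{
    public CustomPDFTextStripper() throws IOException {
        super();
    }

    // Exposes the per-article character lists filled in while the text is written.
    public Vector<List<TextPosition>> getCharactersByArticle() {
        return charactersByArticle;
    }
}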
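The reason this subclass trick works is that Java lets an override widen a method's visibility. Here is a minimal sketch of that idea on its own, with no PDFBox dependency. BaseCollector, OpenCollector and VisibilityDemo are hypothetical names made up for illustration, and the sketch assumes the library class declares the member as protected.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a library class that keeps its data protected.
class BaseCollector {
    protected List<String> items = new ArrayList<String>();

    void collect(String item) {
        items.add(item);
    }

    // Only subclasses (and same-package code) can call this.
    protected List<String> getItems() {
        return items;
    }
}

// The override widens access from protected to public, which Java permits.
class OpenCollector extends BaseCollector {
    @Override
    public List<String> getItems() {
        return items;
    }
}

class VisibilityDemo {
    public static void main(String[] args) {
        OpenCollector collector = new OpenCollector();
        collector.collect("J");
        collector.collect("N");
        System.out.println(collector.getItems()); // prints [J, N]
    }
}
```

The reverse is not allowed: an override cannot narrow visibility, so a public method in the base class cannot become protected in the subclass.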
This way I can parse the PDF document and then get each TextPosition from a custom extraction function:
private void extractTextPosition() throws FileNotFoundException, IOException {
    PDFParser parser = new PDFParser(new FileInputStream(pdf));
    parser.parse();
    PDDocument document = parser.getPDDocument();
    try {
        // writeText() fills the stripper's character lists as a side effect.
        StringWriter outString = new StringWriter();
        CustomPDFTextStripper stripper = new CustomPDFTextStripper();
        stripper.writeText(document, outString);
        Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
        for (List<TextPosition> tplist : vectorlistoftps) {
            for (TextPosition text : tplist) {
                System.out.println(" String "
                        + "[x: " + text.getXDirAdj() + ", y: "
                        + text.getY() + ", height:" + text.getHeightDir()
                        + ", space: " + text.getWidthOfSpace() + ", width: "
                        + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
                        + text.getCharacter());
            }
        }
    } finally {
        // Release the document's resources even if extraction fails.
        document.close();
    }
}
Each TextPosition carries a lot of information about a single character of the PDF document, including its position, height, width, and scale.
OUTPUT:
String [x: 168.24, y: 64.15997, height:6.061287, space: 8.9664, width: 3.4879303, yScale: 8.9664]J
String [x: 171.69745, y: 64.15997, height:6.061287, space: 8.9664, width: 2.2416077, yScale: 8.9664]N
String [x: 176.25777, y: 64.15997, height:6.0343876, space: 8.9664, width: 6.4737396, yScale: 8.9664]N
String [x: 182.73778, y: 64.15997, height:4.214208, space: 8.9664, width: 3.981079, yScale: 8.9664]e
.....
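Once you have per-character data like this, a validation check is mostly a matter of grouping and comparing values. Below is a rough sketch of two such checks: joining characters into lines by their y coordinate, and testing whether a run of characters shares one yScale. Glyph and GlyphLines are hypothetical helper classes, not part of PDFBox. The sketch assumes the characters arrive in reading order and that the tolerance values suit your document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the values printed above; not a PDFBox class.
final class Glyph {
    final float x, y, width, yScale;
    final String ch;

    Glyph(float x, float y, float width, float yScale, String ch) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.yScale = yScale;
        this.ch = ch;
    }
}

final class GlyphLines {
    // Joins glyphs into lines; a new line starts when y moves by more than yTolerance.
    static List<String> toLines(List<Glyph> glyphs, float yTolerance) {
        List<String> lines = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        Float lastY = null;
        for (Glyph g : glyphs) {
            if (lastY != null && Math.abs(g.y - lastY) > yTolerance) {
                lines.add(current.toString());
                current = new StringBuilder();
            }
            current.append(g.ch);
            lastY = g.y;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    // Returns true when every glyph's yScale is within epsilon of the first one.
    static boolean hasUniformScale(List<Glyph> glyphs, float epsilon) {
        if (glyphs.isEmpty()) {
            return true;
        }
        float first = glyphs.get(0).yScale;
        for (Glyph g : glyphs) {
            if (Math.abs(g.yScale - first) > epsilon) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Values copied from the output above.
        List<Glyph> glyphs = new ArrayList<Glyph>();
        glyphs.add(new Glyph(168.24f, 64.15997f, 3.4879303f, 8.9664f, "J"));
        glyphs.add(new Glyph(171.69745f, 64.15997f, 2.2416077f, 8.9664f, "N"));
        glyphs.add(new Glyph(176.25777f, 64.15997f, 6.4737396f, 8.9664f, "N"));
        glyphs.add(new Glyph(182.73778f, 64.15997f, 3.981079f, 8.9664f, "e"));
        System.out.println(toLines(glyphs, 0.5f));
        System.out.println(hasUniformScale(glyphs, 0.01f));
    }
}
```

In the real code you would build each Glyph from a TextPosition inside the loop instead of hard-coding the values.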