How to determine artificial bold style, artificial italic style and artificial outline style of a text using PDFBox

花落未央 2020-11-28 16:41

I am using PDFBox to validate a PDF document. One requirement is to check whether the following types of text are present in the PDF:

  • Artificial Bold style text
2 answers
  •  余生分开走
    2020-11-28 16:53

    My solution to this problem was to create a new class that extends PDFTextStripper and overrides the following method, so that the characters collected per article can be read from outside the class:

    getCharactersByArticle()

    Note: this applies to PDFBox version 1.8.5.

    CustomPDFTextStripper class:

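    import java.io.IOException;
    import java.util.List;
    import java.util.Vector;

    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.pdfbox.util.TextPosition;

    public class CustomPDFTextStripper extends PDFTextStripper
    {
        public CustomPDFTextStripper() throws IOException {
            super();
        }

        // Expose the TextPosition lists (one list per article) collected while stripping.
        public Vector<List<TextPosition>> getCharactersByArticle() {
            return charactersByArticle;
        }
    }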
    

    This way I can parse the PDF document and then get the TextPosition objects from a custom extraction function:

    private void extractTextPosition() throws FileNotFoundException, IOException {

        // "pdf" is assumed to be a field holding the file (or path) to inspect.
        PDFParser parser = new PDFParser(new FileInputStream(pdf));
        parser.parse();
        StringWriter outString = new StringWriter();
        CustomPDFTextStripper stripper = new CustomPDFTextStripper();
        // writeText() runs the extraction and fills charactersByArticle as a side effect.
        stripper.writeText(parser.getPDDocument(), outString);
        Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
        for (int i = 0; i < vectorlistoftps.size(); i++) {
            List<TextPosition> tplist = vectorlistoftps.get(i);
            for (int j = 0; j < tplist.size(); j++) {
                TextPosition text = tplist.get(j);
                // Print the geometry of each character followed by the character itself.
                System.out.println(" String "
                    + "[x: " + text.getXDirAdj() + ", y: "
                    + text.getY() + ", height:" + text.getHeightDir()
                    + ", space: " + text.getWidthOfSpace() + ", width: "
                    + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
                    + text.getCharacter());
            }
        }
    }
    

    Each TextPosition carries a lot of information about a character in the PDF document.

    OUTPUT:

    String [x: 168.24, y: 64.15997, height:6.061287, space: 8.9664, width: 3.4879303, yScale: 8.9664]J

    String [x: 171.69745, y: 64.15997, height:6.061287, space: 8.9664, width: 2.2416077, yScale: 8.9664]N

    String [x: 176.25777, y: 64.15997, height:6.0343876, space: 8.9664, width: 6.4737396, yScale: 8.9664]N

    String [x: 182.73778, y: 64.15997, height:4.214208, space: 8.9664, width: 3.981079, yScale: 8.9664]e .....
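
The printed positions can be put to work for the artificial bold check. One common way to fake bold is to draw the same glyph twice with a tiny offset, so a possible heuristic is to look for identical characters at nearly the same position. The sketch below is only an illustration under that assumption: Glyph and OverprintDetector are hypothetical names, Glyph is a plain stand-in for the values read from a TextPosition (getCharacter(), getXDirAdj(), getY()), and the tolerance is a guess that would need tuning. It does not cover bold faked through stroke rendering, nor italic or outline styles. PDFTextStripper also has a setSuppressDuplicateOverlappingText option; if duplicates are suppressed during extraction, they would presumably have to be kept for this check to see them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the character and position read from a TextPosition. */
final class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

/** Hypothetical helper: flags glyphs that repeat an earlier glyph at almost the same spot. */
final class OverprintDetector {

    /**
     * Returns the indices of glyphs whose text equals that of an earlier glyph
     * and whose x and y both lie within the given tolerance of that earlier glyph.
     */
    static List<Integer> findOverprinted(List<Glyph> glyphs, float tolerance) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 1; i < glyphs.size(); i++) {
            Glyph cur = glyphs.get(i);
            for (int j = 0; j < i; j++) {
                Glyph earlier = glyphs.get(j);
                boolean sameText = cur.text.equals(earlier.text);
                boolean samePlace = Math.abs(cur.x - earlier.x) <= tolerance
                        && Math.abs(cur.y - earlier.y) <= tolerance;
                if (sameText && samePlace) {
                    hits.add(i);
                    break; // one match is enough to flag this glyph
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("J", 168.24f, 64.16f));
        glyphs.add(new Glyph("J", 168.54f, 64.16f)); // same glyph drawn again, 0.3 units to the right
        glyphs.add(new Glyph("N", 171.70f, 64.16f));
        System.out.println(findOverprinted(glyphs, 0.5f));
    }
}
```

In the extraction loop above, each TextPosition could be converted to a Glyph before calling findOverprinted. The quadratic scan keeps the sketch short; for large pages a spatial index or sorting by position would be a more reasonable choice.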
