How to extract bold text from a PDF using PDFBox?

迷失自我 2020-12-09 06:33

I am using Apache PDFBox to extract text. I can extract the text from a PDF, but I don't know how to tell whether a word is bold or not. (code suggestion would be

1 answer
  • 2020-12-09 06:53

    PDFTextStripper produces plain text, so once extraction is done the formatting information is gone and it is too late to filter. You can, however, override certain of its methods and let through only text that is formatted the way you want.

    In case of the PDFTextStripper you have to override

    protected void processTextPosition( TextPosition text )
    

    In your override you check whether the text in question fulfills your requirements (TextPosition contains much information on the text in question, not only the text itself), and if it does, forward the TextPosition text to the super implementation.

    The main problem, though, is recognizing which text is bold.

    One criterion for boldness may be the word "bold" in the font name, e.g. Courier-BoldOblique. You access the font of the text using text.getFont() and the PostScript name of the font using the font's getBaseFont() method:

    String postscriptName = text.getFont().getBaseFont();
    

    Another criterion may come from the font descriptor. You get the font descriptor of a font using the getFontDescriptor method, and a font descriptor has an optional font weight value:

    float fontWeight = text.getFont().getFontDescriptor().getFontWeight();
    

    The value is defined as

    (Optional; PDF 1.5; should be used for Type 3 fonts in Tagged PDF documents) The weight (thickness) component of the fully-qualified font name or font specifier. The possible values shall be 100, 200, 300, 400, 500, 600, 700, 800, or 900, where each number indicates a weight that is at least as dark as its predecessor. A value of 400 shall indicate a normal weight; 700 shall indicate bold.

    The specific interpretation of these values varies from font to font.

    EXAMPLE 300 in one font may appear most similar to 500 in another.

    (Table 122, Section 9.8.1, ISO 32000-1)
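
The two criteria above (font name and font weight) can be combined into one small check. The following is only a minimal sketch in plain Java with no PDFBox dependency: the class `BoldHeuristics` and its methods are hypothetical names, and the inputs are assumed to be the values you read yourself via text.getFont().getBaseFont() and the font descriptor's getFontWeight(). It also assumes that a missing FontWeight entry shows up as a value below 700 (for example 0), and that you check for a null font descriptor before calling it.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox: it only inspects values
// that the caller has already read from a TextPosition's font.
class BoldHeuristics {

    // True if the PostScript name contains "bold", ignoring case.
    // Note: this also matches names such as "Semibold" or "DemiBold",
    // and it tolerates subset prefixes like "ABCDEF+Arial-BoldMT".
    static boolean nameLooksBold(String postscriptName) {
        return postscriptName != null
                && postscriptName.toLowerCase(Locale.ROOT).contains("bold");
    }

    // True if the weight is at least 700, the value ISO 32000-1
    // designates as bold. A font without a FontWeight entry is assumed
    // to arrive here with a value below 700.
    static boolean weightLooksBold(float fontWeight) {
        return fontWeight >= 700f;
    }

    // Either hint alone is treated as sufficient in this sketch.
    static boolean looksBold(String postscriptName, float fontWeight) {
        return nameLooksBold(postscriptName) || weightLooksBold(fontWeight);
    }

    public static void main(String[] args) {
        System.out.println(looksBold("Courier-BoldOblique", 0f)); // true
        System.out.println(looksBold("ABCDEF+Helvetica", 700f));  // true
        System.out.println(looksBold("Helvetica", 400f));         // false
    }
}
```

Inside an overridden processTextPosition you would then forward the TextPosition to the super implementation only when such a check returns true. Whether "either hint is enough" is the right rule depends on your documents; some fonts carry neither hint even though they render bold.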

    There may be additional hints of boldness to check, e.g. a large line width

    double lineWidth = getGraphicsState().getLineWidth();
    

    when the rendering mode draws an outline, too:

    int renderingMode = getGraphicsState().getTextState().getRenderingMode();
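
The line-width hint can be sketched as a check on the two values read above. This is an illustration in plain Java under several assumptions: the class `OutlineBoldHeuristic` and the `minLineWidth` threshold are hypothetical, the rendering mode is taken as the integer from the PDF specification (ISO 32000-1, section 9.3.6), and only the modes that both fill and stroke the glyphs (2 and 6) are treated as "drawing an outline, too". Newer PDFBox versions may return the rendering mode as an enum rather than an int, so a conversion may be needed there.

```java
// Hypothetical helper, not part of PDFBox: it only inspects values
// that the caller has already read from the graphics state.
class OutlineBoldHeuristic {

    // Text rendering modes that fill the glyph and also stroke its
    // outline (ISO 32000-1, section 9.3.6).
    static final int FILL_THEN_STROKE = 2;
    static final int FILL_THEN_STROKE_AND_CLIP = 6;

    static boolean fillsAndStrokes(int renderingMode) {
        return renderingMode == FILL_THEN_STROKE
                || renderingMode == FILL_THEN_STROKE_AND_CLIP;
    }

    // Flags text whose outline is stroked on top of the fill with a line
    // at least minLineWidth wide. The threshold is an assumption to be
    // tuned per document; the line width is not scaled to the font size.
    static boolean looksArtificiallyBold(int renderingMode,
                                         double lineWidth,
                                         double minLineWidth) {
        return fillsAndStrokes(renderingMode) && lineWidth >= minLineWidth;
    }

    public static void main(String[] args) {
        System.out.println(looksArtificiallyBold(2, 0.5, 0.3)); // true
        System.out.println(looksArtificiallyBold(0, 0.5, 0.3)); // false
    }
}
```

A fixed threshold is the simplest choice; comparing the line width to the font size of the text may work better when documents mix very different text sizes.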
    

    You may have to experiment with the documents you have at hand to find out which criteria suffice.
