Question
I am extracting text from a PDF file using PDFBox. When I ask for the font of some of the text, I get null and I don't know why, although for other text in the same file I do get the font.
I am using this code:
protected void processTextPosition(TextPosition text) {
    String font = text.getFont().getBaseFont(); // is null for some of the text
}
Answer 1:
String font=text.getFont().getBaseFont(); // equal null
PDFont.getBaseFont is implemented to simply return the value of the BaseFont entry of the respective font dictionary.
Not all fonts provide a BaseFont entry in their font dictionary, though. In such a case that method will return null.
According to the PDF specification you can only expect fonts to have that entry if they are Type0 (composite), Type1, or TrueType fonts. If they are Type3, they don't have that entry.
This actually makes sense: Type3 fonts are pure PDF stuff down to their glyph definitions; thus, there is no base font to consider.
In case of Type0 (composite) fonts you might actually consider looking at the descendant font (using PDType0Font.getDescendantFont()) and inspecting its BaseFont entry, because the entry of the composite font is specified as a composition of the descendant's base font name and a CMap name.
And while all of the above is true for PDFs that follow the specification, you have to get used to seeing PDFs in the wild which do not follow the spec 100%. As the base font entry is not always strictly necessary for PDF handling in general, there surely are PDFs in the wild which don't provide it even for the font types where the spec calls for it.
Thus, always reckon with null values (or values not following the spec) here.
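The advice above can be sketched as a small null-safe helper. This is only an illustration, not part of PDFBox: the class name FontNames, the method resolve and the UNKNOWN placeholder are hypothetical names made up for this example. The helper takes plain strings, so it assumes you have already fetched the candidate names yourself, e.g. text.getFont().getBaseFont() and, for a PDType0Font, the BaseFont of the font returned by getDescendantFont().

```java
/**
 * Hypothetical helper (not a PDFBox class) that picks a usable font name
 * from values which may be null or blank.
 */
final class FontNames {

    /** Placeholder returned when no usable name exists (an arbitrary choice). */
    static final String UNKNOWN = "(unknown font)";

    private FontNames() {
        // static helper, no instances
    }

    /**
     * Returns baseFont if it is usable, otherwise descendantBaseFont,
     * otherwise UNKNOWN. Either argument may be null.
     *
     * @param baseFont           e.g. the result of font.getBaseFont()
     * @param descendantBaseFont e.g. the BaseFont of a Type0 font's
     *                           descendant font, or null if not applicable
     */
    static String resolve(String baseFont, String descendantBaseFont) {
        if (isUsable(baseFont)) {
            return baseFont.trim();
        }
        if (isUsable(descendantBaseFont)) {
            return descendantBaseFont.trim();
        }
        return UNKNOWN;
    }

    /** A name is usable if it is neither null nor only whitespace. */
    private static boolean isUsable(String name) {
        return name != null && !name.trim().isEmpty();
    }
}
```

Inside processTextPosition you would then call something like FontNames.resolve(text.getFont().getBaseFont(), null) and never have to deal with a null name further down. For a Type3 font, which has no BaseFont entry, this simply yields the placeholder.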
Source: https://stackoverflow.com/questions/21577850/getbasefont-equal-null-in-pdfbox