Get font of each line using PDFBox

Anonymous (unverified), submitted 2019-12-03 02:45:02

Question:

Is there a way to get the font of each line of a PDF file using PDFBox? I have tried this, but it just lists all the fonts used on each page. It does not show which line or piece of text is rendered in which font.

    List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
    for (PDPage page : pages) {
        // fonts declared in the page resources, keyed by resource name
        Map<String, PDFont> pageFonts = page.getResources().getFonts();
        for (String key : pageFonts.keySet()) {
            System.out.println(key + " - " + pageFonts.get(key));
            System.out.println(pageFonts.get(key).getBaseFont());
        }
    }

Any input is appreciated. Thanks!

Answer 1:

Whenever you want to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in PDF content parsing for you.

You use the plain PDFTextStripper class like this:

    PDDocument document = ...;
    PDFTextStripper stripper = new PDFTextStripper();
    // set stripper start and end page or bookmark attributes unless you want all the text
    String text = stripper.getText(document);

This returns merely the plain text, e.g. from some R40 form:

You can, on the other hand, override its method writeString(String, List) and process more information than the mere text. To add the name of the font in use wherever the font changes, you can use this:

    PDFTextStripper stripper = new PDFTextStripper() {
        // base font name of the most recently processed glyph
        String prevBaseFont = "";

        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            StringBuilder builder = new StringBuilder();

            for (TextPosition position : textPositions)
            {
                String baseFont = position.getFont().getBaseFont();
                // emit a [FontName] marker only where the font changes
                if (baseFont != null && !baseFont.equals(prevBaseFont))
                {
                    builder.append('[').append(baseFont).append(']');
                    prevBaseFont = baseFont;
                }
                builder.append(position.getCharacter());
            }

            writeString(builder.toString());
        }
    };

For the same form you get

If you don't want the font information to be merged with the text, simply fill separate data structures in your overridden method instead.
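One possible shape for such a separate structure is sketched here. The class FontLog and its method names are hypothetical, not part of PDFBox. The helper is plain Java so that it can be tried without a PDF at hand, and the comment at the top shows how it might be fed from the overridden writeString, assuming the same PDFBox 1.8 calls used above. Whether each call to writeString covers a whole line or a smaller chunk depends on the PDFBox version, so treat "chunk" loosely.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not a PDFBox class): stores font names next to the
// extracted text instead of merging them into it.
//
// Assumed wiring inside the PDFTextStripper subclass:
//   List<String> glyphs = new ArrayList<String>();
//   List<String> fonts = new ArrayList<String>();
//   for (TextPosition p : textPositions) {
//       glyphs.add(p.getCharacter());
//       fonts.add(p.getFont().getBaseFont());
//   }
//   fontLog.record(glyphs, fonts);
//   writeString(text);   // the text output itself stays untouched
final class FontLog {

    // One chunk of text plus the distinct fonts seen in it, in order of first use.
    static final class Entry {
        final String text;
        final List<String> fonts;

        Entry(String text, List<String> fonts) {
            this.text = text;
            this.fonts = fonts;
        }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    // glyphs.get(i) was drawn with fontNames.get(i); null font names are skipped.
    void record(List<String> glyphs, List<String> fontNames) {
        StringBuilder text = new StringBuilder();
        Set<String> fonts = new LinkedHashSet<String>();
        for (int i = 0; i < glyphs.size(); i++) {
            text.append(glyphs.get(i));
            String font = fontNames.get(i);
            if (font != null) {
                fonts.add(font);
            }
        }
        entries.add(new Entry(text.toString(), new ArrayList<String>(fonts)));
    }

    // All recorded chunks in the order they were written.
    List<Entry> entries() {
        return entries;
    }
}
```

After stripper.getText(document) has run, iterating over fontLog.entries() gives each chunk of text together with the fonts used in it, while the returned string stays free of font markers.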

TextPosition offers a lot more information on the piece of text it represents. Inspect it!


