Apache POI HWPF - problem converting a doc file to PDF

妖精的绣舞 posted on 2019-12-03 03:21:41

Apache Tika has a good example of reading style information from an HWPF document. The Tika code generates HTML from the HWPF contents, but something very similar should work for your case.

The Tika class is https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java

One thing to note about Word documents is that everything within a single Character Run has the same formatting applied to it. A Paragraph is made up of one or more Character Runs, so the run boundaries mark where the formatting changes. Some styling is applied at the Paragraph level and some at the run level, so depending on which formatting you are interested in, you may need to read it from the paragraph or from the run.
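To make the paragraph/run split concrete, here is a minimal sketch of turning that structure into HTML, in the spirit of what the Tika class does. It does not use POI or Tika at all: `Run`, `Para` and the helper methods are hypothetical stand-ins invented for illustration, and the choice of `<b>`, `<i>` and `text-align` is an assumption rather than what Tika actually emits.

```java
import java.util.Arrays;
import java.util.List;

public class RunHtmlSketch {

    /** Hypothetical stand-in for a character run: text plus run-level formatting. */
    public static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        public Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /** Hypothetical stand-in for a paragraph: paragraph-level alignment plus its runs. */
    public static final class Para {
        final String align;
        final List<Run> runs;

        public Para(String align, List<Run> runs) {
            this.align = align;
            this.runs = runs;
        }
    }

    /** Escapes the characters that are significant in HTML text. */
    public static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps one run's text in tags chosen from its run-level flags. */
    public static String runToHtml(Run run) {
        String html = escape(run.text);
        if (run.italic) html = "<i>" + html + "</i>";
        if (run.bold) html = "<b>" + html + "</b>";
        return html;
    }

    /** Emits one <p> per paragraph; alignment comes from the paragraph, not from the runs. */
    public static String paraToHtml(Para para) {
        StringBuilder sb = new StringBuilder("<p style=\"text-align:" + para.align + "\">");
        for (Run run : para.runs) {
            sb.append(runToHtml(run));
        }
        return sb.append("</p>").toString();
    }

    public static void main(String[] args) {
        Para para = new Para("center", Arrays.asList(
                new Run("Plain, ", false, false),
                new Run("bold", true, false),
                new Run(" & italic", false, true)));
        System.out.println(paraToHtml(para));
    }
}
```

With real HWPF objects, the flags would come from the run and the alignment from the paragraph. For PDF output, the same two-level walk would apply, with the string building replaced by calls into whatever PDF library you use.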

Amar Gajbhiye

If you use WordExtractor, you will get the text only. Try the CharacterRun class instead, which gives you the style along with the text. See the following sample code.

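import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

// doc is assumed to be an already-opened org.apache.poi.hwpf.HWPFDocument
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    Paragraph poiPara = range.getParagraph(i);
    // Loop over the run count instead of comparing end offsets,
    // so the index can never run past the last character run
    for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
        CharacterRun run = poiPara.getCharacterRun(j);
        System.out.println("Color " + run.getColor());
        System.out.println("Font size " + run.getFontSize()); // in half-points
        System.out.println("Font Name " + run.getFontName());
        System.out.println(run.isBold() + " " + run.isItalic() + " " + run.getUnderlineCode());
        System.out.println("Text is " + run.text());
    }
}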