Java Apache POI read Word (.doc) file and get named styles used

温柔的废话 2021-01-24 18:06

I am trying to read a Microsoft Word 2003 Document (.doc) using poi-scratchpad-3.8 (HWPF). I need to either read the file word by word, or character by character. Either way i

1 Answer
  •  耶瑟儿~
    2021-01-24 18:42

    I would suggest that you take a look at the source code of WordExtractor from Apache Tika, as it's a great example of getting text and styling from a Word document using Apache POI.

    Based on what you did and didn't say in your question, I suspect you're looking for something a little like this:

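        // Needs org.apache.poi.hwpf.model.StyleDescription and
        // org.apache.poi.hwpf.usermodel.{Paragraph, Range};
        // "document" is an already-opened org.apache.poi.hwpf.HWPFDocument
        Range r = document.getRange();
        for (int i = 0; i < r.numParagraphs(); i++) {
           Paragraph p = r.getParagraph(i);
           String text = p.text();
           // Only look up the style if its index is within the stylesheet
           if (document.getStyleSheet().numStyles() > p.getStyleIndex()) {
              StyleDescription style =
                   document.getStyleSheet().getStyleDescription(p.getStyleIndex());
              String styleName = style.getName();
              System.out.println(styleName + " -> " + text);
           }
           else {
              // Text has an unknown or invalid style
           }
        }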
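
    If what you want is the set of named styles a document uses, rather than one printed line per paragraph, one option is to collect each styleName from the loop above into a list and tally it afterwards. The following is only a minimal sketch of that tallying step: StyleUsage and countStyles are hypothetical names, not part of Apache POI, and it assumes you add null to the list for paragraphs that land in the else branch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Apache POI: tallies how often each style name occurs.
final class StyleUsage {
    private StyleUsage() {}

    // styleNames holds one entry per paragraph, e.g. the value of style.getName()
    // gathered inside the paragraph loop. A null entry stands for a paragraph
    // whose style index was unknown or invalid (an assumption of this sketch).
    static Map<String, Integer> countStyles(List<String> styleNames) {
        // LinkedHashMap keeps the styles in the order they first appear
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : styleNames) {
            String key = (name == null) ? "(unknown)" : name;
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

    The keys of the returned map are then the named styles in use, and each value is the number of paragraphs carrying that style.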
    

    For anything more advanced, take a look at the WordExtractor source code and see what else you can do with this sort of thing!
