How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?

*爱你&永不变心* submitted on 2019-11-29 17:24:33

Question


On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.

Please provide a code sample for extracting a DOCX file's content in plain text, ignoring any styles or formatting.

I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.


Answer 1:


This worked for me. Make sure the required JARs are on the classpath (poi-ooxml and its dependencies, including an up-to-date xmlbeans).

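import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public String extractText(InputStream in) throws Exception {
    // Parse the DOCX package, then read its text without styles or formatting.
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    return ex.getText();
}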



Answer 2:


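This is more generic: ExtractorFactory inspects the input and returns the matching extractor, so the same code also handles the other formats POI supports (for example DOC or XLSX).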

POITextExtractor poitex = ExtractorFactory.createExtractor(in);

return poitex.getText();
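
For intuition about what such an extractor has to do, here is a minimal sketch that uses only the JDK and no POI classes. It relies on the fact that a DOCX file is a ZIP archive whose main body is stored in word/document.xml, with visible text inside w:t elements grouped into w:p paragraphs. The class name DocxPlainText and its methods are hypothetical, and this is not how POI implements extraction: it ignores headers, footers, footnotes, tabs and line breaks, so prefer the POI extractors above for real documents.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Apache POI.
class DocxPlainText {
    // WordprocessingML namespace used by word/document.xml.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text, one line per w:p paragraph. */
    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // Walk the archive until the main document part is found.
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // readAllBytes() (Java 9+) stops at the end of this entry.
                    return paragraphsToText(zip.readAllBytes());
                }
            }
        }
        throw new IllegalArgumentException("No word/document.xml entry found");
    }

    private static String paragraphsToText(byte[] documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match elements by namespace
        Document xml = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            // Concatenate every text run inside this paragraph.
            NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Calling DocxPlainText.extractText(new FileInputStream("report.docx")) (the file name is a placeholder) returns the paragraphs separated by newlines, with all styling dropped.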



Source: https://stackoverflow.com/questions/1492738/how-to-extract-plain-text-from-a-docx-file-using-the-new-ooxml-support-in-apache
