Question
On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.
Please provide a code sample for extracting a DOCX file's content in plain text, ignoring any styles or formatting.
I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.
Answer 1:
This worked for me. Make sure you add the required JARs to the classpath (upgrade xmlbeans, etc.):
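import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public String extractText(InputStream in) throws Exception {
    // XWPFDocument parses the OOXML (.docx) package read from the stream
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    // getText() returns the document text without styles or formatting
    return ex.getText();
}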
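For readers curious what "plain text, ignoring any styles" means at the file level, here is a rough sketch that uses only the JDK. It is not how POI implements XWPFWordExtractor; it only relies on the fact that a .docx file is a ZIP package whose main body is stored in word/document.xml, with the visible text inside w:t elements grouped into w:p paragraphs. The class name DocxTextSketch and the helper sampleDocx are made up for this illustration, and the sketch ignores tabs, line breaks, headers, footers, footnotes, and paragraphs nested inside text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; it is not part of Apache POI.
final class DocxTextSketch {
    // Namespace URI of WordprocessingML elements such as w:p and w:t
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml in the ZIP package and joins the w:t text
    // of each w:p paragraph, one paragraph per output line.
    static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry so the XML parser never touches the ZIP stream
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations to avoid external entity tricks
                factory.setFeature(
                        "http://apache.org/xml/features/disallow-doctype-decl", true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(buf.toByteArray()));

                StringBuilder text = new StringBuilder();
                NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    Element p = (Element) paragraphs.item(i);
                    NodeList runs = p.getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    text.append('\n');
                }
                return text.toString();
            }
        }
        throw new IOException("word/document.xml not found in package");
    }

    // Builds a minimal in-memory stand-in for a .docx file, for the demo only.
    // The paragraph strings must not contain XML special characters.
    static byte[] sampleDocx(String... paragraphs) throws IOException {
        StringBuilder xml = new StringBuilder(
                "<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            // w:rPr with w:b marks the run as bold; the extractor skips it
            xml.append("<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>")
               .append(p)
               .append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] docx = sampleDocx("Hello", "World");
        System.out.print(extractText(new ByteArrayInputStream(docx)));
    }
}
```

For real documents the POI extractor is the better choice, since it deals with the many parts of the format that this sketch skips.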
Answer 2:
This is more generic. ExtractorFactory picks an extractor that matches the input, so the same code is not tied to DOCX:
public String extractText(InputStream in) throws Exception {
    POITextExtractor poitex = ExtractorFactory.createExtractor(in);
    return poitex.getText();
}
Source: https://stackoverflow.com/questions/1492738/how-to-extract-plain-text-from-a-docx-file-using-the-new-ooxml-support-in-apache