Java: Apache POI: Can I get clean text from MS Word (.doc) files?
Question

The strings I get programmatically from MS Word files using Apache POI are not the same as the text I see when I open the files in MS Word. When using the following code:

    File someFile = new File("some\\path\\MSWFile.doc");
    InputStream inputStrm = new FileInputStream(someFile);
    HWPFDocument wordDoc = new HWPFDocument(inputStrm);
    System.out.println(wordDoc.getText());

the output is a single line with many 'invalid' characters (yes, the 'boxes') and many unwanted strings, like