Read Microsoft Word Documents into Plain Text (DOC, DOCX) in Java

Submitted by 故事扮演 on 2019-12-21 16:46:52

Question


I'm looking for something in Java to read Word documents and process their text. All I need is their text, nothing fancy. I know about Apache POI, but it doesn't support DOCX right now. Is there anything else out there?


Answer 1:


If you don't require formatting information, images, and all the other fancy stuff, then the job is a lot easier. Some 5 to 10 lines of code will do.

  1. Treat the DOCX as a zip file. It contains a bunch of files, including 'document.xml'. Use ZipInputStream and extract that file alone. (You can open the DOCX with your favorite zip utility and see for yourself!)
  2. Use a SAX parser and read the contents of the body/p/r/t nodes. Voila, you've got the text!

This approach applies only if the text is all you need.
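
The two steps above can be sketched with nothing but the JDK. This is a minimal illustration, not a complete solution: the names DocxText, TextHandler and extractText are made up for the example, the entry path word/document.xml is where Word normally stores the main body, and the namespace URI assumes a file saved in the usual (transitional) Office Open XML format. It ignores headers, footers, footnotes and tabs, and with full plumbing it runs longer than the 5 to 10 lines estimated above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class DocxText {

    // Assumption: namespace used by transitional WordprocessingML documents.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Collects the text inside w:t elements and ends each w:p with a newline. */
    static final class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private boolean inText = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                inText = false;
            } else if ("p".equals(localName)) {
                text.append('\n'); // one line per paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                text.append(ch, start, length);
            }
        }
    }

    /** Step 1: find word/document.xml in the zip. Step 2: SAX-parse it. */
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Buffer the entry so the parser never closes the zip stream.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                TextHandler handler = new TextHandler();
                factory.newSAXParser()
                       .parse(new ByteArrayInputStream(buf.toByteArray()), handler);
                return handler.text.toString();
            }
        }
        throw new IOException("word/document.xml not found; is this a DOCX file?");
    }
}
```

A call might look like DocxText.extractText(new FileInputStream("report.docx")), where report.docx is a placeholder file name. Note that this handles DOCX only; the older binary DOC format is not a zip archive and needs a library.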




Answer 2:


With some googling I found OpenXML4J, which might solve your issue. I have not used it before, but I am sure someone in the community will have better insight.

Note: This is a duplicate question. The linked question has the solution plus a bit of discussion.




Answer 3:


Try Apache POI. It can handle doc, docx, xls, xlsx, ppt, and pptx.

Another production-level solution is OpenOffice in headless mode, which can even be used in a server-side scenario.




Answer 4:


You could try docx4j; see http://dev.plutext.org/svn/docx4j/trunk/docx4j/src/main/java/org/docx4j/TextUtils.java



Source: https://stackoverflow.com/questions/2263951/read-microsoft-word-documents-into-plain-text-doc-docx-in-java
