Convert Word doc or docx files into text files?

难免孤独 2020-12-05 01:28

I need a way to convert .doc or .docx files to .txt without installing anything. I also don't want to have to manually open Wor

11 answers
  •  长情又很酷
    2020-12-05 01:48

    For .doc, I've had some success with the Linux command-line tool antiword. It extracts the text from a .doc file very quickly and gives a good rendering of indentation. It writes to standard output, so in bash you can redirect the result to a text file.
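
A minimal sketch of that redirect, assuming antiword is installed and on your PATH. The helper name doc_to_txt is made up for illustration and is not part of antiword:

```shell
# Hypothetical helper: convert one .doc file to a .txt file beside it.
# Assumes the antiword command is available on PATH.
doc_to_txt() {
    # antiword prints the extracted text on stdout, so redirect it.
    # "${1%.*}.txt" swaps the last extension for .txt (report.doc -> report.txt).
    if ! antiword "$1" > "${1%.*}.txt"; then
        # Do not leave an empty or partial file behind on failure.
        rm -f "${1%.*}.txt"
        return 1
    fi
}

# Batch usage (commented out so that sourcing this file has no side effects):
# for f in *.doc; do doc_to_txt "$f"; done
```

The output file lands next to the input, so running the loop in a directory of .doc files leaves one .txt per document.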

    For .docx, I've used the Open XML SDK (the "OOXML SDK"), as some other users mentioned. It is just a .NET library that makes it easier to work with the XML zipped up inside an OOXML file. There is a lot of metadata that you will want to discard if you are only interested in the text. I see that other people have already written code for this: DocXToText.

    I have also found that Aspose.Words has a very simple API and great support.

    There is also this bash command from commandlinefu.com which works by unzipping the .docx:

    unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
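
One catch with the one-liner above: it strips every tag without inserting line breaks, so all paragraphs run together, and XML entities such as &amp; stay encoded. Here is a hedged variant that turns each closing </w:p> tag into a newline and decodes the five predefined XML entities. The function names xml_to_text and docx_to_txt are made up for illustration, and the sketch assumes that no tag in word/document.xml is split across lines:

```shell
# Hypothetical filter: WordprocessingML on stdin, plain text on stdout.
xml_to_text() {
    # 1. End of paragraph (</w:p>) becomes a newline. The backslash followed
    #    by a real line break is the portable way to put a newline in a sed
    #    replacement, so the "/g'" line must stay at column 0.
    # 2. Drop all remaining tags.
    # 3. Decode the predefined XML entities, with &amp; last so that an
    #    escaped entity such as &amp;lt; is not decoded twice.
    sed -e 's/<\/w:p>/\
/g' \
        -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# Hypothetical wrapper: some.docx -> some.txt, using unzip as above.
docx_to_txt() {
    unzip -p "$1" word/document.xml | xml_to_text > "${1%.*}.txt"
}

# Usage (commented out so that sourcing this file has no side effects):
# docx_to_txt some.docx
```

Tabs, manual line breaks, headers, footers and footnotes are not handled here; this only covers the main body text.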
    
