Question
I have a lot of docx files and I want to read them in the terminal. I found catdoc: http://www.wagner.pp.ru/~vitus/software/catdoc/
When I use it, the output is just unreadable characters. My docx files are encoded in UTF-8. I tried "catdoc -u my_file.docx", but it does not work.
Please help. Thank you very much.
Answer 1:
docx files are zipped XML files.
To extract the main document and strip the XML tags, try something based on
unzip -p "*.docx" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
(from commandlinefu)
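The one-liner above puts all of the text on a single line and leaves XML entities such as &amp;amp; undecoded. A minimal sketch of a slightly friendlier variant follows. The function names xml_to_text and docx_cat are made up for illustration, and the sketch assumes that paragraphs end with a w:p closing tag, as they normally do in word/document.xml.

```shell
# xml_to_text: read WordprocessingML (word/document.xml) on stdin,
# write plain text on stdout. Hypothetical helper, not part of any tool.
xml_to_text() {
    # 1. Turn each paragraph end (</w:p>) into a line break.
    # 2. Drop every remaining tag.
    # 3. Decode the basic XML entities; &amp; goes last so that
    #    "&amp;lt;" ends up as the literal text "&lt;".
    sed -e 's|</w:p>|\
|g' \
        -e 's/<[^>]\{1,\}>//g' \
        -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# docx_cat: print the text of each DOCX file given as an argument.
docx_cat() {
    for f in "$@"; do
        unzip -p "$f" word/document.xml | xml_to_text
    done
}
```

With these defined, something like "docx_cat *.docx | less" should page through every document in the current directory. Headers, footers and footnotes live in other parts of the archive and are not covered by this sketch.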
Answer 2:
As far as I understand, catdoc can generally only be used on DOC files. A DOCX file is a zipped container holding several parts, one of which is the document itself in an XML format.
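A quick way to see which kind of file you are dealing with is to look at its first four bytes. The sketch below relies on two well-known signatures: ZIP archives (and therefore DOCX) start with "PK\003\004", while legacy DOC files are OLE2 compound files starting with the bytes D0 CF 11 E0. The function name doc_kind is hypothetical, and a "zip" result only says the file is some ZIP archive, not that it is a valid DOCX.

```shell
# doc_kind: print "zip", "ole2" or "unknown" based on a file's magic bytes.
doc_kind() {
    # Dump the first 4 bytes as hex and squeeze out spaces and newlines.
    magic=$(od -An -tx1 -N4 "$1" | tr -d ' \n')
    case $magic in
        504b0304) echo zip ;;      # ZIP container: DOCX, DOTX, ...
        d0cf11e0) echo ole2 ;;     # OLE2 compound file: legacy DOC
        *)        echo unknown ;;
    esac
}
```

If "doc_kind my_file.docx" prints zip, catdoc is the wrong tool for that file and one of the approaches below is a better fit.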
Having said that, I have had good results extracting the contents of DOCX files, and even DOTX files, using either the docx2txt tool or the unoconv tool. The latter needs the OpenOffice or LibreOffice suite installed.
Here are some example workflows, which I have used successfully in the past:
# This one, contrary to the unoconv case, does not fire up an instance
# of either LibreOffice or OpenOffice.
docx2txt.pl < ./pesky-word-doc.docx > ./pesky-word-doc.txt
# This one, however, does fire up a rather heavy 'headless' OpenOffice
# or LibreOffice instance process per conversion. You can get around this
# using the next approach below.
unoconv -f txt -o ./pesky-word-doc.txt ./pesky-word-doc.docx
# If you need to convert a couple of dozen such documents, you might want
# to run it via a service port (you get the idea):
unoconv --listener --port=2002 &
unoconv -f txt -o outdir *.docx
unoconv -f pdf -o outdir *.docx && open ./outdir/*.pdf # Convenient, if you run MacOSX
kill -15 %-
# Kind of introducing catdoc: The sed was needed for German documents where
# somehow I couldn't find the proper encoding settings.
unoconv -f doc -o ./pesky-word-doc.doc ./pesky-word-doc.docx && \
catdoc -u ./pesky-word-doc.doc | sed 's/ь/ü/g;s/д/ä/g;s/ц/ö/g'
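The three substitutions above are consistent with Windows-1252 bytes being decoded as Windows-1251: ü, ä and ö sit at the same byte values in cp1252 as ь, д and ц do in cp1251. Under that assumption, the upper-case umlauts and ß would be affected in the same way, so a sketch of a fuller mapping could look like this (fix_umlauts is a hypothetical name):

```shell
# fix_umlauts: undo a cp1252-read-as-cp1251 mix-up for German letters.
# Assumes UTF-8 text on stdin; plain s/// is used so that each multi-byte
# character is replaced as a literal sequence.
fix_umlauts() {
    sed -e 's/ь/ü/g' -e 's/д/ä/g' -e 's/ц/ö/g' \
        -e 's/Ь/Ü/g' -e 's/Д/Ä/g' -e 's/Ц/Ö/g' \
        -e 's/Я/ß/g'
}
```

It would be used as "catdoc -u ./pesky-word-doc.doc | fix_umlauts". Note that this mangles any text that really is Cyrillic, so it only makes sense for documents known to be German.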
There are other options, such as some of the available Java parsers. Output quality differs between tools, so which approach to use depends on what you need the text for.
Source: https://stackoverflow.com/questions/15557573/how-to-use-catdoc-to-display-dock-file-encoded-in-utf-8