How to use 'catdoc' to display a docx file encoded in UTF-8

Posted by 扶醉桌前 on 2019-12-08 10:30:31

Question


I have a lot of docx files and I want to read them in the terminal. I found catdoc: http://www.wagner.pp.ru/~vitus/software/catdoc/

When I use it, the output is just unreadable characters. My docx files are encoded in UTF-8. I tried "catdoc -u my_file.docx", but it does not work.

Please help. Thank you very much.


Answer 1:


A docx file is a zip archive of XML files.

To extract the document body and strip the XML tags, try something based on

unzip -p "*.docx" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

(from commandlinefu)
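
One caveat with the pipeline above: it deletes every tag without putting anything in its place, so paragraphs run together, and XML entities such as &amp;amp; stay encoded. In a non-UTF-8 locale the [^[:print:]] step may also strip non-ASCII characters. Below is a minimal sketch of a variant that turns each closing paragraph tag (</w:p>) into a newline before stripping the rest. The function name docx_xml_to_text is made up for illustration, and this is only a rough text dump, not a real XML parser.

```shell
# Hypothetical helper (not part of unzip or catdoc): reads the contents of
# word/document.xml on stdin and writes rough plain text on stdout.
# Steps, in order:
#   1. replace each closing paragraph tag with a newline
#   2. delete all remaining tags
#   3. decode the basic XML entities (&amp; last, so "&amp;lt;" becomes "&lt;")
docx_xml_to_text() {
  sed -e 's/<\/w:p>/\
/g' \
      -e 's/<[^>]\{1,\}>//g' \
      -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Small inline sample so the sketch runs without a real docx file.
printf '<w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>A &amp; B</w:t></w:r></w:p>\n' \
  | docx_xml_to_text
```

The sample prints "Hello" and "A & B" on separate lines. With a real file it would be used as: unzip -p my_file.docx word/document.xml | docx_xml_to_text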




Answer 2:


As far as I understand, catdoc can generally only be used on DOC files. A DOCX file is a zipped container holding several parts, one of which is the document itself in an XML format.

That said, I have had good results extracting the contents of DOCX files, and even DOTX files, using either the docx2txt tool or the unoconv tool; the latter needs the OpenOffice or LibreOffice suite installed.

Here are some example workflows, which I have used successfully in the past:

# This one, contrary to the unoconv case, does not fire up an instance
# of either LibreOffice or OpenOffice.
docx2txt.pl < ./pesky-word-doc.docx > ./pesky-word-doc.txt

# This one, however, does fire up a rather heavy 'headless' OpenOffice
# or LibreOffice instance process per conversion. You can get around this
# using the next approach below.
unoconv -f txt -o ./pesky-word-doc.txt ./pesky-word-doc.docx

# If you need to convert a couple of dozen such documents, you might want
# to run it via a service port (you get the idea):
unoconv --listener --port=2002 &
unoconv -f txt -o outdir *.docx
unoconv -f pdf -o outdir *.docx && open ./outdir/*.pdf # Convenient, if you run MacOSX
kill -15 %-

# Kind of introducing catdoc: The sed was needed for German documents where
# somehow I couldn't find the proper encoding settings.
unoconv -f doc -o ./pesky-word-doc.doc ./pesky-word-doc.docx && \
          catdoc -u ./pesky-word-doc.doc | sed 's/ь/ü/g;s/д/ä/g;s/ц/ö/g'
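
A note on the sed substitutions in the last example. Each pair shares one byte value across two Windows code pages: 0xFC is ü in Windows-1252 and ь in Windows-1251, 0xE4 is ä and д, and 0xF6 is ö and ц. A plausible explanation, not stated in the answer, is that the text was stored as Windows-1252 and decoded as Windows-1251. The sketch below reproduces that effect with iconv. The function name misread_as_cp1251 is made up, and the sketch assumes that your iconv accepts the charset names CP1252 and CP1251 and that the script is saved as UTF-8.

```shell
# Hypothetical helper: encode UTF-8 input as Windows-1252, then decode the
# same bytes as if they were Windows-1251 and print the result as UTF-8.
misread_as_cp1251() {
  iconv -f UTF-8 -t CP1252 | iconv -f CP1251 -t UTF-8
}

# The German umlauts come out as the Cyrillic letters the sed call undoes.
printf 'üäö\n' | misread_as_cp1251
```

If that is what happens here, telling catdoc the source charset (its man page lists a -s option for this) may make the sed step unnecessary. This has not been tested on the documents from the answer.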

There are other options, such as some of the available Java parsers. The output quality differs, so which approach to pick depends on your intended use.



Source: https://stackoverflow.com/questions/15557573/how-to-use-catdoc-to-display-dock-file-encoded-in-utf-8
