Best way to extract text from a Word doc without using COM/automation?

遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo

10 answers
  •  天命终不由人
    2020-12-07 21:47

    I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped this in Python functions, so it is easy to call from the parsing system (which is written in Python).

    import subprocess

    def doc_to_text_catdoc(filename):
        # os.popen3 no longer exists in Python 3; subprocess.run replaces it.
        # Passing the arguments as a list keeps the filename away from a shell.
        result = subprocess.run(
            ["catdoc", "-w", filename],
            capture_output=True,
            text=True,
        )
        # Treat any output on stderr as a failure.
        if result.stderr:
            raise OSError("Executing the command caused an error: %s" % result.stderr)
        return result.stdout

    # similar doc_to_text_antiword()
    

    The -w switch to catdoc turns off line wrapping, BTW.
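
For completeness, here is a rough sketch of what the antiword counterpart mentioned in the comment above might look like. It assumes Python 3.7+ and that the antiword binary is on PATH. `run_extractor` is a hypothetical helper name introduced only for this example, and calling antiword with no options is an assumption about what is needed, not necessarily what the original setup used.

```python
import subprocess


def run_extractor(argv):
    """Run an external converter and return what it wrote to stdout.

    Hypothetical helper for illustration; not part of catdoc or antiword.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    # Same policy as doc_to_text_catdoc: any stderr output counts as a failure.
    if result.stderr:
        raise OSError("Executing the command caused an error: %s" % result.stderr)
    return result.stdout


def doc_to_text_antiword(filename):
    # Assumes antiword is installed; with no options it writes plain text
    # to stdout.
    return run_extractor(["antiword", filename])
```

If the tool is not installed, subprocess.run raises FileNotFoundError, which is a subclass of OSError, so a caller can catch OSError for both a missing binary and a failed conversion.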
