Best way to extract text from a Word doc without using COM/automation?

遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo

10 answers
  •  -上瘾入骨i
    2020-12-07 21:55

    (Same answer as extracting text from MS word files in python)

    Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

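    # All names used below come from the docx module this answer describes
    from docx import *

    document = opendocx('Hello world.docx')

    # This location is where most document content lives
    # (shown for reference; getdocumenttext below does not need it)
    docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

    # Extract all text (returned as a list of paragraphs)
    print(getdocumenttext(document))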
    

    See Python DocX site

    100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes, no crap.
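
If installing a module is not an option, the same idea can be sketched with the standard library alone, since a .docx file is a ZIP archive whose main body lives in word/document.xml. This is only a rough illustration, not the docx module's implementation: the function name docx_paragraphs is made up for this example, it reads the main document body only (no headers, footers, or footnotes), and it does not handle legacy binary .doc files.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(source):
    """Return the text of each paragraph in a .docx (path or file-like object).

    Hypothetical helper for illustration; not part of any library.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        pieces = []
        for node in para.iter():
            if node.tag == W_NS + "t":
                # Text run content
                pieces.append(node.text or "")
            elif node.tag == W_NS + "tab":
                pieces.append("\t")
            elif node.tag in (W_NS + "br", W_NS + "cr"):
                pieces.append("\n")
        paragraphs.append("".join(pieces))
    return paragraphs
```

Calling print("\n".join(docx_paragraphs("Hello world.docx"))) would then print one paragraph per line. Parsing with a real XML parser rather than regexes keeps this in the spirit of the answer above.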
