Best way to extract text from a Word doc without using COM/automation?

遇见更好的自我 2020-12-07 21:29

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platfo

10 Answers
  • 2020-12-07 21:49

    OpenOffice has an API.

  • 2020-12-07 21:52

    Using the OpenOffice API, and Python, and Andrew Pitonyak's excellent online macro book I managed to do this. Section 7.16.4 is the place to start.

    One other tip to make it work without needing the screen at all is to use the Hidden property:

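    from com.sun.star.beans import PropertyValue

    # desktop is the UNO Desktop service; docpath is a file:// URL
    RO = PropertyValue('ReadOnly', 0, True, 0)
    Hidden = PropertyValue('Hidden', 0, True, 0)
    xDoc = desktop.loadComponentFromURL(docpath, "_blank", 0, (RO, Hidden))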
    

    Otherwise the document flicks up on the screen (probably on the webserver console) when you open it.
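
    If scripting the API is more than you need, a related route is to let the office suite do the conversion from the command line. The sketch below is not the macro approach described above; it assumes a LibreOffice-style `soffice` binary on the PATH that supports `--headless` and `--convert-to`. The function names and the default binary name are illustrative only.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the argument list for a headless conversion to plain text.

    Assumes a LibreOffice-style binary; older OpenOffice builds may not
    support these options.
    """
    return [
        soffice,
        "--headless",                # never open a window
        "--convert-to", "txt:Text",  # plain-text export filter
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def extract_text(doc_path, out_dir, soffice="soffice"):
    """Convert doc_path to text in out_dir and return the text."""
    subprocess.run(build_convert_command(doc_path, out_dir, soffice), check=True)
    # The converter writes <original stem>.txt into out_dir.
    out_file = Path(out_dir) / (Path(doc_path).stem + ".txt")
    # The output encoding can depend on the environment, so decode leniently.
    return out_file.read_text(encoding="utf-8", errors="replace")
```

    Each call starts a separate office process, so for many documents the API route with a long-running instance is probably the better fit.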

  • 2020-12-07 21:54

    For docx files, check out the Python script docx2txt available at

    http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt

    for extracting the plain text from a docx document.
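
    To show why this works without Word: a .docx file is a ZIP archive whose main body is stored as XML in `word/document.xml`. The following is a minimal standard-library sketch of that idea, not the docx2txt script itself; the function name `docx_to_text` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the text of a .docx file, one line per paragraph.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across one or more w:t elements.
        pieces = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

    This ignores headers, footers, footnotes, and tab or line-break elements, and it does not handle the older binary .doc format at all.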

  • 2020-12-07 21:55

    (Same answer as extracting text from MS word files in python)

    Use the native Python docx module which I made this week. Here's how to extract all the text from a doc:

    # Assumes the answer author's own docx module is importable as "docx"
    from docx import *

    document = opendocx('Hello world.docx')

    # This location is where most document content lives
    docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

    # Extract all text
    print(getdocumenttext(document))
    

    See Python DocX site

    100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes, no crap.
