How do I write a Python script that can read doc/docx files and convert them to txt?

终归单人心 2021-01-29 09:26

Basically I have a folder with plenty of .doc/.docx files. I need them in .txt format. The script should iterate over all the files in a directory, convert them to .txt files an

3 answers
  •  难免孤独
    2021-01-29 10:08

    I figured this would make an interesting quick programming project. This has only been tested on a simple .docx file containing "Hello, world!", but the train of logic should give you a place to work from to parse more complex documents.

    from shutil import copyfile, rmtree
    import sys
    import os
    import zipfile
    from lxml import etree
    
    # command format: python3 docx_to_txt.py Hello.docx
    
    # let's get the file name
    zip_dir = sys.argv[1]
    # cut off the .docx, make it a .zip
    zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
    # make a copy of the .docx and put it in .zip
    copyfile(zip_dir, zip_dir_zip_ext)
    # unzip the .zip
    zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
    zip_ref.extractall('./temp')
    # get the xml out of /word/document.xml
    data = etree.parse('./temp/word/document.xml')
    # we'll want to go over all 't' elements in the xml node tree.
    # note that MS Office uses namespaces, so the 'w' prefix must be defined in the namespaces dictionary argument
    # each w:t element holds a piece of the file's text. that's what we're looking for
    # result is a list filled with the text of each non-empty w:t node in the xml document
    # (the "if node.text" guard skips empty w:t elements, whose .text is None)
    result = [node.text.strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}) if node.text]
    # dump result into a new .txt file
    with open(os.path.splitext(zip_dir)[0]+'.txt', 'w', encoding='utf-8') as txt:
        # join the elements of result together since txt.write can't take lists
        joined_result = '\n'.join(result)
        # write it into the new file
        txt.write(joined_result)
    # close the zip_ref file
    zip_ref.close()
    # get rid of our mess of working directories
    rmtree('./temp')
    os.remove(zip_dir_zip_ext)
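
A variation on the script above, as a minimal sketch: since a .docx file is a zip archive, word/document.xml can be read straight from it, with no .zip copy or temp directory to clean up. It also joins the w:t runs of each w:p (paragraph) element, so one paragraph becomes one line instead of one line per run. The name docx_to_text is made up for illustration, and the sketch uses the standard library's xml.etree.ElementTree rather than lxml. Tables, tabs, and explicit line breaks get no special treatment.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(docx_path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; not part of the original script.
    """
    # A .docx is a zip archive, so the XML can be read without extracting it
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    # Each w:p is a paragraph; its w:t descendants hold the pieces of text
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        lines.append("".join(runs))
    return "\n".join(lines)
```

The returned string can then be written out with open(..., 'w', encoding='utf-8'), as in the script above.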
    

    I'm sure there's a more elegant or Pythonic way to accomplish this. The simplest setup is to keep the file you want to convert in the same directory as the Python file, although the script accepts any path as its argument. It converts one file per run. Command format: python3 docx_to_txt.py file_name.docx
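
The question asks for a whole folder rather than a single file, so here is a rough sketch of the loop around a converter. The names convert_folder, convert, and pattern are assumptions for illustration: convert stands for any function that takes a file path and returns its text, for example one built from the .docx logic in this answer. Only .docx files are matched. A legacy .doc file is a binary format, not a zip archive, so it would need to be converted by some other tool first.

```python
from pathlib import Path


def convert_folder(folder, convert, pattern="*.docx"):
    """Write a .txt file next to every file in `folder` matching `pattern`.

    `convert` is any callable that takes a file path and returns its text
    (hypothetical; supply your own). Returns the list of .txt paths written.
    """
    written = []
    for source in sorted(Path(folder).glob(pattern)):
        # Word leaves lock files such as "~$report.docx" next to open documents
        if source.name.startswith("~$"):
            continue
        target = source.with_suffix(".txt")
        target.write_text(convert(source), encoding="utf-8")
        written.append(target)
    return written
```

Passing the converter in as an argument keeps the folder walk separate from the parsing, so the same loop can be reused if the extraction logic changes.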
