Reading .doc file in Python using antiword in Windows (also .docx)

Submitted by 帅比萌擦擦* on 2019-12-08 06:32:18

Question


I tried reading a .doc file like this:

with open('file.doc', errors='ignore') as f:
    text = f.read()

It did read the file, but the result was full of junk, and I can't strip that junk because I don't know where it starts or where it ends.
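
The junk most likely appears because a legacy .doc is a binary OLE2 compound file rather than text, so opening it in text mode decodes arbitrary bytes. Below is a minimal sketch, assuming you only want to tell the two containers apart by their leading bytes. The function name `sniff_container` is made up for illustration, and the signatures identify only the container, so other OLE2 or ZIP files (for example .xls or .zip) would match too.

```python
# Leading bytes of the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (legacy .doc)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP archive (.docx is a ZIP)


def sniff_container(path):
    """Return 'ole2', 'zip' or 'unknown' based on the first bytes of the file."""
    # Read in binary mode so no text decoding is attempted.
    with open(path, "rb") as f:
        header = f.read(8)
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A check like this could complement the extension test when files are misnamed.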

I also tried installing the textract module, which says it can read any file format, but I ran into many dependency issues while installing it on Windows.

So I did this with the antiword command-line utility instead; my answer is below.


Answer 1:


You can use the antiword command-line utility to do this. Many of you may already have tried it, but I still wanted to share.

  • Download antiword from here

  • Extract the archive, copy the antiword folder to the C:\ drive, and add the path C:\antiword to the PATH environment variable.

  • Now the Python code:

    import os, docx2txt
    def get_doc_text(filepath, file):
        # build the full path so this works from any working directory
        full_path = os.path.join(filepath, file)
        if file.endswith('.docx'):
            text = docx2txt.process(full_path)
            return text
        elif file.endswith('.doc'):
            # antiword prints plain text; redirect it into a temporary file
            # (named like the .doc with an 'x' appended) and read it back
            doc_file = full_path
            docx_file = full_path + 'x'
            if not os.path.exists(docx_file):
                # quote both paths so names containing spaces still work
                os.system('antiword "' + doc_file + '" > "' + docx_file + '"')
                with open(docx_file) as f:
                    text = f.read()
                os.remove(docx_file)  # the temporary file was only needed for reading, so delete it
            else:
                # a real .docx with the same base name already exists; it is a
                # different file and must not be overwritten, so skip this .doc
                print('Info: a .docx file with the same name as the .doc already exists, so the .doc was not read')
                text = ''
            return text
        return ''  # neither .doc nor .docx
    
  • Now call this function:

    filepath = "D:\\input\\"
    files = os.listdir(filepath)
    for file in files:
        text = get_doc_text(filepath, file)
        print(text)
    

This could be a good alternative way to read .doc files in Python on Windows.
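
A possible variation on the function above is to capture antiword's output directly rather than redirecting it into a temporary file. This is only a sketch, not the approach used in the answer: it assumes antiword writes the extracted text to standard output (which the `>` redirection above relies on), and the function name `read_doc_with_antiword` and its `command` parameter are hypothetical, added so a full path to antiword.exe can be supplied.

```python
import shutil
import subprocess


def read_doc_with_antiword(doc_path, command=("antiword",)):
    """Return the text that the given command prints for doc_path.

    `command` is a hypothetical parameter for this sketch; it defaults to
    looking up `antiword` on PATH.
    """
    # Fail early with a clear message if the executable cannot be found.
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]!r} was not found on PATH")
    # Passing a list avoids shell quoting problems with spaces in paths.
    result = subprocess.run(
        [*command, doc_path],
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    # Decode leniently; the actual output encoding is an assumption here.
    return result.stdout.decode(errors="replace")
```

With this, the `.doc` branch could call `read_doc_with_antiword(os.path.join(filepath, file))`, and the name clash with an existing .docx file would no longer arise because nothing is written to disk.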

Hope it helps. Thanks.



Source: https://stackoverflow.com/questions/51727237/reading-doc-file-in-python-using-antiword-in-windows-also-docx
