Question
I tried reading a .doc file like -
with open('file.doc', errors='ignore') as f:
    text = f.read()
It did read the file, but with a huge amount of junk. I can't remove that junk because I don't know where it starts and where it ends.
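The junk is there because a legacy .doc is a binary OLE2 compound file rather than text, so opening it in text mode decodes binary structures as characters. As a rough sketch, the container type can be checked from the first bytes before deciding how to read a file. The function name `sniff_container` is made up for illustration, and the check only identifies the container: other OLE2 files (such as .xls) and other ZIP archives match the same signatures.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE2 compound file (legacy .doc, .xls, ...)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP archive (.docx is a ZIP container)

def sniff_container(path):
    """Return 'ole2', 'zip', or 'unknown' based on the file's first bytes.

    Hypothetical helper: it reports the container format only,
    not whether the file is really a Word document.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A file reported as 'ole2' needs a converter such as antiword; reading it with a plain `open()` will always produce the junk described above.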
I also tried installing the textract module, which says it can read any file format, but there were many dependency issues while installing it on Windows.
So I did this with the antiword command line utility instead; my answer is below.
Answer 1:
You can use the antiword command line utility to do this. I know most of you will have tried it, but I still wanted to share.
Download antiword from here. Extract it, paste the antiword folder into the C:\ drive, and add the path C:\antiword to the PATH variable.
Now python code -
import os
import docx2txt

def get_doc_text(filepath, file):
    full_path = os.path.join(filepath, file)
    if file.endswith('.docx'):
        # docx2txt needs the full path, not just the file name
        text = docx2txt.process(full_path)
        return text
    elif file.endswith('.doc'):
        # antiword prints plain text; redirect it into a temporary file
        # named like the .doc plus 'x' (it is not a real .docx)
        doc_file = full_path
        docx_file = full_path + 'x'
        if not os.path.exists(docx_file):
            os.system('antiword ' + doc_file + ' > ' + docx_file)
            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # docx_file was just to read, so deleting
        else:
            # a file with the same name as the doc but a docx extension already exists,
            # which means it is a different file, so we can't read it
            print('Info : file with same name of doc exists having docx extension, so we cant read it')
            text = ''
        return text
Now call this function -
filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)
This could be a good alternative way to read a .doc file in Python on Windows.
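A possible variant, sketched here under assumptions rather than tested against antiword itself: capture antiword's standard output with `subprocess.run` instead of redirecting it into a temporary file. That avoids the name clash with an existing .docx and copes with spaces in paths. The function name `read_doc_with_antiword` and its `antiword` parameter are made up for illustration, and decoding the output as UTF-8 is an assumption, since the actual encoding depends on how antiword is configured.

```python
import shutil
import subprocess

def read_doc_with_antiword(doc_path, antiword="antiword"):
    """Return the text the converter prints for doc_path, or '' if it fails.

    Hypothetical helper: `antiword` is the executable name (or full path)
    and is looked up on PATH, as set up in the steps above.
    """
    exe = shutil.which(antiword)
    if exe is None:
        raise FileNotFoundError(antiword + " not found on PATH")
    # Pass arguments as a list so no shell quoting is needed
    result = subprocess.run([exe, doc_path], capture_output=True)
    if result.returncode != 0:
        return ""
    # Assumption: output is UTF-8; undecodable bytes are replaced
    return result.stdout.decode("utf-8", errors="replace")
```

It could be called as `text = read_doc_with_antiword(os.path.join(filepath, file))` inside the loop shown earlier.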
Hope it helps, Thanks.
Source: https://stackoverflow.com/questions/51727237/reading-doc-file-in-python-using-antiword-in-windows-also-docx