Any recomendations on a method to convert .doc, .ppt, and .xls to plain text on linux using python? Really any method of conversion would be useful. I have already looked at
Same problem here. Below is my simple script to convert all doc files in dir 'docs/' to dir 'txts/' using catdoc. Hope it will help someone:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob, re, os
f = glob.glob('docs/*.doc') + glob.glob('docs/*.DOC')
outDir = 'txts'
if not os.path.exists(outDir):
os.makedirs(outDir)
for i in f:
os.system("catdoc -w '%s' > '%s'" %
(i, outDir + '/' + re.sub(r'.*/([^.]+)\.doc', r'\1.txt', i,
flags=re.IGNORECASE)))