Python: convert Microsoft Office docs to plain text on Linux

庸人自扰 2020-12-06 05:49

Any recommendations on a method to convert .doc, .ppt, and .xls to plain text on Linux using Python? Really any method of conversion would be useful. I have already looked at

7 answers
  •  情书的邮戳
    2020-12-06 06:32

    Same problem here. Below is my simple script to convert all doc files in dir 'docs/' to dir 'txts/' using catdoc. Hope it will help someone:

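    #!/usr/bin/env python3
    # Convert every .doc file in docs/ to plain text in txts/ using catdoc.
    import glob
    import os
    import subprocess

    out_dir = 'txts'
    os.makedirs(out_dir, exist_ok=True)

    # Match both .doc and .DOC (glob is case-sensitive on Linux).
    for src in glob.glob('docs/*.doc') + glob.glob('docs/*.DOC'):
        # Swap the extension for .txt, keeping any other dots in the name.
        name = os.path.splitext(os.path.basename(src))[0] + '.txt'
        # Pass an argument list (no shell), so quotes in file names are safe;
        # -w disables catdoc's line wrapping.
        with open(os.path.join(out_dir, name), 'wb') as out:
            subprocess.run(['catdoc', '-w', src], stdout=out)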
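
    The question also asks about .ppt and .xls. The catdoc package normally ships two companion tools, catppt and xls2csv, so the same approach might be extended roughly as below. This is only a sketch: it assumes both tools are installed on your system, and the names TOOLS, command_for and to_text are made up for illustration.

```python
import os
import subprocess

# Assumed mapping from file extension to a tool from the catdoc suite.
# Check that catppt and xls2csv are actually installed on your system.
TOOLS = {
    '.doc': ['catdoc', '-w'],
    '.ppt': ['catppt'],
    '.xls': ['xls2csv'],
}


def command_for(path):
    """Return the argument list that converts `path`, or None if unsupported."""
    ext = os.path.splitext(path)[1].lower()
    tool = TOOLS.get(ext)
    return tool + [path] if tool else None


def to_text(path):
    """Run the matching tool and return its raw output (None if unsupported)."""
    cmd = command_for(path)
    if cmd is None:
        return None
    # Output is returned as bytes; its encoding depends on how the tool is configured.
    return subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout
```

    For example, to_text('docs/report.doc') would run catdoc -w docs/report.doc and hand back whatever it prints. Note that xls2csv produces CSV rather than free-form text, which may or may not be what you want.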
    
