Python: convert Microsoft Office docs to plain text on Linux

庸人自扰 2020-12-06 05:49

Any recommendations on a method to convert .doc, .ppt, and .xls to plain text on Linux using Python? Really any method of conversion would be useful. I have already looked at

7 Answers
  • 2020-12-06 06:24

    I've had some success in the past using XSLT to process the XML-based Office files into something usable. It's not necessarily a Python-based solution, but it does get the job done.
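
    As a rough illustration of working with the XML inside the newer formats (this is plain XML parsing rather than XSLT, and it does not apply to the legacy binary .doc files), the sketch below pulls paragraph text out of a .docx using only the Python standard library. It assumes the usual layout of a .docx archive, where the main body lives in word/document.xml; the function name docx_to_text is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the text of a .docx file, one line per paragraph (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

    Tabs, line breaks inside a paragraph, headers and footnotes are ignored here, so treat it as a starting point only.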

  • 2020-12-06 06:30

    At the command line, antiword or wv work very nicely for .doc files. (Not a Python solution, but they're easy to install and fast.)

  • 2020-12-06 06:32

    Same problem here. Below is my simple script to convert all doc files in dir 'docs/' to dir 'txts/' using catdoc. Hope it will help someone:

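    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    import glob, os, subprocess
    
    # Collect .doc files with either a lower-case or upper-case extension.
    docs = glob.glob('docs/*.doc') + glob.glob('docs/*.DOC')
    
    out_dir = 'txts'
    os.makedirs(out_dir, exist_ok=True)
    
    for path in docs:
        # docs/report.doc -> txts/report.txt (also works for names containing dots)
        base = os.path.splitext(os.path.basename(path))[0]
        out_path = os.path.join(out_dir, base + '.txt')
        # -w turns off catdoc's line wrapping; passing an argument list avoids
        # the shell, so file names containing quotes no longer break the command.
        with open(out_path, 'wb') as out:
            subprocess.call(['catdoc', '-w', path], stdout=out)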
    
  • 2020-12-06 06:34

    I'd go for the command-line solution (and then use the Python subprocess module to run the tools from Python).

    Converters for MS Word (catdoc), Excel (xls2csv) and PowerPoint (catppt) can be found (in source form) here: http://vitus.wagner.pp.ru/software/catdoc/.

    I can't really comment on the usefulness of catppt, but catdoc and xls2csv work great!

    But be sure to search your distribution's repositories first. On Ubuntu, for example, catdoc is just one quick apt-get away.
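
    A minimal sketch of that approach, assuming catdoc, xls2csv and catppt are installed and on the PATH. The EXTRACTORS table and both function names are invented for this example, and each tool is called with nothing but the file name, so check the man pages for encoding and formatting options.

```python
import os
import subprocess

# Which command-line tool handles which extension (all three come with catdoc).
EXTRACTORS = {".doc": "catdoc", ".xls": "xls2csv", ".ppt": "catppt"}


def build_command(path):
    """Return the argument list for converting `path` (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    try:
        tool = EXTRACTORS[ext]
    except KeyError:
        raise ValueError("no converter known for %r" % path)
    return [tool, path]


def office_to_text(path):
    """Run the matching tool and return its standard output as bytes."""
    # An argument list (no shell) keeps odd file names from being misparsed.
    result = subprocess.run(build_command(path), stdout=subprocess.PIPE, check=True)
    return result.stdout
```

    office_to_text('docs/report.doc') returns raw bytes; decode them with whatever output character set the tool is configured to use.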

  • 2020-12-06 06:44

    For dealing with Excel spreadsheets, xlwt is good, although it only writes .xls files; its companion library xlrd is the one that reads them. Neither will help with .doc and .ppt files.

    (You may have also heard of PyExcelerator. xlwt is a fork of it and is better maintained, so I think you'd be better off with xlwt.)

  • 2020-12-06 06:46

    The usual tool for converting Microsoft Office documents to HTML or other formats was mswordview, which has since been renamed to wvWare.

    If you're looking for a command-line tool, they actually recommend using AbiWord to perform the conversion:

    abiword --to=txt yourfile.doc
    

    If you're looking for a library, start on the wvWare overview page. They also maintain a list of libraries and tools which read MS Office documents.
