How can doc/docx files be converted to markdown or structured text?

难免孤独 2021-01-29 21:45

Is there a program or workflow to convert .doc or .docx files to Markdown or similar text?

PS: Ideally, I would welcome the option that a spec

11 answers
  •  你的背包
    2021-01-29 22:33

    Options

    1. Use a conversion tool for batch (multi-file) conversion.
    2. Use a WYSIWYG editor for single files and better font handling.


    Which Conversion Tools?

    I've tested these three: (1) Pandoc, (2) Mammoth and (3) w2m.


    Pandoc

    By far the best tool of the three for conversions, with support for a multitude of file types (see Pandoc's man page for the full list):

    pandoc -f docx -t gfm somedoc.docx -o somedoc.md
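
For the multi-file case mentioned under Options, the single-file command above can be wrapped in a loop. This is only a minimal sketch, assuming pandoc is on the PATH. The function names md_name and convert_dir are made up for illustration and are not part of pandoc.

```shell
#!/bin/sh
# Batch-convert every .docx file in a directory to GitHub-flavoured Markdown.
# Assumption: pandoc is installed and on PATH. Function names are illustrative.

# Derive the output name: report.docx -> report.md
md_name() {
    printf '%s\n' "${1%.docx}.md"
}

# Convert each .docx in the given directory (default: current directory).
convert_dir() {
    dir="${1:-.}"
    for src in "$dir"/*.docx; do
        # With no matches the pattern stays literal, so skip anything
        # that is not a regular file.
        [ -f "$src" ] || continue
        pandoc -f docx -t gfm "$src" -o "$(md_name "$src")"
    done
}
```

After sourcing the file, something like convert_dir ./docs should write one .md file next to each .docx in that directory. Existing .md files with the same name would be overwritten, so try it on a copy first.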
    


    NB
    • To get pandoc to export Markdown tables ('pipe_tables' in pandoc), use the MultiMarkdown (-t markdown_mmd) or gfm (-t gfm) output format.

    • When converting to PDF, pandoc goes through LaTeX, so you may need to install a LaTeX distribution for your OS if PDF output does not work out of the box. Instructions at LaTeX Installation
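
One way to find out whether the LaTeX requirement is met before attempting a PDF conversion is to look for an engine on the PATH first. The sketch below is an assumption about how one might do that, not something pandoc requires. find_pdf_engine and md_to_pdf are hypothetical helper names. --pdf-engine is a real pandoc option, and pdflatex is tried first because it is pandoc's default engine.

```shell
#!/bin/sh
# Print the first LaTeX engine found on PATH; return 1 if none is found.
find_pdf_engine() {
    for engine in pdflatex xelatex lualatex; do
        if command -v "$engine" >/dev/null 2>&1; then
            printf '%s\n' "$engine"
            return 0
        fi
    done
    return 1
}

# Convert a Markdown file to PDF only when an engine is available.
md_to_pdf() {
    engine="$(find_pdf_engine)" || {
        echo "No LaTeX engine found; install a LaTeX distribution first." >&2
        return 1
    }
    pandoc "$1" -o "${1%.md}.pdf" --pdf-engine="$engine"
}
```

Calling md_to_pdf infile.md would then either produce infile.pdf or fail early with a readable message instead of a pandoc error.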


    Which WYSIWYG Editors?

    Writeage

    In answer to this specific question (docx --> markdown), use the Writeage plugin for Microsoft Word. It also works the other way round (markdown --> docx).


    If you wish to preserve Unicode characters and emojis and keep better fonts, you'll get some mileage from the editors below when copying and pasting between file formats. Note that these do not read or write docx natively.

    • Typora
    • iA Writer
    • Markdown Viewer for Chrome


    Update: A4 vs US Letter

    Outside the US, set the geometry variable to get A4 paper instead of US Letter in PDF output:

    pandoc -s -V geometry:a4paper -o outfile.pdf infile.md
    


    Footnote

    It's worth mentioning something that is not obvious when first discovering Markdown: MultiMarkdown is by far the most feature-rich Markdown format, supporting, among other things, metadata, tables of contents, footnotes, maths, tables and YAML.

    But GitHub's default format is gfm, which also supports tables. I use gfm for GitHub/GitLab and MultiMarkdown for everything else.
