Python: How to convert Markdown-formatted text to plain text

迷失自我 2020-12-23 09:23

I need to convert Markdown text to plain text so that I can display a summary on my website. I would like to do this in Python.

4 Answers
  • 2020-12-23 10:02

    This module will help do what you describe:

    http://www.freewisdom.org/projects/python-markdown/Using_as_a_Module

    Once you have converted the Markdown to HTML, you can use an HTML parser to extract the plain text.

    Your code might look something like this:

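    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    from markdown import markdown  # pip install markdown
    
    # Render the Markdown source to HTML, then keep only the text nodes.
    html = markdown(some_markdown_string)
    text = ''.join(BeautifulSoup(html, 'html.parser').find_all(string=True))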
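If adding BeautifulSoup as a dependency is not desirable, the "strip the HTML" step could also be sketched with the standard library's html.parser module. This is only a minimal illustration: TextExtractor and html_to_text are hypothetical names, and the HTML string is hard-coded here to stand in for the output of markdown().

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes only; tags and HTML comments are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags (entities are already decoded).
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


# Hard-coded HTML standing in for the result of markdown(...)
html = '<p><strong>A</strong> <a href="http://example.com">B</a> <!-- C --></p>'
text = html_to_text(html)
print(text.strip())
```

Because handle_comment is not overridden, comment contents never reach the result. Text inside script or style elements would still be collected, so this sketch assumes the HTML contains neither.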
    
  • 2020-12-23 10:03

    This is similar to Jason's answer, but handles comments correctly.

    import markdown # pip install markdown
    from bs4 import BeautifulSoup # pip install beautifulsoup4
    
    def md_to_text(md):
        html = markdown.markdown(md)
        soup = BeautifulSoup(html, features='html.parser')
        return soup.get_text()
    
    def example():
        md = '**A** [B](http://example.com) <!-- C -->'
        text = md_to_text(md)
        print(text)
        # Output: A B
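Since the question asks for a summary, the plain text produced by a function like md_to_text could then be shortened with the standard library. The sketch below assumes a hypothetical helper named summarize and a hard-coded input string; the width and placeholder values are arbitrary choices.

```python
import textwrap


def summarize(text, width=80):
    # shorten() collapses runs of whitespace and cuts at a word boundary,
    # appending the placeholder only when something was dropped.
    return textwrap.shorten(text, width=width, placeholder="...")


# Stand-in for text already converted from Markdown to plain text
plain = "A B and   quite a lot more text that will not fit into a short summary"
summary = summarize(plain, width=30)
print(summary)
```

The result never exceeds the requested width, placeholder included, which makes it convenient for fixed-size preview fields.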
    
  • 2020-12-23 10:08

    Although this is a very old question, I'd like to suggest a solution I came up with recently. It neither uses BeautifulSoup nor incurs the overhead of converting to HTML and back.

    The markdown module's core class, Markdown, has an output_formats property. It is not exposed as a configuration option, but it can be patched like almost anything else in Python. The property is a dict that maps each output format name to a rendering function. By default it holds two formats, 'html' and 'xhtml'. With a little help it can also hold a plain-text rendering function, which is easy to write:

    from markdown import Markdown
    from io import StringIO
    
    
    def unmark_element(element, stream=None):
        if stream is None:
            stream = StringIO()
        if element.text:
            stream.write(element.text)
        for sub in element:
            unmark_element(sub, stream)
        if element.tail:
            stream.write(element.tail)
        return stream.getvalue()
    
    
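    # Patch Markdown: register the plain-text serializer as an extra output format.
    Markdown.output_formats["plain"] = unmark_element
    __md = Markdown(output_format="plain")
    # Plain-text output has no top-level wrapper tags for Markdown to strip.
    __md.stripTopLevelTags = False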
    
    
    def unmark(text):
        return __md.convert(text)
    

    The unmark function takes Markdown text as input and returns it with all of the Markdown markup stripped out.
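To see what the recursive walk in unmark_element does, it can be run on its own against a small ElementTree element, the same kind of object the rendering function receives. This sketch uses only the standard library; the XML string is a hand-written assumption standing in for the tree Markdown would build.

```python
import xml.etree.ElementTree as ET
from io import StringIO


def unmark_element(element, stream=None):
    # Same traversal as above: text before the children, then each child,
    # then the tail text that follows the element's closing tag.
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# Hand-written tree standing in for a parsed Markdown paragraph
root = ET.fromstring(
    '<p><strong>A</strong> <a href="http://example.com">B</a> and more</p>'
)
plain = unmark_element(root)
print(plain)  # A B and more
```

The tail attribute is what carries " and more" here: ElementTree stores the text following a closing tag on the element that was just closed, so skipping it would lose everything between and after inline elements.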

  • 2020-12-23 10:23

    I commented and then removed it, because I think I finally see the rub here: it may be easier to convert your Markdown text to HTML and then strip the HTML from the result. I'm not aware of anything that removes Markdown from text effectively, but there are many HTML-to-plain-text solutions.
