strip tags python

深忆病人 2020-12-17 23:07

I want the following functionality:

input: this is test <b> bold text </b> normal text
expected output: this is test normal text
9 answers
  • 2020-12-17 23:37

    Try with:

    import re

    # Removes the tags themselves; the text between them is kept.
    text = 'this is test <b> bold text </b> normal text'
    output = re.compile(r'<[^<]*?/?>').sub('', text)
    print(output)
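
    Note that the regex above removes only the tags and keeps the text between them, whereas the expected output in the question also drops the bold text. A minimal sketch of that behavior, assuming the standard library's html.parser is acceptable; the names DropTagContent and strip_with_content are made up for illustration:

```python
from html.parser import HTMLParser


class DropTagContent(HTMLParser):
    """Collects text, skipping everything inside the given tags."""

    def __init__(self, drop=('b',)):
        super().__init__()
        self.drop = set(drop)   # tag names whose content is discarded
        self.depth = 0          # how many dropped tags are currently open
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.drop:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.drop and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a dropped tag.
        if not self.depth:
            self.parts.append(data)


def strip_with_content(html, drop=('b',)):
    parser = DropTagContent(drop)
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind by the removed element.
    return ' '.join(''.join(parser.parts).split())


print(strip_with_content('this is test <b> bold text </b> normal text'))
# this is test normal text
```

    This assumes the dropped tags always have a closing tag; a void element such as br in the drop list would never be closed and would swallow the rest of the input.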
    
  • 2020-12-17 23:41

    This is working code taken from my project Supybot, so it's fairly well tested:

    import htmlentitydefs  # Python 2 only
    import sgmllib         # Python 2 only; removed in Python 3

    def normalizeWhitespace(s):
        # Stand-in for Supybot's helper of the same name (assumed behavior):
        # collapse runs of whitespace into single spaces.
        return ' '.join(s.split())

    class HtmlToText(sgmllib.SGMLParser):
        """Taken from some eff-bot code on c.l.p."""
        entitydefs = htmlentitydefs.entitydefs.copy()
        entitydefs['nbsp'] = ' '
        def __init__(self, tagReplace=' '):
            self.data = []
            self.tagReplace = tagReplace
            sgmllib.SGMLParser.__init__(self)

        def unknown_starttag(self, tag, attr):
            self.data.append(self.tagReplace)

        def unknown_endtag(self, tag):
            self.data.append(self.tagReplace)

        def handle_data(self, data):
            self.data.append(data)

        def getText(self):
            text = ''.join(self.data).strip()
            return normalizeWhitespace(text)

    def htmlToText(s, tagReplace=' '):
        """Turns HTML into text.  tagReplace is a string to replace HTML tags with.
        """
        x = HtmlToText(tagReplace)
        x.feed(s)
        return x.getText()

    As the docstring notes, it originated with Fredrik Lundh, not me. As they say, great authors steal :)
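
    The sgmllib and htmlentitydefs modules exist only in Python 2. A rough Python 3 equivalent might look like the sketch below. It swaps in html.parser.HTMLParser for SGMLParser, and normalize_whitespace is a hypothetical stand-in for the whitespace helper; neither is taken from Supybot itself.

```python
from html.parser import HTMLParser


def normalize_whitespace(text):
    # Hypothetical helper: collapse runs of whitespace into single spaces.
    return ' '.join(text.split())


class HtmlToText(HTMLParser):
    """Replaces every tag with tag_replace and keeps the text."""

    def __init__(self, tag_replace=' '):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.data = []
        self.tag_replace = tag_replace

    def handle_starttag(self, tag, attrs):
        self.data.append(self.tag_replace)

    def handle_endtag(self, tag):
        self.data.append(self.tag_replace)

    def handle_data(self, data):
        self.data.append(data)

    def get_text(self):
        return normalize_whitespace(''.join(self.data))


def html_to_text(s, tag_replace=' '):
    """Turns HTML into text; tag_replace is substituted for each tag."""
    parser = HtmlToText(tag_replace)
    parser.feed(s)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


print(html_to_text('this is test <b> bold text </b> normal text'))
# this is test bold text normal text
```

    Like the original, this keeps the text inside the tags and only replaces the tags themselves.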

  • 2020-12-17 23:42

    If you don't mind Python (although regexps are fairly generic), you can take some inspiration from Django's strip_tags filter.

    Reproduced here for completeness:

    import re
    from django.utils.encoding import force_unicode  # older Django releases

    def strip_tags(value):
        """Returns the given HTML with all tags stripped."""
        return re.sub(r'<[^>]*?>', '', force_unicode(value))
    

    EDIT: If you're using this, or any other regexp solution, keep in mind that it lets through carefully crafted HTML (see comment) as well as HTML comments, and hence should not be used with untrusted input. Consider using one of the BeautifulSoup, html5lib, or lxml answers for untrusted input instead.
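
    To make the caveat concrete, here is a small sketch that runs the same pattern on a few awkward inputs. It leaves out Django's force_unicode and works on plain Python 3 strings, so it illustrates the regex only, not Django's filter:

```python
import re

TAG_RE = re.compile(r'<[^>]*?>')


def strip_tags(value):
    """Removes anything that looks like <...> from the string."""
    return TAG_RE.sub('', value)


# Well-formed input: the tags go, the text between them stays.
print(repr(strip_tags('this is test <b> bold text </b> normal text')))
# 'this is test  bold text  normal text'

# An unterminated tag has no closing '>', so the pattern never matches it.
print(repr(strip_tags('<img src=x onerror=alert(1) ')))
# '<img src=x onerror=alert(1) '

# A '>' inside a comment ends the match early and the rest leaks through.
print(repr(strip_tags('a<!-- hidden > still hidden -->b')))
# 'a still hidden -->b'

# A '>' inside an attribute value does the same.
print(repr(strip_tags('<a title=">">link</a>')))
# '">link'
```

    None of these cases is exotic, which is why a real HTML parser is the safer choice for input you do not control.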
