Python regex - stripping out HTML tags and formatting characters from inner HTML

故里飘歌 2021-01-29 04:37

I'm dealing with single HTML strings like this:

>>> s = 'u>
\n Some text

1 answer
    暗喜 (original poster)
    2021-01-29 04:51

    If I understand you right, you're looking to take this input:

    u>
    \n Some text

    And receive this output:

    \n                                    Some text 
    

    This is done simply enough by only caring about what comes between the two inward-pointing brackets. We want:

    • A right-bracket > (so we know where to begin)
    • Some text, here '\n Some text ' (the content), which does not contain a left-bracket
    • A left-bracket < (so we know where to end)

    You want:

    >>> s = 'u>
    \n Some text

    >>> re.search(r'>([^<]+)<', s)
    <_sre.SRE_Match object; span=(6, 55), match='>\n                                    Some text <'>

    (The captured group can be accessed via .group(1).)
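Because the sample string above lost its tags when the page was rendered, here is a minimal self-contained sketch of the same search. The input string, its tag name and its attribute are placeholders chosen for illustration, not the asker's exact data.

```python
import re

# Hypothetical input: the <td> tag and its class attribute are made up for this example.
s = '<td class="cell">\n    Some text </td>'

# '>' marks the start, [^<]+ captures one or more non-'<' characters, '<' marks the end.
match = re.search(r'>([^<]+)<', s)

# re.search returns None when nothing matches, so guard before calling .group().
inner = match.group(1) if match else None

print(repr(inner))
```

Note that .group(0) would still include the surrounding > and <, which is why the capture group is used.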

    Additionally, you may want to use re.findall if you expect there to be multiple matches per line:

    >>> re.findall(r'>([^<]+)<', s)
    ['\n                                    Some text ']
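One thing to watch for with re.findall on markup with several tags: whitespace that sits between two adjacent tags also satisfies the pattern. A small sketch, using a made-up table row as input, shows the effect and one possible way to drop such entries.

```python
import re

# Hypothetical input with a newline between the two cells.
s = '<tr><td>First</td>\n<td>Second</td></tr>'

parts = re.findall(r'>([^<]+)<', s)
print(parts)  # the lone '\n' between </td> and <td> is captured too

# Keep only the pieces that contain something other than whitespace.
non_blank = [p for p in parts if p.strip()]
print(non_blank)
```

Whether to filter those pieces depends on whether the whitespace between tags matters for your data.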
    

    EDIT: To address the comment: If you have multiple matches and you want to connect them into a single string (effectively removing all HTML-like tag things), do:

    >>> s = 'nbsp;

    Some text.
    Some \n more text.
    >>> ' '.join(re.findall(r'>([^<]+)<', s))
    'Some text. Some \n more text.'
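The joined result still contains the embedded \n. If the formatting characters should go as well, one option (an addition to the answer, not part of it) is to collapse all runs of whitespace after joining. The input below is a hypothetical stand-in, since the original string did not survive on this page.

```python
import re

# Hypothetical input: two paragraphs, the second with an embedded newline.
s = '<p>Some text.</p><p>Some \n more text.</p>'

# Join every captured piece with a single space, as in the answer.
joined = ' '.join(re.findall(r'>([^<]+)<', s))
print(repr(joined))

# str.split() with no arguments splits on any whitespace run (spaces, \n, \t),
# so re-joining with ' ' leaves exactly one space between words.
collapsed = ' '.join(joined.split())
print(repr(collapsed))
```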
