Strip HTML from strings in Python

难免孤独 2020-11-22 02:50
from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)

When p

26 Answers
  •  小鲜肉 (original poster)
    2020-11-22 02:57

    2020 Update

    Use the Mozilla Bleach library. It lets you choose which tags and which attributes to keep, and it can also filter attributes based on their values.

    Here are 2 cases to illustrate

    1) Do not allow any HTML tags or attributes

    Take sample raw text

    raw_text = """
    

    Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETCCryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform’s users. Also as part […]

    The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News.

    """

    2) Remove all HTML tags and attributes from raw text

    # Do not allow any tags or any attributes
    from bleach.sanitizer import Cleaner
    # The styles argument was removed in Bleach 5.0, so it is not passed here
    cleaner = Cleaner(tags=[], attributes={}, protocols=[], strip=True, strip_comments=True, filters=None)
    print(cleaner.clean(raw_text))
    

    Output

    Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform’s users. Also as part […]
    The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 
    

    3) Allow only the img tag with the srcset attribute

    from bleach.sanitizer import Cleaner
    # Allow only img tags, and on them only the srcset attribute
    # The styles argument was removed in Bleach 5.0, so it is not passed here
    cleaner = Cleaner(tags=['img'], attributes={'img': ['srcset']}, protocols=[], strip=True, strip_comments=True, filters=None)
    print(cleaner.clean(raw_text))
    

    Output

    Cryptocurrency exchange Okex reveals it suffered the $5.6 million loss as a result of the double-spend carried out by the attacker(s) in Ethereum Classic 51% attack. Okex says it fully absorbed the loss as per its user-protection policy while insisting that the attack did not cause any loss to the platform’s users. Also as part […]
    The post Ethereum Classic 51% Attack: Okex Crypto Exchange Suffers $5.6 Million Loss, Contemplates Delisting ETC appeared first on Bitcoin News. 
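
The answer mentions filtering attributes based on their values but does not show it. A minimal sketch, assuming a recent Bleach release where an entry in attributes may be a callable taking (tag, name, value): the helper allow_trusted_img_attr, the host example.com, and the sample_html string are hypothetical and not part of the original answer.

```python
from urllib.parse import urlparse

from bleach.sanitizer import Cleaner


def allow_trusted_img_attr(tag, name, value):
    """Hypothetical filter: keep alt always, keep src only for one trusted host."""
    if name == "alt":
        return True
    if name == "src":
        # Decide based on the attribute's value, not just its name
        return urlparse(value).netloc == "example.com"
    return False


# Only img survives; every other tag is stripped but its text is kept
cleaner = Cleaner(
    tags={"img"},
    attributes={"img": allow_trusted_img_attr},
    strip=True,
    strip_comments=True,
)

# Hypothetical input, made up for illustration
sample_html = (
    '<p>Hello <b>world</b>'
    '<img src="https://example.com/a.png" alt="A">'
    '<img src="https://evil.test/b.png" alt="B"></p>'
)

cleaned = cleaner.clean(sample_html)
print(cleaned)
```

Under these assumptions, the second img should keep its alt attribute but lose its src, because the callable rejects values that point to any other host.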
    
