Strip HTML from strings in Python

难免孤独 2020-11-22 02:50
from mechanize import Browser

br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
    print(line)

When p

26 answers
  •  独厮守ぢ
    2020-11-22 02:55

    I'm parsing GitHub READMEs, and I find that the following works well:

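    import re
    import lxml.html

    def strip_markdown(x):
        # [text](url) -> text; [^\]]+ keeps each match inside one pair of brackets
        links_sub = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', x)
        # **bold** -> bold (runs before the single-asterisk pass)
        bold_sub = re.sub(r'\*\*([^*]+)\*\*', r'\1', links_sub)
        # *emphasis* -> emphasis
        emph_sub = re.sub(r'\*([^*]+)\*', r'\1', bold_sub)
        return emph_sub

    def strip_html(x):
        # text_content() returns the text of the parsed fragment without tags
        return lxml.html.fromstring(x).text_content() if x else ''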
    

    And then:

    readme = """
    
                sky is a web scraping framework, implemented with the latest python versions in mind (3.4+). 
                It uses the asynchronous `asyncio` framework, as well as many popular modules 
                and extensions.
    
                Most importantly, it aims for **next generation** web crawling where machine intelligence 
                is used to speed up the development/maintainance/reliability of crawling.
    
                It mainly does this by considering the user to be interested in content 
                from *domains*, not just a collection of *single pages*
                ([templating approach](#templating-approach))."""
    
    strip_markdown(strip_html(readme))
    

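    This removes the HTML tags and the Markdown links, bold, and emphasis markers. Other Markdown, such as the backticks around `asyncio`, is left as-is.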
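
For readers who cannot install lxml, here is a rough standard-library sketch of the same two-step idea. It swaps lxml for `html.parser.HTMLParser`, so its handling of malformed HTML may differ. The names `_TextExtractor` and `strip_html_stdlib` are hypothetical, chosen only for this illustration, and the link pattern uses `[^\]]+` so that two links on one line are matched separately.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html_stdlib(text):
    """Return the text of `text` with tags dropped; '' for empty input."""
    if not text:
        return ''
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


def strip_markdown(text):
    """Remove Markdown links, bold, and emphasis markers (nothing else)."""
    text = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', text)  # [text](url) -> text
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)        # **bold** -> bold
    text = re.sub(r'\*([^*]+)\*', r'\1', text)            # *emph* -> emph
    return text


sample = ("<p>It aims for **next generation** crawling of *domains* "
          "([templating approach](#templating-approach)).</p>")
result = strip_markdown(strip_html_stdlib(sample))
print(result)
```

As in the lxml version, HTML is stripped first and the Markdown pass runs on the remaining text.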
