Removing HTML tags when crawling Wikipedia with Python's urllib2 and BeautifulSoup

Asked by 感动是毒 on 2021-01-14 10:03

I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of all the tags so that only the plain article text is left?
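
(For context, a minimal sketch of the urllib2 + BeautifulSoup 3 approach the question describes; the URL matches the one used in the answer below, and the tag-stripping step is illustrative rather than the asker's actual code:)

    import urllib2
    from BeautifulSoup import BeautifulSoup

    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"

    req = urllib2.Request(URL, headers={'User-agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(req).read()
    soup = BeautifulSoup(html)

    # Drop the [1]-style citation superscripts before extracting text.
    for sup in soup.findAll('sup', {'class': 'reference'}):
        sup.extract()

    # findAll(text=True) yields only the text nodes, i.e. everything
    # that remains once the tags themselves are discarded.
    first_paragraph = soup.find('div', {'class': 'mw-content-ltr'}).find('p')
    print ''.join(first_paragraph.findAll(text=True))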

3 Answers
  •  萌比男神i
    2021-01-14 10:25

    This is how you could do it with lxml (and the lovely requests):

    import requests
    import lxml.html as lh
    from BeautifulSoup import UnicodeDammit

    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
    HEADERS = {'User-agent': 'Mozilla/5.0'}

    def lhget(*args, **kwargs):
        # Fetch the page, let UnicodeDammit guess its encoding,
        # then hand the decoded markup to lxml.
        r = requests.get(*args, **kwargs)
        html = UnicodeDammit(r.content).unicode
        tree = lh.fromstring(html)
        return tree

    def remove(el):
        # Detach an element from the tree via its parent.
        el.getparent().remove(el)

    tree = lhget(URL, headers=HEADERS)

    # First paragraph of the article body.
    el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]

    # Drop the [1]-style citation superscripts. The leading '.' keeps the
    # search relative to this paragraph; a bare '//' would scan the whole
    # document.
    for ref in el.xpath(".//sup[@class='reference']"):
        remove(ref)

    print lh.tostring(el, pretty_print=True)  # cleaned markup

    print el.text_content()  # plain text, all tags gone
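
    text_content() is what actually flattens the markup: it concatenates every text node under the element, so once the reference superscripts have been detached, no tags survive in the output. UnicodeDammit is only there to guess the page encoding before lxml parses it. If you want the whole article rather than just the first paragraph, the same pattern extends naturally (a sketch, reusing the tree and remove() from above):

    # Strip citations everywhere, then join the text of every paragraph.
    for ref in tree.xpath("//sup[@class='reference']"):
        remove(ref)
    paragraphs = tree.xpath("//div[@class='mw-content-ltr']/p")
    print "\n\n".join(p.text_content() for p in paragraphs)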
    
