Removing HTML tags when crawling Wikipedia with Python's urllib2 and BeautifulSoup

Asked by 感动是毒 on 2021-01-14 10:03

I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of all the tags so that only the plain article text is left?
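
(For context, a minimal sketch of the urllib2 + BeautifulSoup 3 approach the question describes; the URL matches the one used in the answer below, and the tag-stripping step is illustrative rather than the asker's actual code:)

    import urllib2
    from BeautifulSoup import BeautifulSoup

    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"

    req = urllib2.Request(URL, headers={'User-agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(req).read()
    soup = BeautifulSoup(html)

    # Drop the [1]-style citation superscripts before extracting text.
    for sup in soup.findAll('sup', {'class': 'reference'}):
        sup.extract()

    # findAll(text=True) yields only the text nodes, i.e. everything
    # that remains once the tags themselves are discarded.
    first_paragraph = soup.find('div', {'class': 'mw-content-ltr'}).find('p')
    print ''.join(first_paragraph.findAll(text=True))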

3 Answers
  •  萌比男神i
    2021-01-14 10:25

    This is how you could do it with lxml (and the lovely requests):

    import requests
    import lxml.html as lh
    from BeautifulSoup import UnicodeDammit

    URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
    HEADERS = {'User-agent': 'Mozilla/5.0'}

    def lhget(*args, **kwargs):
        # Fetch the page, let UnicodeDammit guess its encoding,
        # then hand the decoded markup to lxml.
        r = requests.get(*args, **kwargs)
        html = UnicodeDammit(r.content).unicode
        tree = lh.fromstring(html)
        return tree

    def remove(el):
        # Detach an element from the tree via its parent.
        el.getparent().remove(el)

    tree = lhget(URL, headers=HEADERS)

    # First paragraph of the article body.
    el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]

    # Drop the [1]-style citation superscripts. The leading '.' keeps the
    # search relative to this paragraph; a bare '//' would scan the whole
    # document.
    for ref in el.xpath(".//sup[@class='reference']"):
        remove(ref)

    print lh.tostring(el, pretty_print=True)  # cleaned markup

    print el.text_content()  # plain text, all tags gone
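
    text_content() is what actually flattens the markup: it concatenates every text node under the element, so once the reference superscripts have been detached, no tags survive in the output. UnicodeDammit is only there to guess the page encoding before lxml parses it. If you want the whole article rather than just the first paragraph, the same pattern extends naturally (a sketch, reusing the tree and remove() from above):

    # Strip citations everywhere, then join the text of every paragraph.
    for ref in tree.xpath("//sup[@class='reference']"):
        remove(ref)
    paragraphs = tree.xpath("//div[@class='mw-content-ltr']/p")
    print "\n\n".join(p.text_content() for p in paragraphs)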
    
