I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way of getting rid of the reference/footnote markers (the bracketed superscripts like [1]) from the article text?
This is how you could do it with lxml (and the lovely requests):
import requests
import lxml.html as lh
from BeautifulSoup import UnicodeDammit

URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
HEADERS = {'User-agent': 'Mozilla/5.0'}

def lhget(*args, **kwargs):
    # Fetch the page, let UnicodeDammit sort out the character
    # encoding, then parse the result into an lxml tree.
    r = requests.get(*args, **kwargs)
    html = UnicodeDammit(r.content).unicode
    tree = lh.fromstring(html)
    return tree

def remove(el):
    # Detach an element from its parent (lxml has no el.remove()).
    el.getparent().remove(el)

tree = lhget(URL, headers=HEADERS)
# First paragraph of the article body.
el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]
# Note the leading dot: './/' searches within el only, whereas
# '//' would search the entire document regardless of context.
for ref in el.xpath(".//sup[@class='reference']"):
    remove(ref)

print lh.tostring(el, pretty_print=True)
print el.text_content()
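Since the question mentions urllib2 and BeautifulSoup, the same stripping can be done on that stack as well. A minimal sketch, assuming BeautifulSoup 3 (the old-style from BeautifulSoup import), where extract() detaches a tag from the tree:

import urllib2
from BeautifulSoup import BeautifulSoup

URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
req = urllib2.Request(URL, headers={'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(req).read())

# Pull the citation markers ([1], [2], ...) out of the tree.
for sup in soup.findAll('sup', {'class': 'reference'}):
    sup.extract()

# First paragraph of the article body, as plain text
# (the classic BeautifulSoup 3 idiom for text extraction).
p = soup.find('div', {'class': 'mw-content-ltr'}).p
print ''.join(p.findAll(text=True))

Both versions take the same approach: delete the sup.reference nodes before extracting the paragraph text, so the [1]-style markers never show up in the output.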