I am learning to use both the re module and the urllib module in Python and am attempting to write a simple web scraper. Here's the code I've written:
You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.
BeautifulSoup example:
import urllib2
from bs4 import BeautifulSoup

# Python 2 code; url is the address of the page you want to scrape
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
Since a <title> tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
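To make the nested-tag problem concrete, here is a minimal sketch in Python 3 using only the standard library. The sample markup and the `DivTextCollector` class are made up for illustration; they are not from the question. A non-greedy regular expression stops at the first closing tag it sees, while a parser that tracks nesting depth recovers the whole element:

```python
import re
from html.parser import HTMLParser

# Hypothetical sample markup with one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# The non-greedy pattern stops at the first </div>, so the outer
# element is cut short and the result still contains raw markup.
naive = re.search(r"<div>(.*?)</div>", html).group(1)


class DivTextCollector(HTMLParser):
    """Collects text found inside <div> elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside at least one <div>.
        if self.depth > 0:
            self.text.append(data)


parser = DivTextCollector()
parser.feed(html)
parser.close()
full_text = "".join(parser.text)

print(repr(naive))      # 'outer <div>inner'
print(repr(full_text))  # 'outer inner tail'
```

BeautifulSoup does this kind of bookkeeping for you, which is why it is the more comfortable choice once the markup is nested.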
Your specific problem can be solved by optionally matching additional characters within the opening <title> tag:
r'<title[^>]*>([^<]+)</title>'
The [^>]* part matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both a tag with extra attributes and the plain <title> tag.
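As a quick check, the pattern can be tried against both forms of the tag. This is a small sketch with made-up sample strings (the `samples` list and the `TITLE_RE` name are illustrative, not from the question):

```python
import re

# <title, then any number of non-'>' characters (optional attributes),
# then '>', the captured title text, and the closing tag.
TITLE_RE = re.compile(r'<title[^>]*>([^<]+)</title>')

# Hypothetical inputs: one plain tag, one tag carrying attributes.
samples = [
    '<title>Plain title</title>',
    '<title id="main" lang="en">Title with attributes</title>',
]

titles = [TITLE_RE.search(s).group(1) for s in samples]
print(titles)  # ['Plain title', 'Title with attributes']
```

Note that the pattern as written is case-sensitive, so it would not match an uppercase <TITLE> tag unless you pass the re.IGNORECASE flag.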