When trying to extract the title of an HTML page I have always used the following regex:
(?<=<title.*>)([\\s\\S]*)(?=</title>)
What about something like:
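import re
# Raw string; [^>]* and the lazy *? stop at the first closing tag instead of the last.
r = re.compile(r"(<title[^>]*>)([\s\S]*?)(</title>)")
title = r.search(page).group(2)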
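If you just want the contents of the title tag, plain string methods are enough:
import urllib2  # Python 2; in Python 3 use urllib.request instead
html = urllib2.urlopen("http://somewhere").read()
for item in html.split("</title>"):
    if "<title>" in item:
        # len("<title>") == 7, so this slices just past the opening tag
        print item[item.find("<title>") + 7:]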
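The same idea can be wrapped in a small function that returns None when the page has no title. This is only a sketch in Python 3: `extract_title` is a hypothetical name, and it assumes a bare, lowercase `<title>` tag with no attributes.

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if absent."""
    # partition() splits on the first occurrence and returns three parts;
    # the middle part is empty when the marker was not found.
    _, opening, rest = html.partition("<title>")
    if not opening:
        return None
    title, closing, _ = rest.partition("</title>")
    if not closing:
        return None
    return title.strip()


page = "<html><head><title> Example page </title></head><body></body></html>"
print(extract_title(page))
```

Anything like `<TITLE>` or `<title lang="en">` slips past this, which is where the other approaches in this thread come in.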
Toss out the idea of parsing HTML with regular expressions and use an actual HTML parsing library instead. After a quick search I found this one. It's a much safer way to extract information from an HTML file.
Remember, HTML is not a regular language, so regular expressions are fundamentally the wrong tool for extracting information from it.
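For illustration, here is a rough sketch of the parser approach using only `html.parser` from the Python 3 standard library (not necessarily the library linked above). `TitleParser` and `get_title` are names made up for this example, and it only looks at the first `<title>` element.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> is handled too.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def get_title(html):
    """Return the stripped title text, or None if there is no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() or None


page = '<!-- <title>old</title> --><head><TITLE lang="en">Real title</TITLE></head>'
print(get_title(page))
```

Because the parser knows what a comment is, the commented-out title in `page` is skipped, something a plain regex search would get wrong.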
The regex for extracting the content of non-nested HTML/XML tags is actually very simple:
r = re.compile('<title[^>]*>(.*?)</title>')
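As written, that pattern will not match a title that spans several lines or uses uppercase tags. A minimal sketch of how it could be used with flags that cover both cases follows; `TITLE_RE` and `title_from_regex` are hypothetical names, and the input is assumed to be a Python 3 `str`.

```python
import re

# DOTALL lets "." match newlines; IGNORECASE also accepts <TITLE>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def title_from_regex(html):
    """Return the contents of the first title tag, or None if none is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


page = '<head><TITLE id="t">First\nline</TITLE><title>Second</title></head>'
print(repr(title_from_regex(page)))
```

The lazy `.*?` is what keeps the match from running on to the second `</title>`.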
However, for anything more complex, you should really use a proper HTML parser such as BeautifulSoup or the standard library's html.parser.
Here's a famous answer on parsing HTML with regular expressions that does a great job of saying, "don't use regex to parse HTML."