When trying to extract the title of an HTML page, I have always used the following regex:
(?<=<title>)([\s\S]*)(?=</title>)
The regex for extracting the content of non-nested HTML/XML tags is actually very simple (tag stands for the element name):
r = re.compile(r'<tag[^>]*>(.*?)</tag>')
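For the title case from the question, that pattern could be applied roughly like this. This is only a sketch: the sample `html` string is made up for illustration, and the `re.DOTALL` and `re.IGNORECASE` flags are my additions so that titles spanning several lines and upper-case tags are also matched.

```python
import re

# Hypothetical sample document, used only for illustration.
html = '<html><head><title lang="en">Example Page</title></head><body></body></html>'

# [^>]* skips any attributes on the opening tag; (.*?) is non-greedy so the
# match stops at the first closing tag. DOTALL lets "." cross newlines,
# IGNORECASE also accepts <TITLE>.
title_re = re.compile(r'<title[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE)

match = title_re.search(html)
title = match.group(1).strip() if match else None
print(title)
```

Unlike the lookbehind version, this also tolerates attributes on the opening tag, and the non-greedy group avoids running on to a later closing tag.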
However, for anything more complex, you should really use a proper HTML parser such as BeautifulSoup or lxml (urllib only fetches the page, it does not parse it).
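As a rough illustration of the parser route, here is a sketch that uses the standard library's html.parser so it runs without installing anything. It stands in for BeautifulSoup here rather than showing BeautifulSoup's API, and `TitleParser` and `extract_title` are names I made up for the example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or None if no title text was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    text = "".join(parser.title_parts).strip()
    return text or None


print(extract_title("<html><head><TITLE>\n My Page </TITLE></head></html>"))
```

The parser takes care of attributes, letter case, and line breaks, so none of that has to be encoded in a pattern.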