Use SGMLParser
. regex
works in simple case. But there are a lot of intricacy with HTML you rather not have to deal with.
>>> from sgmllib import SGMLParser
>>>
>>> class TextExtracter(SGMLParser):
... def __init__(self):
... self.text = []
... SGMLParser.__init__(self)
... def handle_data(self, data):
... self.text.append(data)
... def getvalue(self):
... return ''.join(ex.text)
...
>>> ex = TextExtracter()
>>> ex.feed('<html>hello > world</html>')
>>> ex.getvalue()
'hello > world'