I'm working on a small Python script to clean up HTML documents. It works by accepting a list of tags to KEEP and then parsing through the HTML code trashing tags that are
You may also consider using the HTML parser that is built into Python (documentation for Python 2 and Python 3).
This will help you home in on the specific area of the HTML document you would like to work on, and then use regular expressions on just that part.
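For what it's worth, here is a rough sketch of the "keep only these tags" idea using the built-in parser from Python 3 (`html.parser`). The names `TagFilter` and `strip_tags` are made up for this example, and it assumes you want to keep the text inside dropped tags and throw away comments and doctype declarations. Treat it as a starting point, not a finished cleaner.

```python
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Rebuild a document, dropping every tag whose name is not in `keep`."""

    def __init__(self, keep):
        # convert_charrefs=False so entities such as &amp; are passed through
        # the handlers below and can be re-emitted unchanged.
        super().__init__(convert_charrefs=False)
        self.keep = {name.lower() for name in keep}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            # get_starttag_text() returns the tag as written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are emitted once, without a closing tag.
        if tag in self.keep:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, even when its enclosing tag is dropped.
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_tags(html, keep):
    """Return `html` with every tag not listed in `keep` removed."""
    parser = TagFilter(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_tags('<div><p class="x">Hi <b>there</b></p></div>', ["p"]))
# <p class="x">Hi there</p>
```

Note that this keeps the contents of dropped tags, so if you drop `script` or `style` you would probably want to skip their text as well.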
Read:
Repent.
Use a real HTML parser, like BeautifulSoup.
Don't use regex to parse HTML. It will only give you headaches.
Use an HTML or XML parser instead. Try BeautifulSoup or lxml.
<TAG\b[^>]*>(.*?)</TAG>
Matches the opening and closing pair of a specific HTML tag.
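In Python, the pattern above could be tried like this. `TAG` is a placeholder for the real tag name; the helper `tag_pattern` is made up for this example, and the `re.IGNORECASE` and `re.DOTALL` flags are my assumption (so that `<B>` and multi-line contents also match).

```python
import re


def tag_pattern(name):
    """Compile the single-tag pattern with TAG replaced by `name`."""
    tag = re.escape(name)
    return re.compile(
        rf"<{tag}\b[^>]*>(.*?)</{tag}>",
        re.IGNORECASE | re.DOTALL,  # assumed flags, not part of the pattern
    )


html = '<p>Some <b class="x">bold</b> and <B>more bold</B> text.</p>'
print(tag_pattern("b").findall(html))
# ['bold', 'more bold']
```

The `\b` is what stops `<b` from matching `<body>`. The pattern still breaks on nested tags of the same name, since the lazy `(.*?)` stops at the first closing tag it sees.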
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
Will match the opening and closing pair of any HTML tag.
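A quick way to try the any-tag version in Python. The `\1` backreference makes the closing tag repeat whatever name the first group captured. Because the pattern is written with `[A-Z]`, I am assuming you would want `re.IGNORECASE` for lowercase markup; `ANY_TAG` and `find_elements` are names invented for this sketch.

```python
import re

ANY_TAG = re.compile(
    r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
    re.IGNORECASE | re.DOTALL,  # assumed flags, not part of the pattern
)


def find_elements(html):
    """Return (tag name, inner text) for each non-overlapping match."""
    return [(m.group(1), m.group(2)) for m in ANY_TAG.finditer(html)]


html = "<h1>Title</h1><p>Some <em>text</em></p>"
for name, inner in find_elements(html):
    print(name, repr(inner))
# h1 'Title'
# p 'Some <em>text</em>'
```

Notice that the `<em>` element is swallowed into the contents of `<p>` and never reported on its own, which is one reason the other answers recommend a real parser.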
See here.