from lxml.html.clean import clean_html, Cleaner
def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_
I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:
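# Beautiful Soup 4 is imported from the bs4 package; the old BeautifulSoup module was version 3.
from bs4 import BeautifulSoup
# Collect every text node in the parsed document and join them into one string.
''.join(BeautifulSoup(page, 'html.parser').findAll(text=True))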
where page is your string of HTML.
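If installing a third-party package is not an option, roughly the same idea can be sketched with only the standard library. This is not Beautiful Soup; it swaps in Python's built-in html.parser, and the names TextExtractor and strip_tags are made up for illustration. Unlike the one-liner above, this sketch also drops the contents of script and style elements, which is an assumption about what you want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(page):
    """Return the text content of the HTML string `page` (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Calling strip_tags(page) then plays the same role as the join over the text nodes. For badly broken markup, Beautiful Soup is likely to be more forgiving than this sketch.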
Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.