from lxml.html.clean import clean_html, Cleaner
def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_
I think you should check out Beautiful Soup. Use the advice from this article and strip the HTML elements in the following way:
from BeautifulSoup import BeautifulSoup
''.join(BeautifulSoup(page).findAll(text=True))
Where page is your string of HTML.
Should you need further clarification, you can check out the Dive into Python case study on HTML parsing.
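For anyone on a current Python, here is a minimal sketch of the same idea using the newer bs4 package rather than the old BeautifulSoup module imported above. It assumes bs4 is installed, and the sample page string is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical input, invented for this example.
page = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

# "html.parser" is the parser bundled with the standard library.
soup = BeautifulSoup(page, "html.parser")

# Collect every text node and join them with no separator.
text = "".join(soup.find_all(string=True))

# get_text() is a shortcut that gives the same result here.
same_text = soup.get_text()

print(text)
```

Joining with an empty string keeps inline text such as "Hello, world!" intact, but it also glues neighbouring block elements together, so pick a different separator if you need line breaks.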
The solution from David concatenates the text with no separator:
import lxml.html
document = lxml.html.document_fromstring(html_string)
# internally does: etree.XPath("string()")(document)
print(document.text_content())
but this one helped me, because it joins the text nodes with the separator I needed:
from lxml import etree
print("\n".join(etree.XPath("//text()")(document)))
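To make the difference between the two approaches concrete, here is a small side-by-side sketch. It assumes lxml is installed, and html_string is an invented sample, not anything from the question:

```python
import lxml.html

# Hypothetical input, invented for this example.
html_string = (
    "<html><body>"
    "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"
    "</body></html>"
)
document = lxml.html.document_fromstring(html_string)

# text_content() glues all text nodes together with no separator.
joined_no_separator = document.text_content()

# Selecting the text nodes yourself lets you choose the separator.
text_nodes = document.xpath("//text()")
joined_with_newlines = "\n".join(text_nodes)

print(joined_no_separator)   # TitleFirst paragraph.Second paragraph.
print(joined_with_newlines)  # one text node per line
```

Note that //text() yields one entry per text node, so text inside inline tags such as b or a would also end up on its own line.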
Not sure if this method existed when you asked your question, but if you use
document = lxml.html.document_fromstring(html_text)
raw_text = document.text_content()
That should return all the text content in the HTML document, minus the markup.
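One caveat worth checking on your own pages: as far as I can tell, text_content() also returns the contents of script and style elements, since those are text nodes too. A rough sketch of one way to drop them first, assuming lxml is installed and using an invented html_text sample:

```python
import lxml.html
from lxml import etree

# Hypothetical input, invented for this example.
html_text = (
    "<html><head><title>Demo</title>"
    "<style>p { color: red; }</style></head>"
    "<body><p>Visible text.</p>"
    "<script>var hidden = 1;</script></body></html>"
)
document = lxml.html.document_fromstring(html_text)

# Before cleanup: the script and style bodies are part of the text.
raw_text = document.text_content()

# Remove script and style elements together with their contents.
# with_tail=False keeps any text that follows the removed element.
etree.strip_elements(document, "script", "style", with_tail=False)

visible_text = document.text_content()
print(visible_text)
```

This is only one way to do it; the Cleaner class from the first snippet targets the same problem with more options.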