I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
If you need more speed and can accept less accuracy, you could use raw lxml:
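import lxml.html as lh
# Note: in lxml 5.2 and later, lxml.html.clean is provided by the
# separate lxml_html_clean package, which must be installed as well.
from lxml.html.clean import clean_html

def lxml_to_text(html):
    # Parse the HTML string into an element tree
    doc = lh.fromstring(html)
    # Strip scripts, styles, and other non-content markup
    doc = clean_html(doc)
    # Concatenate all remaining text nodes
    return doc.text_content()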
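If lxml isn't installed, roughly the same idea can be sketched with only the standard library. This is a different technique from the lxml version above (it uses `html.parser`, not lxml's cleaner), and the names `TextExtractor` and `stdlib_html_to_text` are made up for this example. It assumes that skipping `<script>` and `<style>` content and collapsing whitespace is good enough for your pages.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)


def stdlib_html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result reads like pasted text
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = "<p>Hello &amp; <b>world</b></p><script>var x = 1;</script>"
    print(stdlib_html_to_text(sample))
```

Note that this collapses everything onto one line and does not insert breaks at block elements, so adjacent paragraphs with no whitespace between their tags will run together. It is likely further from what a browser copy-paste gives you than the lxml version.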