I\'d like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
There is Pattern library for data mining.
http://www.clips.ua.ac.be/pages/pattern-web
You can even decide what tags to keep:
s = URL('http://www.clips.ua.ac.be').download() s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']}) print s