I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to do a similar task, extracting the text from articles on the web, and this library has done an excellent job in my tests so far. It ignores the text found in menu items and sidebars, as well as any JavaScript on the page, as the OP requests.
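from newspaper import Article

url = 'https://example.com/some-article'  # placeholder, use your own URL
article = Article(url)
article.download()  # fetch the page HTML
article.parse()     # extract the article content
article.text        # the extracted plain text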
If you already have the HTML files downloaded you can do something like this:
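from newspaper import Article

# html is a string holding the page source you already have
article = Article('')  # no URL needed when supplying the HTML yourself
article.set_html(html)
article.parse()
article.text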
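For files on disk, the html string has to come from somewhere. Here is a minimal sketch, assuming the files are UTF-8 encoded; the helper names read_html and text_from_file and the path page.html are made up for illustration and are not part of newspaper3k:

```python
from pathlib import Path


def read_html(path, encoding="utf-8"):
    """Return the contents of a saved HTML file as a string."""
    # errors="replace" substitutes undecodable bytes instead of raising
    return Path(path).read_text(encoding=encoding, errors="replace")


def text_from_file(path):
    """Extract the article text from a saved HTML file with newspaper3k."""
    # Imported here so read_html stays usable without newspaper3k installed
    from newspaper import Article

    article = Article('')  # no URL needed for local HTML
    article.set_html(read_html(path))
    article.parse()
    return article.text
```

You would then call it as text = text_from_file('page.html'). If your files use a different encoding, pass it to read_html instead of relying on the UTF-8 default.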
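It even has a few NLP features for extracting keywords and summarizing articles. Call nlp() after parse(); it relies on NLTK's punkt tokenizer data being installed:
article.nlp()
article.keywords
article.summary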