Extracting text from HTML file using Python

一生所求 2020-11-22 04:05

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

30 answers
  •  耶瑟儿~
    2020-11-22 04:22

    I know there are plenty of answers here already, but I think newspaper3k also deserves a mention. I recently needed to extract the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.

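    from newspaper import Article

    url = 'https://example.com/some-article'  # placeholder; replace with the page to extract
    article = Article(url)
    article.download()   # fetch the page's HTML
    article.parse()      # extract the main content from the HTML
    print(article.text)  # plain text of the article body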
    

    If you already have the HTML downloaded, you can do something like this:

    article = Article('')    # no URL is needed when the HTML is supplied directly
    article.set_html(html)   # html: a string holding the page source
    article.parse()
    print(article.text)
    

    It also has a few NLP features, such as summarizing an article:

    article.nlp()            # must be called after parse()
    print(article.summary)
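
    For comparison, when installing a third-party library is not an option, the "ignore any JavaScript" part of the behaviour described above can be roughly approximated with the standard library's html.parser. This is only a minimal sketch and not how newspaper3k works: the names VisibleTextParser and html_to_text are made up for illustration, and unlike newspaper3k it makes no attempt to drop menus or sidebars.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>.

    Hypothetical helper for illustration; not part of newspaper3k.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def get_text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.get_text()


sample = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>World &amp; more</p></body></html>"
)
print(html_to_text(sample))
```

    The sample prints the heading and paragraph text on separate lines and leaves out the script and style contents. For real articles, where navigation and boilerplate need to be filtered out as well, a dedicated library such as newspaper3k is likely to give cleaner results.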
    
