Converting HTML to text with Python

一生所求 2020-12-12 17:49

I am trying to convert a block of HTML to plain text using Python.

Input:

9 answers
  • 2020-12-12 18:50

    It's possible to use BeautifulSoup to remove unwanted scripts and similar, though you may need to experiment with a few different sites to make sure you've covered the different types of things you wish to exclude. Try this:

    from requests import get
    from bs4 import BeautifulSoup as BS

    response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
    soup = BS(response.content, "html.parser")
    # find_all() searches the whole tree, not only the direct children of
    # <body>, and returns a list, so removing tags inside the loop is safe.
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    print(soup.body.get_text())
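
For experimenting with which tags to exclude without fetching a live page, the same idea can be wrapped in a small function that works on an HTML string. This is only a sketch: the name `visible_text` and the default list of dropped tags are assumptions for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup


def visible_text(html, drop=("script", "style", "noscript")):
    """Return the text of `html` with the tags listed in `drop` removed.

    `visible_text` and its default `drop` tuple are hypothetical; adjust
    the list after checking the pages you actually need to handle.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so decomposing tags while looping is safe.
    for tag in soup.find_all(list(drop)):
        tag.decompose()
    # strip=True trims each text node and skips the ones that are empty.
    return soup.get_text(separator="\n", strip=True)


sample = (
    "<html><body><h1>Title</h1>"
    "<script>var x = 1;</script>"
    "<p>Hello <b>world</b></p></body></html>"
)
print(visible_text(sample))
```

Passing a different `drop` tuple (for example adding `"nav"` or `"footer"`) is one way to try out exclusions site by site.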
    
  • 2020-12-12 18:51

    soup.get_text() outputs what you want:

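    from bs4 import BeautifulSoup
    # `html` is the input markup as a string. Naming the parser explicitly
    # avoids the "no parser was explicitly specified" warning from bs4.
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.get_text())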
    

    Output:

    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
    Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
    

    To keep newlines:

    print(soup.get_text('\n'))
    

    To be identical to your example, you can replace a newline with two newlines:

    soup.get_text().replace('\n','\n\n')
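
Note that `replace('\n', '\n\n')` doubles every newline, so a run of blank lines in the extracted text grows even longer. If that matters, one possible post-processing step is sketched below. The helper name `normalize_text` is hypothetical, and it assumes each non-empty line should become its own paragraph.

```python
def normalize_text(raw: str, paragraph_sep: str = "\n\n") -> str:
    """Trim every line, drop the empty ones, and rejoin with `paragraph_sep`.

    Hypothetical helper: it treats each non-empty line of the extracted
    text as a separate paragraph.
    """
    lines = (line.strip() for line in raw.splitlines())
    return paragraph_sep.join(line for line in lines if line)


# Works on any string, e.g. the result of soup.get_text().
print(normalize_text("  First paragraph \n\n\n  Second paragraph\n"))
```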
    
  • 2020-12-12 18:53

    There are some nice answers here, and I might as well throw in my solution, which uses only the standard library:

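    from html.parser import HTMLParser

    class _TextParser(HTMLParser):
        """Collects the text nodes of a document, one per line."""

        # Subclassing keeps the change local; assigning to
        # HTMLParser.handle_data would affect every parser in the process.
        def __init__(self):
            super().__init__()
            self.text = ''

        def handle_data(self, data):
            self.text += data + '\n'

    def get_html_text(html: str) -> str:
        parser = _TextParser()
        parser.feed(html)
        parser.close()  # flush any text still buffered at the end of input

        return parser.text.strip()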
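
One thing to be aware of: a parser that records every `handle_data` call also records the contents of `<script>` and `<style>` elements. A possible variation that skips them is sketched below. The names `TextExtractor` and `html_to_text` are made up for this example, and it assumes whitespace-only text nodes are not wanted.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()  # convert_charrefs=True, so &amp; arrives as &
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep the text only when we are outside every skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return "\n".join(parser.parts)


print(html_to_text("<p>Hello</p><script>var x = 1;</script><p>World &amp; co</p>"))
```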
    