BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

南方客 2020-12-23 17:18

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article.

2 Answers
  • 2020-12-23 17:59

    This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, it's not always the case.

    Often, websites will have text scattered all over, wrapped in different types of tags (e.g. maybe in a <span> or a <div>, or an <li>).

    To find all text nodes in the DOM, you can use soup.find_all(text=True).

    This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.

    blacklist = [
      'style',
      'script',
      # other elements,
    ]
    
    text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]
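
To see the filter in action, here is a minimal sketch run against a made-up page. The sample HTML is an assumption for illustration only, not the article from the question. It uses `string=True`, the newer spelling of `text=True` in Beautiful Soup, and also drops whitespace-only nodes, which the snippet above would keep.

```python
from bs4 import BeautifulSoup

# Made-up sample page (an assumption, for illustration only).
html = """
<html><head><style>p { color: red; }</style></head>
<body>
<div>Intro in a div</div>
<p>First paragraph.</p>
<script>var x = 1;</script>
<ul><li>A list item</li></ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

blacklist = ["style", "script"]

# Keep every text node whose direct parent is not blacklisted,
# and skip nodes that contain only whitespace.
text_elements = [
    t.strip()
    for t in soup.find_all(string=True)
    if t.parent.name not in blacklist and t.strip()
]

print(text_elements)
```

The CSS and JavaScript sources are filtered out, while text from the `<div>`, `<p>` and `<li>` elements survives.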
    

    If you are working with a known set of tags, you can take the opposite approach:

    whitelist = [
      'p'
    ]
    
    text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
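
One caveat with the whitelist: `t.parent` is only the immediate enclosing tag, so text inside inline markup such as `<b>` within a `<p>` is skipped. The sketch below, on a made-up snippet, compares the direct-parent check with a variant that looks at every ancestor via `t.parents`. Treat the ancestor variant as one possible refinement, not as part of the original answer.

```python
from bs4 import BeautifulSoup

# Made-up snippet (an assumption, for illustration only).
html = "<div><p>Plain and <b>bold</b> text.</p><span>Sidebar</span></div>"
soup = BeautifulSoup(html, "html.parser")

whitelist = ["p"]

# Direct-parent check: "bold" is dropped because its parent is <b>, not <p>.
direct = [
    str(t)
    for t in soup.find_all(string=True)
    if t.parent.name in whitelist
]

# Ancestor check: keep a text node if any enclosing tag is whitelisted.
nested = [
    str(t)
    for t in soup.find_all(string=True)
    if any(parent.name in whitelist for parent in t.parents)
]

print("".join(direct))
print("".join(nested))
```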
    
  • 2020-12-23 18:08

    You are getting close!

    # Find all of the text between paragraph tags and strip out the html
    page = soup.find('p').getText()
    

    As you've noticed, find stops after the first match. You need find_all if you want all the paragraphs. If the pages are formatted consistently (I only looked at one), you could also use something like

    soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
    

    to zero in on the body of the article.
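
Putting both suggestions together, a rough sketch might look like this. The sample HTML is invented for illustration; only the `id` value is taken from the answer above, and whether the real pages all use it is an assumption.

```python
from bs4 import BeautifulSoup

# Invented sample page; only the id comes from the answer above.
html = """
<div id="ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<p>Footer text outside the article.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, which is why later paragraphs go missing.
first_only = soup.find("p").get_text()

# Narrow the search to the article container, falling back to the whole page
# if that container is not present.
body = soup.find(
    "div",
    {"id": "ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField"},
)
scope = body if body is not None else soup

# find_all() returns every <p> inside the chosen scope.
paragraphs = [p.get_text(strip=True) for p in scope.find_all("p")]
page = "\n\n".join(paragraphs)

print(page)
```

Scoping to the container first keeps stray paragraphs elsewhere on the page (the footer here) out of the result.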
