BeautifulSoup getText from between <p>, not picking up subsequent paragraphs

南方客 2020-12-23 17:18

Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article.

2 answers

天涯浪人 (original poster)

2020-12-23 17:59

This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, that's not always the case.
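For reference, here is a minimal sketch of the paragraph-based approach, assuming the article body really is made up of <p> tags. The markup and variable names are made up for illustration and are not the asker's code. It also shows one common reason later paragraphs go missing: find() returns only the first match, while find_all() returns every match.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup, used only for illustration.
html = """
<div class="article">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <p>Third paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first matching tag, so only one paragraph comes back.
first_only = soup.find("p").get_text()

# find_all() returns every <p>, so the subsequent paragraphs are included.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
article_text = "\n".join(paragraphs)

print(first_only)
print(article_text)
```

If the real page nests its paragraphs inside a specific container, narrowing the search to that container first would likely cut down on unrelated text.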

Often, websites will have text scattered all over, wrapped in different types of tags rather than only in <p>.

To find all text nodes in the DOM, you can use soup.find_all(text=True).

This is going to return some undesired text, like the contents of <style> and <script> tags, so the results need filtering.
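One way to do that filtering is to skip text nodes whose parent tag is not visible content. The sketch below is only an illustration: the sample page, the SKIP_PARENTS set, and the visible_text helper are all assumptions, and the set would probably need tuning for a real site. Newer BeautifulSoup releases prefer string=True over text=True; both select text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical page, used only for illustration.
html = """
<html>
  <head>
    <title>Example</title>
    <style>p { color: red; }</style>
    <script>var x = 1;</script>
  </head>
  <body>
    <div>Intro in a div.</div>
    <p>Body in a paragraph.</p>
    <ul><li>Item in a list.</li></ul>
  </body>
</html>
"""

# Assumed set of parent tags whose text is not part of the article body.
SKIP_PARENTS = {"style", "script", "head", "title", "meta", "noscript", "[document]"}


def visible_text(markup):
    """Return the non-empty text nodes that are not inside a skipped tag."""
    soup = BeautifulSoup(markup, "html.parser")
    chunks = []
    # string=True matches every text node (older code spells this text=True).
    for node in soup.find_all(string=True):
        if node.parent.name in SKIP_PARENTS:
            continue
        text = node.strip()
        if text:  # drop whitespace-only nodes between tags
            chunks.append(text)
    return chunks


result = visible_text(html)
print(result)
```

Joining the returned chunks with spaces or newlines gives a single string for the article text.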
