BeautifulSoup getText from between <p>, not picking up subsequent paragraphs


Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article.

2 Answers

天涯浪人

2020-12-23 17:59
This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, it's not always the case.

Often, websites will have text scattered all over, wrapped in different types of tags (e.g. maybe in a <span> or a <div>, or an <li>).

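To find all text nodes in the DOM, you can use soup.find_all(text=True). In Beautiful Soup 4.4.0 and later the same argument is spelled string=True; text is kept as the older name.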

This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.
```
blacklist = [
  'style',
  'script',
  # other elements,
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]
```
If you are working with a known set of tags, you can take the opposite approach:
```
whitelist = [
  'p'
]

text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
```
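As a rough illustration of both filters, here is a minimal, self-contained sketch. The sample HTML, the helper name `collect_text`, and its parameters are made up for this example; it also strips whitespace-only nodes, which the snippets above leave in, and uses `string=True`, the newer spelling of `text=True`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this sketch.
html = """
<html><head><style>p { color: red; }</style></head>
<body>
<p>First paragraph.</p>
<div>Text in a div.</div>
<script>var x = 1;</script>
<ul><li>List item.</li></ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def collect_text(soup, skip=None, keep=None):
    """Return non-empty text nodes, filtered by the parent tag's name.

    skip: parent tag names to exclude (blacklist approach).
    keep: if given, only these parent tag names are included (whitelist approach).
    """
    chunks = []
    for node in soup.find_all(string=True):
        parent = node.parent.name
        if skip is not None and parent in skip:
            continue
        if keep is not None and parent not in keep:
            continue
        text = node.strip()
        if text:  # drop whitespace-only nodes between tags
            chunks.append(text)
    return chunks


blacklisted = collect_text(soup, skip={"style", "script"})
whitelisted = collect_text(soup, keep={"p"})

print(blacklisted)  # ['First paragraph.', 'Text in a div.', 'List item.']
print(whitelisted)  # ['First paragraph.']
```

The blacklist version picks up the text in the `<div>` and `<li>` that a `<p>`-only search would miss, while the whitelist version stays predictable when you already know which tags hold the article.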
小鲜肉

2020-12-23 18:08
You are getting close!
```
# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()
```
Using find (as you've noticed) stops after the first match. You need find_all if you want all the paragraphs. If the pages are formatted consistently (I only looked over one), you could also use something like
```
soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
```
to zero in on the body of the article.
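
To make the difference concrete, here is a small sketch that contrasts find with find_all and then narrows the search to a container first. The sample HTML and the id `article-body` are placeholders invented for the example; on the real site you would use the wrapper id shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a navigation block followed by the article itself.
html = """
<div id="nav"><p>Menu</p></div>
<div id="article-body">
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns only the first <p> in the document.
first_only = soup.find("p").get_text()

# Narrow to the article container, falling back to the whole page
# if the (assumed) id is not present.
body = soup.find("div", {"id": "article-body"})
container = body if body is not None else soup

# find_all() returns every <p> inside the container.
paragraphs = [p.get_text() for p in container.find_all("p")]
article = "\n\n".join(paragraphs)

print(first_only)   # Menu
print(paragraphs)   # ['First paragraph.', 'Second paragraph.']
```

Scoping to the container first also keeps unrelated paragraphs, such as the navigation text here, out of the result.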
