, not picking up subsequent paragraphs
Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article.
This works well for specific articles where the text is all wrapped in <p> tags. Since the web is an ugly place, that's not always the case.
Often, websites will have text scattered all over, wrapped in different types of tags (e.g. in a <span>, a <div>, or an <li>).
To find all text nodes in the DOM, you can use soup.find_all(text=True).
This is going to return some undesired text, like the contents of <script> and <style> tags. You'll need to filter out the text contents of elements you don't want.
blacklist = [
    'style',
    'script',
    # other elements,
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]
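To make the blacklist idea concrete, here is a minimal, self-contained sketch. The sample markup is made up for illustration and is not taken from the feed in the question. It uses string=True, which is the newer name for the text=True argument in recent Beautiful Soup versions, and it also drops whitespace-only nodes, which is an extra step not in the snippet above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for illustration only.
SAMPLE_HTML = """
<html>
  <head>
    <style>p { color: red; }</style>
    <script>var tracker = 1;</script>
  </head>
  <body>
    <p>First paragraph.</p>
    <div>Text in a div.</div>
    <ul><li>List item.</li></ul>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

blacklist = {"style", "script"}

# Keep text nodes whose parent tag is not blacklisted, and drop
# whitespace-only nodes left over from the indentation in the markup.
visible_text = [
    t.strip()
    for t in soup.find_all(string=True)
    if t.parent.name not in blacklist and t.strip()
]

print(visible_text)
```

Real pages usually need a longer blacklist (for example 'head', 'title', 'noscript'), so treat the two entries here as a starting point.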
If you are working with a known set of tags, you can take the opposite approach:
whitelist = [
    'p',
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
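One caveat with the whitelist check: t.parent is only the immediate parent, so text inside an inline tag nested in a <p> (such as <b> or <a>) is skipped. The sketch below shows the difference on made-up markup and one possible workaround that checks every ancestor instead. The variable names direct and nested are just for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = "<p>Plain text and <b>bold text</b> together.</p><div>Sidebar</div>"
soup = BeautifulSoup(html, "html.parser")

whitelist = {"p"}

# Direct-parent check: "bold text" is skipped because its parent is <b>.
direct = [t for t in soup.find_all(string=True) if t.parent.name in whitelist]

# Ancestor check: keep a text node if any enclosing tag is whitelisted.
nested = [
    t
    for t in soup.find_all(string=True)
    if any(parent.name in whitelist for parent in t.parents)
]

print("".join(direct))
print("".join(nested))
```

If you only need the paragraph text, calling get_text() on each <p> element gives the same result as the ancestor check with less code.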
You are getting close!
# Find all of the text between paragraph tags and strip out the html
page = soup.find('p').getText()
Using find (as you've noticed) stops after finding one result. You need find_all if you want all the paragraphs. If the pages are formatted consistently (I just looked over one), you could also use something like
soup.find('div',{'id':'ctl00_PlaceHolderMain_RichHtmlField1__ControlWrapper_RichHtmlField'})
to zero in on the body of the article.
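Putting both suggestions together might look roughly like this. The markup and the id "article-body" are placeholders made up for the example; on the real page you would use the id shown above, assuming it stays the same across articles.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; "article-body" is a placeholder id, not the real one.
html = """
<div id="nav"><p>Menu</p></div>
<div id="article-body">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match in the whole document.
first_only = soup.find("p").get_text()

# Narrow down to the article container first, then collect every <p> in it.
# On a real page, check that container is not None before using it.
container = soup.find("div", {"id": "article-body"})
paragraphs = [p.get_text() for p in container.find_all("p")]

article_text = "\n\n".join(paragraphs)
print(article_text)
```

Scoping the search to the container keeps navigation and footer paragraphs out of the result, which find_all('p') on the whole page would otherwise pick up.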