, not picking up subsequent paragraphs
Firstly, I am a complete newbie when it comes to Python. However, I have written a piece of code to look at an RSS feed, open the link and extract the text from the article.
This works well for specific articles where the text is all wrapped in `<p>` tags. Since the web is an ugly place, that's not always the case.
Often, websites will have text scattered all over, wrapped in different types of tags rather than a single one.

To find all text nodes in the DOM, you can use `soup.find_all(text=True)`.
This is going to return some undesired text, like the contents of `<style>` and `<script>` tags. You'll need to filter out the text contents of elements you don't want.
blacklist = [
    'style',
    'script',
    # other elements,
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name not in blacklist]
If you are working with a known set of tags, you can take the opposite approach:
whitelist = [
    'p'
]
text_elements = [t for t in soup.find_all(text=True) if t.parent.name in whitelist]
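For anyone who wants to try both approaches end to end, here is a minimal, self-contained sketch. The sample HTML, the `SKIP_PARENTS` set, and the helper names `visible_text` and `text_in_tags` are made up for illustration; they are not part of Beautiful Soup. It uses `string=True`, which is the newer spelling of the `text=True` argument in Beautiful Soup 4.4.0 and later, and it also skips HTML comments, which `find_all` returns as text nodes too.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample page: text lives in a <div>, a <p> and a <span>.
HTML = """
<html>
  <head>
    <title>Sample</title>
    <style>p { color: red; }</style>
    <script>var x = 1;</script>
  </head>
  <body>
    <div>Intro in a div.</div>
    <p>First paragraph.</p>
    <span>Second part in a span.</span>
    <!-- a comment -->
  </body>
</html>
"""

# Assumed set of parents whose text we never want; extend as needed.
SKIP_PARENTS = {"style", "script", "head", "title", "[document]"}


def visible_text(html):
    """Blacklist approach: keep every text node except unwanted ones."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue  # HTML comments are text nodes too
        if node.parent.name in SKIP_PARENTS:
            continue
        text = node.strip()
        if text:  # drop whitespace-only nodes between tags
            chunks.append(text)
    return chunks


def text_in_tags(html, allowed):
    """Whitelist approach: keep text whose direct parent is in `allowed`."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        node.strip()
        for node in soup.find_all(string=True)
        if node.parent.name in allowed and node.strip()
    ]


if __name__ == "__main__":
    print(visible_text(HTML))
    print(text_in_tags(HTML, {"p"}))
```

One thing to watch with the whitelist version: it checks only the direct parent, so in `<p>Hello <b>world</b></p>` the word "world" has `<b>` as its parent and would be skipped unless `b` is also in the whitelist.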