Python BeautifulSoup Paragraph Text only

时光总嘲笑我的痴心妄想 submitted on 2021-01-28 14:50:31

Question


I am very new to anything web-scraping related, and as I understand it, Requests and BeautifulSoup are the way to go. I want to write a program that emails me just one paragraph from a given link every couple of hours (I'm trying a new way to read blogs through the day). Say this particular link, 'https://fs.blog/mental-models/', has a paragraph on each of several different models.

from bs4 import BeautifulSoup
import re
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

Now soup has a wall of markup before the paragraph text begins: <p> this is what I want to read </p>

soup.title.string works perfectly fine, but I don't know how to move ahead from here. Any directions?

Thanks


Answer 1:


Call find_all('p') to get all the p tags, loop over them, and use .text to get their text.

Do this inside the div with the class rte rather than on the whole soup, since you don't want the footer paragraphs:

from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'    
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

divTag = soup.find_all("div", {"class": "rte"})
for div in divTag:
    pTags = div.find_all('p')
    for p in pTags[:-2]:  # to trim the last two irrelevant looking lines
        print(p.text)

OUTPUT:

Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).
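
The same idea can also be written with a single CSS selector. This is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for the downloaded page (the real markup may differ), and article_paragraphs is a hypothetical helper name, not something from bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.content; the real page markup may differ.
SAMPLE_HTML = """
<html><body>
  <div class="rte">
    <p>First model.</p>
    <p>Second model.</p>
    <p></p>
  </div>
  <footer><p>Footer text.</p></footer>
</body></html>
"""

def article_paragraphs(html, container_selector="div.rte"):
    """Return the non-empty paragraph texts found inside the container."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    # "div.rte p" matches every <p> nested anywhere inside <div class="rte">
    for p in soup.select(f"{container_selector} p"):
        text = p.get_text(strip=True)
        if text:  # skip empty <p></p> tags
            paragraphs.append(text)
    return paragraphs

print(article_paragraphs(SAMPLE_HTML))
```

The footer paragraph is left out because it is not inside the rte div, so no slicing with [:-2] is needed in this sample.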

 




Answer 2:


If you want the text of all the p tags, you can loop over them using the find_all method:

from bs4 import BeautifulSoup
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# find_all('p') returns every <p> tag in the document
data = soup.find_all('p')
for p in data:
    text = p.get_text()
    print(text)

EDIT:

Here is the code to collect them separately in a list. You can then loop over the result list to remove empty strings, unwanted characters like \n, etc.

from bs4 import BeautifulSoup
import re
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())

print(result)
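
A minimal sketch of that clean-up step, using only the standard library. The raw list below is made-up sample data standing in for result, and clean_paragraphs is a hypothetical helper name.

```python
def clean_paragraphs(raw_texts):
    """Collapse whitespace in each string and drop the ones that end up empty."""
    cleaned = []
    for text in raw_texts:
        # str.split() with no arguments splits on any run of whitespace,
        # including \n, \t and non-breaking spaces (\xa0)
        collapsed = " ".join(text.split())
        if collapsed:
            cleaned.append(collapsed)
    return cleaned

# Made-up sample standing in for the scraped `result` list
raw = [
    "\n  Mental models are how we\nunderstand the world. ",
    "",
    "\xa0",
    "5. Mutually Assured Destruction\n",
]
print(clean_paragraphs(raw))
```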



Answer 3:


Here is the solution:

from bs4 import BeautifulSoup
import requests
import time

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')

result = []

for p in data:
    result.append(p.get_text())

# Print the list every 60 seconds using only the standard library
# (stop with Ctrl+C)
while True:
    print(result)
    time.sleep(60)
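
For the original goal of emailing one paragraph at a time, something like the following might be a starting point. It is only a sketch under assumptions: the paragraphs list is sample data, the addresses are placeholders, and paragraph_for_slot and build_email are hypothetical helper names. It builds the message but does not send it.

```python
from email.message import EmailMessage

def paragraph_for_slot(paragraphs, slot):
    """Pick the paragraph for a given slot number, wrapping around at the end."""
    return paragraphs[slot % len(paragraphs)]

def build_email(paragraph, sender, recipient, subject="Paragraph of the hour"):
    """Wrap one paragraph in a plain-text email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(paragraph)
    return msg

# Sample data standing in for the scraped paragraphs
paragraphs = ["First model.", "Second model.", "Third model."]

# Slot 4 wraps around to index 1 (4 % 3)
msg = build_email(
    paragraph_for_slot(paragraphs, 4),
    sender="me@example.com",      # placeholder address
    recipient="me@example.com",   # placeholder address
)
print(msg["Subject"])
print(msg.get_content())
```

To actually send it you would hand msg to smtplib.SMTP(...).send_message(msg) with your own mail server details, and increase the slot number on each run (for example from a cron job every couple of hours).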


Source: https://stackoverflow.com/questions/55217889/python-beautifulsoup-paragraph-text-only
