Question
I am trying to scrape a little chunk of information from a site, but it keeps printing "None" as if the title (or any tag I replace it with) doesn't exist.
The project: gather the meta-data of WordPress plugins. Approx. 50 plugins are of interest, but the challenge is that I want to fetch the meta-data of all existing plugins and subsequently filter out those with the newest timestamp, i.e. the plugins that were updated most recently. It is all about actuality...
https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database ....and so on and so forth.
We have the following set of meta-data for each WordPress plugin:
Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database, member sign-up form, volunteer
Last updated: 19 hours ago
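For illustration, that chunk could be held as one plain Python dict per plugin; the keys mirror the labels above and the values are just the example data (the structure itself is an assumption, not something the site returns):

# one record per plugin, using the fields listed above
plugin_meta = {
    "Version": "1.9.5.12",
    "Active installations": "10,000+",
    "WordPress Version": "5.0 or higher",
    "Tested up to": "5.4",
    "PHP Version": "5.6 or higher",
    "Tags": ["database", "member sign-up form", "volunteer"],
    "Last updated": "19 hours ago",
}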
The project consists of two parts: the looping part (which seems to be pretty straightforward) and the parser part, where I have some issues, see below. I'm trying to loop through an array of URLs and scrape the data shown above from a list of WordPress plugins. See my loop below:
from bs4 import BeautifulSoup
import requests

# array of URLs to loop through, will be larger once I get the loop working correctly
plugins = ['https://wordpress.org/plugins/wp-job-manager', 'https://wordpress.org/plugins/ninja-forms']

This can be done like so, for each plugin URL in the loop:

for url in plugins:
    page_soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    # the <div class="plugin-meta"> block holds the Version / installations / "Tested up to" entries
    ttt = page_soup.find("div", {"class": "plugin-meta"})
    # take every second <li>, excluding the last one
    text_nodes = [node.text.strip() for node in ttt.ul.findChildren('li')[:-1:2]]
The output of text_nodes:
['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 ']
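Regarding the "None" issue mentioned at the top: BeautifulSoup's find() returns None whenever the selector does not match anything, so it pays to guard the lookup. A defensive variant of the snippet above (not part of my original code), for pages where the markup differs or the request is blocked:

meta_div = page_soup.find("div", {"class": "plugin-meta"})
if meta_div is None or meta_div.ul is None:
    # selector did not match: wrong URL, changed markup, or a blocked request
    print(f"no plugin-meta block found on {url}")
else:
    text_nodes = [node.text.strip()
                  for node in meta_div.ul.findChildren('li')[:-1:2]]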
But if we want to fetch the data of all WordPress plugins and subsequently sort them to show, let us say, the 50 most recently updated plugins, this would be an interesting task:
First of all we need to fetch the URLs.
Then we fetch the information and have to sort by the newest timestamp, i.e. the plugin that was updated most recently.
Finally, list the 50 newest items, that is, the 50 plugins that were updated most recently.
Challenge: how to avoid overloading the RAM while fetching all the URLs (see "How to extract all URLs in a website using BeautifulSoup" for interesting insights, approaches and ideas).
At the moment I am trying to figure out how to fetch all the URLs and parse them:
a. how to fetch the meta-data of each plugin,
b. how to sort by the newest updates,
c. afterwards how to pick out the 50 newest (see the sketch after this list).
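One way to handle (b) and (c) without keeping every plugin in memory is to turn the "Last updated: 19 hours ago" strings into an approximate age and only ever keep the 50 best candidates, e.g. with heapq.nsmallest over a generator of meta-data records. This is only a sketch under the assumption that the relative times follow the "N unit(s) ago" pattern shown above; parse_last_updated() and the record layout are hypothetical, not part of the existing code:

import heapq
import re

# rough conversion factors from "N <unit>s ago" to minutes
_UNITS = {"minute": 1, "hour": 60, "day": 1440, "week": 10080,
          "month": 43200, "year": 525600}

def parse_last_updated(text):
    # "Last updated: 19 hours ago" -> approximate age in minutes (smaller = fresher)
    m = re.search(r"(\d+)\s+(minute|hour|day|week|month|year)s?\s+ago", text)
    if not m:
        return float("inf")  # unknown format: treat as very old
    return int(m.group(1)) * _UNITS[m.group(2)]

def newest(meta_rows, n=50):
    # meta_rows can be any iterable or generator of dicts with a "Last updated" key;
    # heapq.nsmallest only ever holds n candidates, so RAM stays bounded
    return heapq.nsmallest(
        n, meta_rows, key=lambda row: parse_last_updated(row["Last updated"]))

newest() can then be fed by a generator that scrapes one plugin page at a time, so the full plugin list never has to be materialized in memory.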
Answer 1:
import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    # collect the plugin links from one page of the "popular" listing
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""] + [f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    # scrape the title plus the meta list (Version, Last updated, ...) of one plugin page
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())
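What this answer does not do yet is the actual ranking by freshness that the question asks for. A minimal follow-up sketch, assuming each future.result() row contains one entry starting with "Last updated" and reusing a relative-time parser such as the parse_last_updated() sketched in the question above (both are assumptions, not part of the answer):

rows = [future.result() for future in futures1]

def age_of(row):
    # pick the "Last updated: ..." entry out of a row; rows without one sort last
    for field in row:
        if field.startswith("Last updated"):
            return parse_last_updated(field)  # hypothetical helper, see above
    return float("inf")

# the 50 most recently updated plugins
for row in sorted(rows, key=age_of)[:50]:
    print(row)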
Source: https://stackoverflow.com/questions/61106309/fetching-multiple-urls-with-beautifulsoup-gathering-meta-data-in-wp-plugins