问题
I try to parse https://www.drugbank.ca/drugs. The idea is to extract all the drug names and some additional informationfor each drug. As you can see each webpage represents a table with drug names and the when we hit the drugname we can access to this drug information. Let's say I will keep the following code to handle the pagination:
import requests
from bs4 import BeautifulSoup
def drug_data():
url = 'https://www.drugbank.ca/drugs/'
while url:
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text ,"lxml")
#data = soup.select('name-head a')
#for link in data:
# href = 'https://www.drugbank.ca/drugs/' + link.get('href')
# pages_data(href)
# next page url
url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
print(url)
if url:
url = 'https://www.drugbank.ca' + url[0].get('href')
else:
break
drug_data()
The issue is that in each page, and for each drug in the table of this page I need to capture : Name. Accession Number. Structured Indications, Generic Prescription Products,
I used the classical request/beautifusoup but can't go deep ..
Some Help please
回答1:
Create function with requests and BeautifulSoup to get data from subpage
import requests
from bs4 import BeautifulSoup
def get_details(url):
print('details:', url)
# get subpage
r = requests.get(url)
soup = BeautifulSoup(r.text ,"lxml")
# get data on subpabe
dts = soup.findAll('dt')
dds = soup.findAll('dd')
# display details
for dt, dd in zip(dts, dds):
print(dt.text)
print(dd.text)
print('---')
print('---------------------------')
def drug_data():
url = 'https://www.drugbank.ca/drugs/'
while url:
print(url)
r = requests.get(url)
soup = BeautifulSoup(r.text ,"lxml")
# get links to subpages
links = soup.select('strong a')
for link in links:
# exeecute function to get subpage
get_details('https://www.drugbank.ca' + link['href'])
# next page url
url = soup.findAll('a', {'class': 'page-link', 'rel': 'next'})
print(url)
if url:
url = 'https://www.drugbank.ca' + url[0].get('href')
else:
break
drug_data()
回答2:
To crawl effectively, you'll want to implement a few measures, such as maintaining a queue of urls to visit and be aware of what urls you have already visited.
Keeping in mind that links can be absolute or relative and that redirects are very likely, you also probably want to construct the urls dynamically rather than string concatenation.
Here is a generic (we usually only want to use example.com on SO) crawling workflow...
from urllib.parse import urljoin, urlparse # python
# from urlparse import urljoin, urlparse # legacy python2
import requests
from bs4 import BeautifulSoup
def process_page(soup):
'''data extraction process'''
pass
def is_external(link, base='example.com'):
'''determine if the link is external to base'''
site = urlparse(link).netloc
return base not in site
def resolve_link(current_location, href):
'''resolves final location of a link including redirects'''
req_loc = urljoin(current_location, href)
response = requests.head(req_loc)
resolved_location = response.url # location after redirects
# if you don't want to visit external links...
if is_external(resolved_location):
return None
return resolved_location
url_queue = ['https://example.com']
visited = set()
while url_queue:
url = url_queue.pop() # removes a url from the queue and assign it to `url`
response = requests.get(url)
current_location = response.url # final location after redirects
visited.add(url) # note that we've visited the given url
visited.add(current_location) # and the final location
soup = BeautifulSoup(response.text, 'lxml')
process_page(soup) # scrape the page
link_tags = soup.find_all('a') # gather additional links
for anchor in link_tags:
href = anchor.get('href')
link_location = resolve_link(current_location, href)
if link_location and link_location not in visited:
url_queue.append(link_location)
来源:https://stackoverflow.com/questions/48116316/deep-parse-with-beautifulsoup