Question
So I have been going to a website to get NDC codes: https://ndclist.com/?s=Solifenacin. I need 10-digit NDC codes, but that page only lists 8-digit NDC codes.
So I click on an underlined 8-digit NDC code, which opens a second page showing the 10-digit codes.
I then copy and paste those two 10-digit NDC codes into an Excel sheet and repeat the process for the rest of the codes on the first page. But this takes a good bit of time, and I was wondering whether there is a Python library that could collect the 10-digit NDC codes for me, or store them in a list that I could print once I'm finished with all the 8-digit codes on the first page. Would BeautifulSoup work, or is there a better library for this?
EDIT: I actually need to go one more level deep, and I've been trying to figure it out but failing. The last level of webpage is just an HTML table, and I only need one element of that table. That last page is what you get after clicking on the second-level codes.
Here is the code that I have, but it returns a tr tag and None objects when I run it.
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}... '.format(link_url2))

        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        # This is the part that fails: indexing [1] gives a single <tr>,
        # and iterating that tag walks its child nodes, so .name comes out
        # as tag names or None instead of the cell text I want.
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)

print('Trospium')
print(all_data)
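For the table step, here is a sketch of what I think should replace the innermost loop, assuming the value I need sits in the first <td> of the second table row (the row and cell indices are guesses I would still have to adjust to the real markup):

# Sketch: grab one cell from the third-level table instead of iterating
# a <tr> tag directly; this goes inside the '#product-packages' loop above,
# so soup3 and all_data are the ones from that loop.
# The row and cell indices are assumptions about the table layout.
rows = soup3.find_all('tr', limit=7)
if len(rows) > 1:
    cells = rows[1].find_all('td')
    if cells:
        value = cells[0].get_text(strip=True)
        print(value)
        all_data.append(value)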
Answer 1:
Yes, BeautifulSoup is ideal in this case. This script will print all 10-digit codes from the page:
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)

# In all_data you have all codes, uncomment to print them:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processing link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09
... and so on.
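Since you mentioned pasting the codes into an Excel sheet: as a side note (just a sketch, and the filename is only an example), you can write all_data to a CSV file with the standard csv module, and Excel will open it directly:

import csv

# Sketch: dump the collected 10-digit codes to a CSV file for Excel.
# 'ndc_codes.csv' is an example filename.
with open('ndc_codes.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['NDC code'])   # header row
    for code in all_data:
        writer.writerow([code])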
EDIT (version where I get the description too):
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))

    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))

# In all_data you have all (code, description) pairs, uncomment to print them:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE
...and so on.
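And if you would rather have the (code, description) pairs in an actual .xlsx file instead of copy-pasting, here is a short sketch with pandas (it assumes pandas and openpyxl are installed; the filename is only an example):

import pandas as pd

# Sketch: all_data holds (code, description) tuples from the loop above.
df = pd.DataFrame(all_data, columns=['NDC code', 'Description'])
df.to_excel('solifenacin_ndc.xlsx', index=False)  # writing .xlsx needs openpyxl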
Source: https://stackoverflow.com/questions/61761961/is-there-a-way-to-parse-data-from-multiple-pages-from-a-parent-webpage