Question
I've got employer IDs, which can be used to get the business area:
https://www.careerinfonet.org/employ4.asp?emp_id=558742391
The HTML contains the data in tr/td tables:
Business Description: Exporters (Whls)
Primary Industry: Other Miscellaneous Durable Goods Merchant Wholesalers
Related Industry: Sporting and Athletic Goods Manufacturing
So I would like to get
- Exporters (Whls)
- Other Miscellaneous Durable Goods Merchant Wholesalers
- Sporting and Athletic Goods Manufacturing
My example code looks like this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find('td', class_='content')
for td in div.find_all('td'):
    print(td.text)
Answer 1:
Fair warning: this technique is fairly sloppy, but it gets the job done as long as every page you scrape has a similar layout.
Your code already fetches and parses the page correctly. I simply add a check on each td element to see whether its text is "Business Description:", "Primary Industry:", or "Related Industry:". When it matches, the next td in the list holds the value you want, so you can print that.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find('td', class_='content')
# Keep every td in a list so the cell after a label can be reached by index.
lst = div.find_all('td')
for td in lst:
    # Each label cell is followed by the cell that holds its value.
    if td.text == "Business Description:":
        print(lst[lst.index(td)+1].text)
    if td.text == "Primary Industry:":
        print(lst[lst.index(td)+1].text)
    if td.text == "Related Industry:":
        print(lst[lst.index(td)+1].text)
The other small modification I made is putting div.find_all('td') in a list that can then be indexed, to access the element you want.
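The same label-then-value lookup can also be written with enumerate, which avoids calling lst.index(td) for every match (index searches the list again and returns the first equal cell). This is only a rough sketch: extract_fields and sample_cells are hypothetical names, and the sample list stands in for the cell texts of the real page, which I am assuming alternate label and value as in the question.

```python
# Labels taken from the question; the trailing colon is part of the cell text.
LABELS = ("Business Description:", "Primary Industry:", "Related Industry:")


def extract_fields(cells, labels=LABELS):
    """Map each label to the text of the cell that follows it."""
    fields = {}
    # Stop one short of the end so cells[i + 1] always exists.
    for i, text in enumerate(cells[:-1]):
        label = text.strip()
        if label in labels:
            fields[label] = cells[i + 1].strip()
    return fields


# Hypothetical cell texts (an assumption), standing in for
# [td.text for td in div.find_all('td')] on the real page.
sample_cells = [
    "Business Description:", "Exporters (Whls)",
    "Primary Industry:", "Other Miscellaneous Durable Goods Merchant Wholesalers",
    "Related Industry:", "Sporting and Athletic Goods Manufacturing",
]

for label, value in extract_fields(sample_cells).items():
    print(label, value)
```

With the real page you would pass [td.text for td in lst] instead of sample_cells. Stripping whitespace before comparing is a guess at what the page might need, not something I checked against the site.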
Hope it helps!
Source: https://stackoverflow.com/questions/47184202/scrape-americas-career-infonet