Scrape America's Career InfoNet

Submitted by 喜欢而已 on 2020-01-06 05:49:35

Question


I've got employer IDs, which can be used to look up the business area:

https://www.careerinfonet.org/employ4.asp?emp_id=558742391

The HTML contains the data in tr/td tables:

    Business Description: Exporters (Whls)
    Primary Industry: Other Miscellaneous Durable Goods Merchant Wholesalers
    Related Industry: Sporting and Athletic Goods Manufacturing

So I would like to get

  • Exporters (Whls)
  • Other Miscellaneous Durable Goods Merchant Wholesalers
  • Sporting and Athletic Goods Manufacturing

My example code looks like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')

div = soup.find('td', class_='content')    
for td in div.find_all('td'):
    print(td.text)

Answer 1:


I would like to preface this by saying that this technique is fairly sloppy, but it gets the job done, assuming each page you scrape has a similar layout.

Your code already fetches and parses the page correctly. I simply add a check on every cell to see whether it is the "Business Description:", "Primary Industry:" or "Related Industry:" label. When it is, the value sits in the next cell of the list, so you can read it from there.

import requests
from bs4 import BeautifulSoup

# Label cells whose following cell holds the value we want
LABELS = {"Business Description:", "Primary Industry:", "Related Industry:"}

page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')

div = soup.find('td', class_='content')
lst = div.find_all('td')  # a list, so each cell's neighbour can be indexed
for i, td in enumerate(lst[:-1]):
    # enumerate gives the position directly, avoiding a repeated lst.index(td) search
    if td.text.strip() in LABELS:
        print(lst[i + 1].text.strip())

The other small modification I made is storing the result of div.find_all('td') in a list, which can then be indexed to reach the cell you want.
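
If you would rather collect the values than print them, the same "label, then next cell" idea can be sketched as a small helper that returns a dictionary. This is only an illustration: extract_fields is a hypothetical name, and the cells list below is hand-written sample data standing in for the text of the cells returned by div.find_all('td'), so the snippet runs without touching the network.

```python
def extract_fields(cells, labels):
    """Map each label to the text of the cell that follows it."""
    wanted = set(labels)
    found = {}
    # Pair every cell with its successor; a label's value is the next cell.
    for current, following in zip(cells, cells[1:]):
        if current in wanted and current not in found:
            found[current] = following
    return found


# Hypothetical sample data (an assumption), standing in for
# [td.text.strip() for td in div.find_all('td')]
cells = [
    "Business Description:", "Exporters (Whls)",
    "Primary Industry:", "Other Miscellaneous Durable Goods Merchant Wholesalers",
    "Related Industry:", "Sporting and Athletic Goods Manufacturing",
]
labels = ["Business Description:", "Primary Industry:", "Related Industry:"]

fields = extract_fields(cells, labels)
for label in labels:
    # .get returns None when a page lacks one of the labels
    print(label, fields.get(label))
```

A label that is missing from a page, or that appears as the very last cell, is simply left out of the result, so callers can use fields.get(label) instead of handling an IndexError.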

Hope it helps!



Source: https://stackoverflow.com/questions/47184202/scrape-americas-career-infonet
