How to select some urls with BeautifulSoup?

扶醉桌前 提交于 2020-01-01 18:48:25

问题


I want to scrape the following information except the last row and "class="Region" row:

...
<td>7</td>
<td bgcolor="" align="left" style=" width:496px"><a class="xnternal" href="http://www.whitecase.com">White and Case</a></td> 
<td bgcolor="" align="left">New York</td> 
<td bgcolor="" align="left" class="Region">N/A</td> 
<td bgcolor="" align="left">1,863</td> 
<td bgcolor="" align="left">565</td> 
<td bgcolor="" align="left">1,133</td> 
<td bgcolor="" align="left">$160,000</td>
<td bgcolor="" align="center"><a class="xnternal" href="/nlj250/firmDetail/7"> View Profile </a></td></tr><tr class="small" bgcolor="#FFFFFF">
...

I tested with this handler:

class TestUrlOpen(webapp.RequestHandler):
    def get(self):
        soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))
        link_list = []
        for a in soup.findAll('a',href=True):
            link_list.append(a["href"])
        self.response.out.write("""<p>link_list: %s</p>""" % link_list)

This works but it also get the "View Profile" link which I don't want:

link_list: [u'http://www.ilrg.com/', u'http://www.ilrg.com/', u'http://www.ilrg.com/nations/', u'http://www.ilrg.com/gov.html', ......]

I can easily remove the "u'http://www.ilrg.com/'" after scraping the site but it would be nice to have a list without it. What is the best way to do this? Thanks.


回答1:


I think this may be what you are looking for. The attrs argument can be helpful for isolating the sections you want.

from BeautifulSoup import BeautifulSoup
import urllib

soup = BeautifulSoup(urllib.urlopen("http://www.ilrg.com/nlj250/"))

rows = soup.findAll(name='tr',attrs={'class':'small'})
for row in rows:
    number = row.find('td').text
    tds = row.findAll(name='td',attrs={'align':'left'})
    link = tds[0].find('a')['href']
    firm = tds[0].text
    office = tds[1].text
    attorneys = tds[3].text
    partners = tds[4].text
    associates = tds[5].text
    salary = tds[6].text
    print number, firm, office, attorneys, partners, associates, salary



回答2:


I would get each tr, in the table with the class=listings. Your search is obviously too broad for the information you want. Because HTML has a structure you can easily get just the table data. This is easier in the long run then getting all hrefs and filtering the ones that you don't want out. BeautifulSoup has plent of documentation on how to do this. http://www.crummy.com/software/BeautifulSoup/documentation.html

not exact code:

for tr in soup.findAll('tr'):
  data_list = tr.children()
  data_list[0].content  # 7
  data_list[1].content  # New York
  data_list[2].content # Region <-- ignore this
  # etc


来源:https://stackoverflow.com/questions/7604272/how-to-select-some-urls-with-beautifulsoup

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!