Question
I am trying to extract the link (the href value) from the following HTML using BeautifulSoup:
<div class="alignright single">
<a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet</a> » </div>
</div>
My code is as follows:
from bs4 import BeautifulSoup
import urllib2
url1 = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
nextlink = soup.findAll("div", {"class" : "alignright single"})
a = nextlink.find('a')
print a.get('href')
I get the following error; please help:
a = nextlink.find('a')
AttributeError: 'ResultSet' object has no attribute 'find'
Answer 1:
Use .find() if you want to find just one match:
nextlink = soup.find("div", {"class" : "alignright single"})
or loop over all matches:
for nextlink in soup.findAll("div", {"class" : "alignright single"}):
    a = nextlink.find('a')
    print a.get('href')
The latter part can also be expressed as:
a = nextlink.find('a', href=True)
print a['href']
where the href=True part only matches elements that have an href attribute, so you can index with a['href'] instead of calling a.get(), because the attribute is guaranteed to be there. If no <a href="..."> link is found at all, a will be None.
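As a minimal sketch of the difference between the two calls, the snippet below parses an inline copy of the markup from the question instead of fetching the page. It assumes Python 3 with bs4 installed (hence print() as a function and find_all, the bs4 spelling of findAll), and it passes "html.parser" explicitly, which the original code does not.

```python
from bs4 import BeautifulSoup

# Inline copy of the markup from the question, so no network access is needed.
HTML = """
<div class="alignright single">
<a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet</a> » </div>
"""

soup = BeautifulSoup(HTML, "html.parser")
attrs = {"class": "alignright single"}

# find_all() returns a ResultSet (a list subclass of tags); the list itself
# has no .find(), which is what triggered the AttributeError in the question.
matches = soup.find_all("div", attrs)

# find() returns a single Tag (or None), and a Tag can be searched further.
div = soup.find("div", attrs)
link = div.find("a", href=True) if div is not None else None

# href=True guarantees the attribute exists whenever a link was found.
href = link["href"] if link is not None else None
print(href)
```

Guarding both lookups against None keeps the snippet from failing on pages where the div or the link is missing.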
For the given URL in your question, there is only one such link, so .find() is probably most convenient. It may even be possible to just use:
nextlink = soup.find('a', rel='next', href=True)
if nextlink is not None:
    print nextlink['href']
with no need to find the surrounding div. The rel="next" attribute looks sufficient for your specific needs.
As an extra tip: make use of the response headers to tell BeautifulSoup what encoding to use for a page; the urllib2 response object can tell you what, if any, character set the server thinks the HTML page is encoded in:
response = urllib2.urlopen(url1)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
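urllib2 only exists on Python 2. A rough Python 3 equivalent of the same idea might look like the sketch below; it assumes urllib.request in place of urllib2 and response.headers.get_content_charset() in place of response.info().getparam('charset'). The helper names soup_from_bytes and fetch_soup are hypothetical, and the sample bytes at the end are made up so the snippet can be tried without a network connection.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def soup_from_bytes(raw, charset=None):
    """Parse raw HTML bytes, passing along the declared charset if there is one."""
    # from_encoding=None lets BeautifulSoup fall back to its own detection.
    return BeautifulSoup(raw, "html.parser", from_encoding=charset)


def fetch_soup(url):
    """Hypothetical helper: download url and parse it using the header charset."""
    with urlopen(url) as response:
        # Returns None when the Content-Type header names no charset.
        charset = response.headers.get_content_charset()
        return soup_from_bytes(response.read(), charset)


# Offline check with made-up Latin-1 bytes standing in for a server response.
raw = '<a href="/caf\u00e9" rel="next">caf\u00e9</a>'.encode("latin-1")
soup = soup_from_bytes(raw, "latin-1")
link = soup.find("a", rel="next", href=True)
print(link["href"], link.get_text())
```

Splitting the parsing from the download keeps the encoding logic testable without touching the network; fetch_soup(url1) would be the call that corresponds to the two lines above.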
Quick demo of all the parts:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> response = urllib2.urlopen('http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/')
>>> soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
>>> soup.find('a', rel='next', href=True)['href']
u'http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/'
Answer 2:
findAll returns a list of matches, so you need to take its first element. Try this instead:
nextlink = soup.findAll("div", {"class" : "alignright single"})[0]
Or, since there's only one match, the find method ought to work as well:
nextlink = soup.find("div", {"class" : "alignright single"})
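The two variants behave differently when nothing matches, which may matter on a page that has no "next" link. The sketch below illustrates this on a made-up fragment without the target div; it assumes Python 3 with bs4 and the explicit "html.parser" argument, neither of which comes from the original answers.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that lacks the "alignright single" div.
soup = BeautifulSoup("<div class='alignleft single'></div>", "html.parser")
attrs = {"class": "alignright single"}

# find() signals "no match" by returning None.
first = soup.find("div", attrs)

# find_all() returns an empty ResultSet, so indexing it with [0] raises IndexError.
try:
    soup.find_all("div", attrs)[0]
    index_failed = False
except IndexError:
    index_failed = True

print(first, index_failed)
```

With find() a simple "is not None" check is enough, whereas the [0] form needs either a length check or a try/except.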
Source: https://stackoverflow.com/questions/20469596/extract-link-from-url-using-beautifulsoup