Question
This is my first post here. I'm trying to find all the tags with a particular class in this specific HTML page, but I can't get them out. This is the code:
from bs4 import BeautifulSoup
from urllib import urlopen
url = "http://www.jutarnji.hr"
html_doc = urlopen(url).read()
soup = BeautifulSoup(html_doc)
soup.prettify()
soup.find_all("a", {"class":"black"})
find_all returns [], but I can see that there are tags with class="black" in the HTML. Am I missing something?
Thanks, Vedran
Answer 1:
I had the same problem.
Try
soup.findAll("a",{"class":"black"})
instead of
soup.find_all("a",{"class":"black"})
soup.findAll() works well for me.
Answer 2:
The problem here is that the website's class attributes aren't separated from the end of the href attribute value by a space. BeautifulSoup doesn't seem to handle this very well. A reproducible test case is the following:
>>> BeautifulSoup.BeautifulSoup('<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/" class="black">').prettify()
'<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/" class="black">\n</a>'
>>> BeautifulSoup.BeautifulSoup('<a href="http://www.jutarnji.hr/crkva-se-ogradila-od--cjenika--don-mikica--osim-krizme--sve-druge-financijske-obveze-su-neprihvatljive/1018314/"class="black">').prettify()
''
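To see how a given parser tokenizes the two variants above without involving BeautifulSoup at all, here is a minimal sketch that uses only Python 3's standard-library html.parser (one of the backends bs4 can be pointed at). It is not the test case from this answer: the AnchorClassCollector class, the anchor_classes function, and the shortened example.invalid URLs are assumptions made up for illustration. What comes back depends on the parser and its version.

```python
from html.parser import HTMLParser


class AnchorClassCollector(HTMLParser):
    """Hypothetical helper: records the class attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; "class" may be absent.
            self.classes.append(dict(attrs).get("class"))


def anchor_classes(html):
    """Feed one HTML snippet to a fresh parser and return the classes it saw."""
    parser = AnchorClassCollector()
    parser.feed(html)
    parser.close()
    return parser.classes


# Same shape as the test case above, with a placeholder URL (assumption).
with_space = '<a href="http://example.invalid/1/" class="black">x</a>'
without_space = '<a href="http://example.invalid/1/"class="black">x</a>'

print(anchor_classes(with_space))     # ['black']
print(anchor_classes(without_space))  # compare this against the line above
```

If the second list comes back empty, or holds None instead of 'black', the parser in use is not tolerating the missing space, which would match the behaviour described in this answer.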
Answer 3:
It seems to work for me, so I'd say that the problem is with your HTML document.
I tried to run the following:
from bs4 import BeautifulSoup
html_doc = """<html>
<body>
<a class="black">
<b>
text1
</b>
<c>
text2
</c>
</a>
<a class="micio">
</a>
<a class="black">
</a>
</body>
</html>"""
soup = BeautifulSoup(html_doc)
soup.prettify()
print(soup.find_all("a", {"class":"black"}))
And as output I got:
[<a class="black">
<b>
text1
</b>
<c>
text2
</c>
</a>, <a class="black">
</a>]
Edit: As @Puneet has pointed out, the problem might be the lack of whitespace between the attributes in the HTML you're fetching.
For instance, I tried changing the example above to something like:
html_doc = """<html>
<body>
<aclass="black">
# etc.. as before
And I got an empty list as the result: [].
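One likely reason for the empty list in this particular variant: with no whitespace after the tag name, an HTML tokenizer keeps reading the tag name until it hits whitespace, "/" or ">", so the element is no longer an "a" tag at all. The rough sketch below makes that visible using Python 3's standard-library html.parser rather than BeautifulSoup. TagNameLogger and start_tag_names are hypothetical helpers invented for this illustration.

```python
from html.parser import HTMLParser


class TagNameLogger(HTMLParser):
    """Hypothetical helper: remembers the name of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def start_tag_names(html):
    """Parse one snippet with a fresh parser and return the start tag names."""
    logger = TagNameLogger()
    logger.feed(html)
    logger.close()
    return logger.names


good = '<html><body><a class="black"></a></body></html>'
bad = '<html><body><aclass="black"></a></body></html>'

print(start_tag_names(good))  # ['html', 'body', 'a']
print(start_tag_names(bad))   # the third name is no longer plain 'a'
```

Since no start tag named "a" is reported for the second snippet, a search for "a" tags has nothing to match, whatever class is requested.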
Answer 4:
It seems that using lxml solves the problem:
from bs4 import BeautifulSoup
import lxml
from urllib import urlopen
url = "http://www.jutarnji.hr"
html_doc = urlopen(url).read()
soup = BeautifulSoup(html_doc, "lxml")
soup.prettify()
soup.find_all("a", {"class":"black"})
Source: https://stackoverflow.com/questions/9947957/python-beautifulsoup-searching-a-tag