Question
soup.find("tagName", { "id" : "articlebody" })
Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from
soup.prettify()
soup.find("div", { "id" : "articlebody" }) also does not work.
Edit: There is no answer to this post - how do I delete it? I found that BeautifulSoup is not parsing correctly, which probably actually means the page I'm trying to parse isn't properly formatted in SGML or whatever.
Answer 1:
You should post your example document, because the code works fine:
>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>
Finding <div>s inside <div>s works as well:
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>
Answer 2:
To find an element by its id:
div = soup.find(id="articlebody")
Answer 3:
Beautiful Soup 4 supports most CSS selectors through the .select() method, so you can use an id selector such as:
soup.select('#articlebody')
If you need to specify the element's type, you can add a type selector before the id selector:
soup.select('div#articlebody')
The .select() method will return a collection of elements, which means that it would return the same results as the following .find_all() method example:
soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")
If you only want to select a single element, then you could just use the .find() method:
soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
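The equivalences described above can be checked with a small self-contained sketch. It assumes Beautiful Soup 4 is installed as `bs4`; the sample markup and the choice of the built-in `html.parser` are illustrative assumptions, not part of the original post. `.select_one()` is the single-element counterpart of `.select()`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><body>
  <div id="header">Header</div>
  <div><div id="articlebody"><p>Body text</p></div></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# .select() and .find_all() both return a list-like collection of matches.
selected = soup.select("div#articlebody")
found_all = soup.find_all("div", id="articlebody")

# .select_one() and .find() return the first match, or None if there is none.
first_selected = soup.select_one("#articlebody")
first_found = soup.find(id="articlebody")
missing = soup.find(id="no-such-id")

print(len(selected), len(found_all))
print(first_found.get_text(strip=True))
print(missing)
```

Because `.find()` returns None when nothing matches, it is worth checking the result before calling methods on it.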
Answer 4:
I think there is a problem when the 'div' tags are nested too deeply. I am trying to parse some contacts from a Facebook HTML file, and Beautiful Soup is not able to find "div" tags with class "fcontent".
This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.
The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.
This is my code, where I just try to print the number of tags "div" with class "fcontent":
from BeautifulSoup import BeautifulSoup
f = open('/Users/myUserName/Desktop/contacts.html')
soup = BeautifulSoup(f)
divs = soup.findAll('div', attrs={'class':'fcontent'})
print(len(divs))
Answer 5:
Most likely the default Beautiful Soup parser is having trouble with the page. Switch to a different parser, such as 'lxml', and try again.
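A rough way to compare parsers, assuming Beautiful Soup 4 (`bs4`); the markup below is a made-up stand-in for a malformed page, and `lxml` and `html5lib` are optional third-party parsers that may not be installed, so the sketch skips any that are missing:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup; the stray </p> stands in for real-world malformed HTML.
html = '<html><body><div><div id="articlebody">text</p></div></div></body></html>'

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(html, parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not available.
        results[parser] = "not installed"
        continue
    # Record whether this parser's tree contains the target div.
    results[parser] = soup.find("div", id="articlebody") is not None

for parser, outcome in results.items():
    print(parser, "->", outcome)
```

Different parsers repair broken markup differently, so running the same lookup against each one is a quick way to tell whether the parser, rather than the query, is the problem.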
Answer 6:
In the Beautiful Soup source, this line allows divs to be nested within divs, so the concern in lukas' comment wouldn't be valid.
NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']
What I think you need to do is specify the attrs you want, such as:
source.find('div', attrs={'id':'articlebody'})
Answer 7:
Have you tried soup.findAll("div", {"id": "articlebody"})?
It sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs...
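A small sketch of that check, assuming Beautiful Soup 4 (`bs4`), where `find_all` is the current spelling of `findAll`; the duplicated-id markup is invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page that (invalidly) reuses the same id on two divs.
html = '<div id="articlebody">first</div><div id="articlebody">second</div>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns every match, so duplicates become visible.
matches = soup.find_all("div", {"id": "articlebody"})
texts = [m.get_text() for m in matches]

# find only ever returns the first match.
first = soup.find("div", {"id": "articlebody"})

print(len(matches), texts)
print(first.get_text())
```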
Answer 8:
I used:
soup.findAll('tag', attrs={'attrname':"attrvalue"})
That is my syntax for find/findAll. That said, unless there are other optional parameters between the tag and the attribute list, this shouldn't behave any differently.
Answer 9:
Happened to me also while trying to scrape Google.
I ended up using pyquery.
Install:
pip install pyquery
Use:
from pyquery import PyQuery
pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
tag = pq('div#articlebody')
Answer 10:
Here is a code fragment:
soup = BeautifulSoup(open("index.html"))
titleList = soup.findAll('title')
divList = soup.findAll('div', attrs={ "class" : "article story"})
As you can see, I find all <title> tags and then all <div> tags with class="article story".
Source: https://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id