Get contents by class names using Beautiful Soup

问题

Using Beautiful Soup module, how can I get data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

soup.class['feeditemcontent cxfeeditemcontent']

or:

soup.find_all('class')

This is the HTML source:

<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
         <div class="feeditembody">
         <span>The actual data is some where here</span>
         </div>
     </div>
 </div>

and this is the Python code:

 from BeautifulSoup import BeautifulSoup
 html_doc = open('home.jsp.html', 'r')

 soup = BeautifulSoup(html_doc)
 class="feeditemcontent cxfeeditemcontent"

回答1:

Try this, maybe it's too much for this simple thing but it works:

def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10

回答2:

Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, meaning jadkik94's solution can be simplified:

from bs4 import BeautifulSoup                                                   

def match_class(target):                                                        
    def do_match(tag):                                                          
        classes = tag.get('class', [])                                          
        return all(c in classes for c in target)                                
    return do_match                                                             

soup = BeautifulSoup(html)                                                      
print soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"]))

回答3:

soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")

So, If I want to get all div tags of class header <div class="header"> from stackoverflow.com, an example with BeautifulSoup would be something like:

from bs4 import BeautifulSoup as bs
import requests 

url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html)

tags = soup.findAll("div", class_="header")

It is already in bs4 documentation.

回答4:

from BeautifulSoup import BeautifulSoup 
f = open('a.htm')
soup = BeautifulSoup(f) 
list = soup.findAll('div', attrs={'id':'abc def'})
print list

回答5:

soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})

回答6:

Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304

As you can see, Beautiful soup can not really understand class="a b" as two classes a and b.

However, as it appears in the first comment there, a simple regexp should suffice. In your case:

soup = BeautifulSoup(html_doc)
for x in soup.findAll("div",{"class":re.compile(r"\bfeeditemcontent\b")}):
    print "result: ",x

Note: That has been fixed in the recent beta. I haven't gone through the docs of the recent versions, may be you could do that. Or if you want to get it working using the older version, you could use the above.

来源：https://stackoverflow.com/questions/11331071/get-contents-by-class-names-using-beautiful-soup

标签

python

beautifulsoup