Having problems understanding BeautifulSoup filtering

Question

Could someone please explain how filtering works in Beautiful Soup? I have the HTML below and am trying to extract specific data from it, but I can't seem to access it. I've tried various approaches, from gathering every element with class="g" to grabbing just the items of interest in that specific div, but I get either None or no output at all.

Each page has a <div class="srg"> containing multiple <div class="g"> divs, and the data I want is inside each <div class="g">. Each of these has multiple nested divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.

I have attempted stepping through the divs and grabbing the relevant fields:

 soup = BeautifulSoup(response.text)   

 main = soup.find('div', {'class': 'srg'})
 result = main.find('div', {'class': 'g'})
 data = result.find('div', {'class': 's'})
 data2 = data.find('div')
 for item in data2:
     site = item.find('cite')
     comment = item.find('span', {'class': 'st'})

 print site
 print comment

I have also attempted stepping into the initial div and finding all:

 soup = BeautifulSoup(response.text) 

 s = soup.findAll('div', {'class': 's'})

 for result in s:
     site = result.find('cite')
     comment = result.find('span', {'class': 'st'})

 print site
 print comment

Test Data

<div class="srg">
    <div class="g">
    <div class="g">
    <div class="g">
    <div class="g">
        <!--m-->
        <div class="rc" data="30">
            <div class="s">
                <div>
                    <div class="f kv _SWb" style="white-space:nowrap">
                        <cite class="_Rm">http://www.url.com.stuff/here</cite>
                    <span class="st">http://www.url.com. Some info on url etc etc
                    </span>
                </div>
            </div>
        </div>
        <!--n-->
    </div>
    <div class="g">
    <div class="g">
    <div class="g">
</div>

UPDATE

After Alecxe's solution I took another stab at getting it right but still got nothing printed, so I took another look at the soup, and it looks different. Previously I was looking at response.text from requests. I can only think that BeautifulSoup modifies response.text, or that I somehow got the sample completely wrong the first time (not sure how). Below is the new sample, based on what I see when I print the soup, followed by my attempt to get to the element data I am after.

<li class="g">
<h3 class="r">
    <a href="/url?q=url">context</a>
</h3>
<div class="s">
    <div class="kv" style="margin-bottom:2px">
        <cite>www.url.com/index.html</cite> #Data I am looking to grab
        <div class="_nBb">‎
            <div style="display:inline"snipped">
                <span class="_O0"></span>
            </div>
            <div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
                <ul>
                    <li class="_Ykb">
                        <a class="_Zkb" href="/url?/search">Cached</a>
                    </li>
                </ul>
            </div>
        </div>
    </div>
    <span class="st">Details about URI </span> #Data I am looking to grab

Update Attempt

I have tried taking Alecxe's approach, with no success so far. Am I going down the right road with this?

soup = BeautifulSoup(response.text)

for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")

    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))

Answer 1:

You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.

Instead, tell BeautifulSoup that you are interested in specific elements nested inside other specific elements. Here we want cite elements located inside div elements with class="g", which are in turn inside the div element with class="srg". The CSS selector div.srg div.g cite finds exactly that:

for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")

    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))

Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
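The assumption that the span exists can be made explicit. Below is a minimal sketch, assuming a well-formed version of the sample markup (the html string, the second example URL, and the results list are illustrative and not from the original page), that skips any cite without a matching span instead of failing on None:

```python
from bs4 import BeautifulSoup

# Illustrative, well-formed markup modelled on the question's sample.
html = """
<div class="srg">
  <div class="g">
    <div class="s">
      <div class="f kv">
        <cite>http://www.url.com.stuff/here</cite>
        <span class="st">Some info on url</span>
      </div>
    </div>
  </div>
  <div class="g">
    <div class="s">
      <div class="f kv">
        <cite>http://www.example.com/no-span</cite>
      </div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    if span is None:
        # No description next to this cite: skip it rather than crash.
        continue
    results.append((cite.get_text(strip=True), span.get_text(strip=True)))

print(results)
```

Only the first entry ends up in results, because the second cite has no sibling span.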

For the provided sample data, it prints:

http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc

The updated code for the updated input data:

for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")

    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
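The switch from find_next_sibling to find_next matters for the updated markup: there the span is no longer a sibling of the cite but a sibling of the cite's parent div. A small sketch of the difference, using a trimmed-down and purely illustrative version of the updated sample:

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative version of the updated sample: the span sits
# next to div.kv, not next to the cite inside it.
html = """
<li class="g">
  <div class="s">
    <div class="kv">
      <cite>www.url.com/index.html</cite>
    </div>
    <span class="st">Details about URI</span>
  </div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
cite = soup.select_one("li.g div.s div.kv cite")

# Looks only at later elements that share the cite's parent.
sibling = cite.find_next_sibling("span", class_="st")

# Looks at every later element in document order.
following = cite.find_next("span", class_="st")

print(sibling)
print(following.get_text(strip=True))
```

Here sibling is None, while following is the span that comes after the cite in the document.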

Also, make sure you are using BeautifulSoup version 4:

pip install --upgrade beautifulsoup4

And the import statement should be:

from bs4 import BeautifulSoup
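A quick way to confirm which version is actually being imported is sketched below. Naming the parser explicitly (html.parser here, chosen only for illustration) is also worth considering: when no parser is given, BeautifulSoup picks the best one installed, and different parsers can build different trees from invalid markup.

```python
import bs4
from bs4 import BeautifulSoup

# The package version string should start with "4".
print(bs4.__version__)

# Pass the parser name explicitly so the same tree is built everywhere.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.get_text())
```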

Answer 2:

First get the div with class srg, then find all the divs with class g inside it, and pull the site and comment text out of each one. Below is the code that works for me:

from bs4 import BeautifulSoup

html = """<div class="srg">
    <div class="g">
    <div class="g">
    <div class="g">
    <div class="g">
        <!--m-->
        <div class="rc" data="30">
            <div class="s">
                <div>
                    <div class="f kv _SWb" style="white-space:nowrap">
                        <cite class="_Rm">http://www.url.com.stuff/here</cite>
                    <span class="st">http://www.url.com. Some info on url etc etc
                    </span>
                </div>
            </div>
        </div>
        <!--n-->
    </div>
    <div class="g">
    <div class="g">
    <div class="g">
</div>"""

soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})

spans = labels.findAll('div', {"class": 'g'})

sites = []
comments = []

for data in spans:
    site = data.find('cite', {'class': '_Rm'})
    comment = data.find('span', {'class': 'st'})
    if site:  # check that site is not None
        text = site.text.strip()
        if text not in sites:  # skip duplicates
            sites.append(text)
    if comment:  # check that comment is not None
        text = comment.text.strip()
        if text not in comments:
            comments.append(text)

print(sites)
print(comments)

Output (as printed under Python 2; Python 3 omits the u prefix)-

[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']

EDIT--

Why your code does not work

For attempt one-

You are using result = main.find('div', {'class': 'g'}). find returns only the first matching element, and that first element has no div with class s inside it, so the rest of the code has nothing to work with.
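The difference between find and find_all can be seen in a small sketch. The markup below is illustrative (well-formed, with an empty first result), not the page from the question:

```python
from bs4 import BeautifulSoup

# Illustrative markup: the first div.g has no div.s inside it.
html = """
<div class="srg">
  <div class="g"><p>no result block here</p></div>
  <div class="g"><div class="s"><cite>www.url.com</cite></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", {"class": "srg"})

# find stops at the first match, which here contains no div.s.
first = main.find("div", {"class": "g"})
print(first.find("div", {"class": "s"}))

# find_all returns every match, so the second div.g is reached too.
all_g = main.find_all("div", {"class": "g"})
cites = [g.find("cite").get_text() for g in all_g if g.find("cite")]
print(cites)
```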

For attempt two-

You print site and comment after the loop has finished, so at most the values from the last iteration are printed. Print inside the for loop instead.

soup = BeautifulSoup(html,'html.parser') 

s = soup.findAll('div', {'class': 's'})

for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    if site and comment:  # skip blocks missing either element
        print(site.text)  # grab text
        print(comment.text)
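To see why printing after the loop loses data, here is a short sketch with two illustrative result blocks (the markup and the names after_loop and pairs are made up for the example). A variable assigned in a loop keeps only its last value, so collecting into a list is one way to keep everything:

```python
from bs4 import BeautifulSoup

# Two illustrative result blocks.
html = """
<div class="s"><cite>first.example</cite><span class="st">first note</span></div>
<div class="s"><cite>second.example</cite><span class="st">second note</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Assigning inside the loop and reading afterwards keeps the last value only.
for result in soup.find_all("div", {"class": "s"}):
    site = result.find("cite")
after_loop = site.get_text()
print(after_loop)

# Collecting inside the loop keeps every pair.
pairs = [
    (r.find("cite").get_text(), r.find("span", {"class": "st"}).get_text())
    for r in soup.find_all("div", {"class": "s"})
]
print(pairs)
```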

Source: https://stackoverflow.com/questions/33944824/having-problems-understanding-beautifulsoup-filtering

Tags

python

html

beautifulsoup