Excluding unwanted results of findAll using BeautifulSoup

Question

Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:

<p class="review_comment">

So, using the simple code as follows,

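from bs4 import BeautifulSoup

content = page.read()  # page: the response object opened earlier (not shown)
soup = BeautifulSoup(content, "html.parser")
results = soup.find_all("p", "review_comment")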

I am happily parsing the text that is living here:

<p class="review_comment">
    This place is terrible!</p>

The bad news is that roughly once in every 30 matches, soup.find_all also grabs something I really don't want: a user's old review that they have since updated:

<p class="review_comment">
    It's 1999, and I will always love this place…  
<a href="#" class="show-archived">Read more &raquo;</a></p>

In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.

I've been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a> link.
I've drowned in regular-expression matching limbo with no success.
I can't seem to take advantage of the class="show-archived" attribute.

Any ideas would be gratefully appreciated. Thanks in advance.

Answer 1:

Is this what you are seeking?

for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
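
To make the loop above concrete, here is a minimal self-contained sketch of the same filter written as a list comprehension. The sample markup is only modeled on the snippets in the question (it is not the real page), and current_reviews is a hypothetical helper name chosen for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question's snippets (an assumption, not the real page).
html = """
<p class="review_comment">
    This place is terrible!</p>
<p class="review_comment">
    It's 1999, and I will always love this place...
<a href="#" class="show-archived">Read more &raquo;</a></p>
"""

soup = BeautifulSoup(html, "html.parser")


def current_reviews(soup):
    """Return the text of review paragraphs that contain no archived-review link."""
    return [
        p.get_text(strip=True)
        for p in soup.find_all("p", "review_comment")
        # p.find() returns None when no descendant has class "show-archived"
        if p.find(class_="show-archived") is None
    ]


reviews = current_reviews(soup)
print(reviews)  # ['This place is terrible!']
```

The check runs on each matched p rather than inside the find_all() arguments, which is probably why trying to express the exclusion in the find_all() call itself was a dead end: the "Read more" link is a child of the paragraph, not an attribute of it.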

Source: https://stackoverflow.com/questions/19351541/excluding-unwanted-results-of-findall-using-beautifulsoup

Tags

python

beautifulsoup

screen-scraping