Excluding unwanted results of findAll using BeautifulSoup

Submitted by 旧街凉风 on 2019-12-04 18:47:18

Question


Using BeautifulSoup, I am aiming to scrape the text associated with this HTML hook:

<p class="review_comment">

So, using the simple code as follows,

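from bs4 import BeautifulSoup

content = page.read()  # page is the already-opened HTTP response
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly
results = soup.find_all("p", "review_comment")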

I am happily parsing the text that is living here:

<p class="review_comment">
    This place is terrible!</p>

The bad news is that roughly once in every 30 matches, soup.find_all also grabs something that I really don't want: a user's old review that they've since updated:

<p class="review_comment">
    It's 1999, and I will always love this place…  
<a href="#" class="show-archived">Read more &raquo;</a></p>

In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.

  • I've been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more &raquo;</a>
  • I've drowned in Regular Expressions-type matching limbo with no success.
  • I can't seem to take advantage of the class="show-archived" attribute.

Any ideas would be greatly appreciated. Thanks in advance.


Answer 1:


Is this what you are seeking?

for p in soup.find_all("p", "review_comment"):
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
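
For reference, here is a minimal, self-contained sketch of the same idea, run against the two sample paragraphs from the question. The wrapping <div>, the variable names (html, current_reviews) and the choice of the "html.parser" backend are assumptions made for this demo, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question; the wrapping <div> is made up for this demo.
html = """
<div>
<p class="review_comment">
    This place is terrible!</p>
<p class="review_comment">
    It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more &raquo;</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the paragraphs that do NOT contain an element with class "show-archived".
current_reviews = [
    p.get_text(strip=True)
    for p in soup.find_all("p", "review_comment")
    if p.find(class_="show-archived") is None
]

print(current_reviews)  # ['This place is terrible!']
```

The filter works because p.find(...) searches only the descendants of each paragraph, so an archived review is recognised by its "Read more" link rather than by its text.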


Source: https://stackoverflow.com/questions/19351541/excluding-unwanted-results-of-findall-using-beautifulsoup
