Access next sibling <li> element with BeautifulSoup

Submitted by 本秂侑毒 on 2019-12-20 23:22:09

Question


I am completely new to web parsing with Python/BeautifulSoup. I have an HTML page that contains (in part) the following markup:

<div id="pages">
    <ul>
        <li class="active"><a href="example.com">Example</a></li>
        <li><a href="example.com">Example</a></li>
        <li><a href="example1.com">Example 1</a></li>
        <li><a href="example2.com">Example 2</a></li>
    </ul>
</div>

I have to visit each link (basically each <li> element) until there are no more <li> tags left. Each time a link is clicked, its corresponding <li> element gets the class 'active'. My code is:

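from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2

# 'http://somepage.com' is a placeholder URL
landingPage = urlopen('http://somepage.com').read()
soup = BeautifulSoup(landingPage, 'html.parser')  # name the parser explicitly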

pageList = soup.find("div", {"id": "pages"})

page = pageList.find("li", {"class": "active"})

This code gives me the first <li> item in the list. My plan is to keep checking whether the next sibling is None. If it is not None, I send an HTTP request to the href attribute of the <a> tag inside that sibling <li>. That takes me to the next page, and so on, until there are no more pages.

But I can't figure out how to get the next_sibling of the page variable given above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation, but somehow couldn't find it. Can someone help please?


Answer 1:


Use find_next_sibling() and be explicit about which sibling element you want to find:

next_li_element = page.find_next_sibling("li")

next_li_element is None when page is the last <li> in the list, so there is no following <li> sibling:

if next_li_element is None:
    # no more pages to go
    pass
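To tie this back to the question, here is a minimal sketch of how the check might be wrapped up. It parses inline copies of the question's markup instead of fetching a real page, and the names next_page_href, FIRST_ACTIVE and LAST_ACTIVE are made up for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Pagination markup from the question, with the first page active.
FIRST_ACTIVE = """
<div id="pages">
    <ul>
        <li class="active"><a href="example.com">Example</a></li>
        <li><a href="example.com">Example</a></li>
        <li><a href="example1.com">Example 1</a></li>
        <li><a href="example2.com">Example 2</a></li>
    </ul>
</div>
"""

# Hypothetical variant: the same list once the last page has become active.
LAST_ACTIVE = FIRST_ACTIVE.replace(' class="active"', "").replace(
    '<li><a href="example2.com">', '<li class="active"><a href="example2.com">'
)


def next_page_href(html):
    """Return the href of the <li> after the active one, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    page_list = soup.find("div", {"id": "pages"})
    if page_list is None:
        return None
    page = page_list.find("li", {"class": "active"})
    if page is None:
        return None
    next_li = page.find_next_sibling("li")
    if next_li is None:
        return None  # the active <li> is the last one: no more pages
    # The href lives on the <a> inside the <li>, not on the <li> itself.
    link = next_li.find("a")
    return link.get("href") if link is not None else None


next_href = next_page_href(FIRST_ACTIVE)
last_href = next_page_href(LAST_ACTIVE)
print(next_href)  # example.com
print(last_href)  # None
```

In a real crawl you would presumably fetch the returned href, parse the new page, and call the function again until it returns None.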



Answer 2:


Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?

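from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 replacement for urllib2

# 'http://somepage.com' is a placeholder URL
landingPage = urlopen('http://somepage.com').read()
soup = BeautifulSoup(landingPage, 'html.parser')  # name the parser explicitly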

pageList = soup.find("div", {"id": "pages"})

page = pageList.find("li", {"class": "active"})
sibling = page.find_next_sibling()
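As for the page.next_sibling.get("href") guess in the question: with indented markup like the one shown, the plain .next_sibling attribute usually lands on the whitespace text node between the two <li> tags, not on the next <li>. The sketch below illustrates the difference on a shortened, inline copy of the list (the HTML string and variable names are only for illustration); it also shows that the href sits on the nested <a>, not on the <li>.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened, illustrative copy of the list from the question.
HTML = """<ul>
    <li class="active"><a href="example.com">Example</a></li>
    <li><a href="example1.com">Example 1</a></li>
</ul>"""

soup = BeautifulSoup(HTML, "html.parser")
page = soup.find("li", {"class": "active"})

# .next_sibling returns whatever node comes next, here the newline and
# indentation between the two <li> tags.
raw_sibling = page.next_sibling
is_text_node = isinstance(raw_sibling, NavigableString)

# find_next_sibling() skips text nodes and returns the next matching tag.
next_li = page.find_next_sibling("li")
is_tag = isinstance(next_li, Tag)

# The URL is an attribute of the <a> inside the <li>.
href = next_li.find("a")["href"]

print(is_text_node, is_tag, href)
```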


Source: https://stackoverflow.com/questions/35141250/access-next-sibling-li-element-with-beautifulsoup
