Scraping text from unordered lists using Beautiful Soup and Python

允我心安 posted on 2021-01-29 07:22:42

Question


I am using Python and Beautiful Soup to scrape information from a web page. I am interested in the following section of the page source:

<ul class="breadcrumb">
<li><a href="/" title="Return to the home page">Home</a><span 
class="sprite icon-delimiter"></span></li>
<li><a href="/VehicleSearch/Search/Mini" title="View our range of Mini 
vehicles">Mini</a><span class="sprite icon-delimiter"></span></li>
<li class="active"><a href="/VehicleSearch/Search/Mini/Countryman" 
title="View our range of Mini Countryman">Countryman</a></li>
</ul>

I want to extract the text of the list items, i.e. 'Home', 'Mini' and 'Countryman' (all of which are links).

My closest attempt so far is:

for ul in soup.findAll('ul', class_='breadcrumb'):
    print(ul.find('a').contents[0])

But this only finds the 'Home' link and not the other two. How can I get all three link texts?


Answer 1:


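Add an inner loop over the links in each list. find() returns only the first matching tag, which is why only 'Home' was printed; findAll() (the older name for find_all()) returns every match: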

for ul in soup.findAll('ul', class_='breadcrumb'):
    for link in ul.findAll('a'):
        print(link.text)
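
For reference, here is a self-contained sketch of the same nested loop, assuming the breadcrumb markup from the question is available as a string. The html variable, the link_texts list and the choice of the built-in html.parser are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Markup copied from the question (assumed to be the relevant part of the page).
html = """
<ul class="breadcrumb">
<li><a href="/" title="Return to the home page">Home</a><span class="sprite icon-delimiter"></span></li>
<li><a href="/VehicleSearch/Search/Mini" title="View our range of Mini vehicles">Mini</a><span class="sprite icon-delimiter"></span></li>
<li class="active"><a href="/VehicleSearch/Search/Mini/Countryman" title="View our range of Mini Countryman">Countryman</a></li>
</ul>
"""

# html.parser ships with Python, so no extra parser package is needed (assumption: any parser works here).
soup = BeautifulSoup(html, "html.parser")

link_texts = []  # hypothetical name, used only for this illustration
for ul in soup.find_all("ul", class_="breadcrumb"):
    # find_all returns every <a> under this <ul>, not just the first one.
    for link in ul.find_all("a"):
        link_texts.append(link.get_text(strip=True))

print(link_texts)  # ['Home', 'Mini', 'Countryman']
```

Collecting the texts in a list rather than printing them one by one makes the result easier to reuse later in the scraper.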



Answer 2:


Alternatively, use a CSS descendant combinator selector to retrieve the li tags inside the element with class breadcrumb:

from bs4 import BeautifulSoup as bs

html = '''
<ul class="breadcrumb">
<li><a href="/" title="Return to the home page">Home</a><span 
class="sprite icon-delimiter"></span></li>
<li><a href="/VehicleSearch/Search/Mini" title="View our range of Mini 
vehicles">Mini</a><span class="sprite icon-delimiter"></span></li>
<li class="active"><a href="/VehicleSearch/Search/Mini/Countryman" 
title="View our range of Mini Countryman">Countryman</a></li>
</ul>
'''
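# The 'lxml' parser requires the lxml package to be installed.
soup = bs(html, 'lxml')
# '.breadcrumb li' matches every <li> that is a descendant of an element with class "breadcrumb".
items = [item.text for item in soup.select('.breadcrumb li')]
print(items)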
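
Since the question asks for the link texts specifically, a variation on the selector above could target the a tags directly and pick up each href at the same time. This is only a sketch: the 'ul.breadcrumb a' selector, the crumbs name and the use of html.parser are assumptions made for illustration, not something the answer above prescribes.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """
<ul class="breadcrumb">
<li><a href="/" title="Return to the home page">Home</a><span class="sprite icon-delimiter"></span></li>
<li><a href="/VehicleSearch/Search/Mini" title="View our range of Mini vehicles">Mini</a><span class="sprite icon-delimiter"></span></li>
<li class="active"><a href="/VehicleSearch/Search/Mini/Countryman" title="View our range of Mini Countryman">Countryman</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every <a> that is a descendant of <ul class="breadcrumb">
# and keep both its visible text and its href attribute.
crumbs = [
    (a.get_text(strip=True), a["href"])
    for a in soup.select("ul.breadcrumb a")
]

for text, href in crumbs:
    print(text, href)
```

Selecting the anchors rather than the li tags keeps the result limited to link text even if a list item later gains other text content.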


Source: https://stackoverflow.com/questions/53918186/scraping-text-from-unordered-lists-using-beautiful-soup-and-python
