Extract text from html file with BeautifulSoup/Python

Submitted by 对着背影说爱祢 on 2021-02-11 08:28:06

Question


I am trying to extract the text from an HTML file. The HTML file looks like this:

<li class="toclevel-1 tocsection-1">
    <a href="#Baden-Württemberg"><span class="tocnumber">1</span>
        <span class="toctext">Baden-Württemberg</span>
    </a>
</li>
<li class="toclevel-1 tocsection-2">
    <a href="#Bayern">
        <span class="tocnumber">2</span>
        <span class="toctext">Bayern</span>
    </a>
</li>
<li class="toclevel-1 tocsection-3">
    <a href="#Berlin">
        <span class="tocnumber">3</span>
        <span class="toctext">Berlin</span>
    </a>
</li>

I want to extract the text from the last span tag in each list item. For the first item that would be "Baden-Württemberg", the content of the span with class="toctext". I then want to put these values into a Python list.

In Python I tried the following:

names = soup.find_all("span",{"class":"toctext"})

My output is this list:

[<span class="toctext">Baden-Württemberg</span>, <span class="toctext">Bayern</span>, <span class="toctext">Berlin</span>]

So how can I extract only the text between the tags?

Thanks to all


Answer 1:


The find_all method returns a ResultSet, which behaves like a list of Tag objects. Iterate over it and read each tag's text attribute to get the text.

for name in names:
    print(name.text)

This prints:

Baden-Württemberg
Bayern
Berlin
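
Since the question asks for a Python list rather than printed output, here is a minimal self-contained sketch of the same loop that collects the text instead. The html string is a shortened copy of the markup from the question, and state_names is just a name picked for illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
html = """
<li class="toclevel-1 tocsection-1">
    <a href="#Baden-Württemberg"><span class="tocnumber">1</span>
        <span class="toctext">Baden-Württemberg</span>
    </a>
</li>
<li class="toclevel-1 tocsection-2">
    <a href="#Bayern">
        <span class="tocnumber">2</span>
        <span class="toctext">Bayern</span>
    </a>
</li>
"""

# "html.parser" ships with Python, so no extra parser is needed.
soup = BeautifulSoup(html, "html.parser")
names = soup.find_all("span", {"class": "toctext"})

# Collect the text of every matching tag instead of printing it.
state_names = []  # illustrative name, not from the original post
for name in names:
    state_names.append(name.text)

print(state_names)
```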

The built-in Python functions dir() and type() are always handy for inspecting an object.

print(dir(names))

[...,
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort',
 'source']
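
type() can be used the same way. The sketch below uses a one-tag document made up for illustration. It assumes, as is the case in the bs4 releases I know of, that ResultSet is a subclass of list and that its elements are Tag objects:

```python
from bs4 import BeautifulSoup

# Tiny stand-in document, only here to have something to inspect.
soup = BeautifulSoup('<span class="toctext">Berlin</span>', "html.parser")
names = soup.find_all("span", {"class": "toctext"})

print(type(names))              # class of the whole result
print(type(names[0]))           # class of a single element
print(isinstance(names, list))  # whether list operations apply to the result
```

Knowing that each element is a Tag is what points you to attributes such as text.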



Answer 2:


With a list comprehension you could do the following:

names = soup.find_all("span",{"class":"toctext"})
print([x.text for x in names])
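
If the real page has stray whitespace inside the spans, a variant with get_text(strip=True) may be worth a try. This is only a sketch: the html string is made up for illustration, and the CSS selector via select() is an alternative to find_all, not something the original answer used:

```python
from bs4 import BeautifulSoup

# Made-up markup with extra spaces around the text to show the stripping.
html = (
    '<li><a href="#Bayern"><span class="tocnumber">2</span>'
    '<span class="toctext">  Bayern  </span></a></li>'
)
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; get_text(strip=True) trims whitespace.
names = [span.get_text(strip=True) for span in soup.select("span.toctext")]
print(names)
```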


Source: https://stackoverflow.com/questions/56691423/extract-text-from-html-file-with-beautifulsoup-python
