Strip HTML tags to get strings in Python

Posted by 感情迁移 on 2019-12-02 00:37:47

Use Beautiful Soup's .strings or .stripped_strings generator:

for string in soup.stripped_strings:
    print(repr(string))

From the docs:

If there's more than one thing inside a tag, you can still look at just the strings by using the .strings generator.

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead.
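To make the difference between the two generators concrete, here is a minimal sketch. The markup and variable names are made up for illustration, and the built-in "html.parser" backend is an assumption (the original answer does not name a parser).

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra whitespace around the text nodes.
html = "<p>  Hello <b> world </b>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node exactly as it appears in the markup.
raw = list(soup.strings)

# .stripped_strings trims each text node and skips whitespace-only ones.
clean = list(soup.stripped_strings)

print(raw)
print(clean)
```

With this input, raw still contains the padded strings and the lone newline, while clean holds only the trimmed words.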

Iterate over the results and read each element's .text attribute:

for element in soup.select(".sidebar li"):
    print(element.text)

Example:

from bs4 import BeautifulSoup


data = """
<body>
    <ul>
        <li class="first">Def Leppard -  Make Love Like A Man<span>Live</span> </li>
        <li>Inxs - Never Tear Us Apart        </li>
    </ul>
</body>
"""

# Name the parser explicitly so bs4 does not have to guess (and warn).
soup = BeautifulSoup(data, "html.parser")
for element in soup.select('li'):
    print(element.text)

prints:

Def Leppard -  Make Love Like A ManLive 
Inxs - Never Tear Us Apart        
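Note that the output above keeps the trailing whitespace and runs "Man" and "Live" together. One possible way to tidy this, not part of the original answer, is get_text() with a separator and strip=True. A small sketch using the same list items (the variable names and the "html.parser" choice are assumptions):

```python
from bs4 import BeautifulSoup

# Same list items as in the example above.
markup = """
<ul>
    <li class="first">Def Leppard -  Make Love Like A Man<span>Live</span> </li>
    <li>Inxs - Never Tear Us Apart        </li>
</ul>
"""

soup = BeautifulSoup(markup, "html.parser")

# strip=True trims each text node and drops empty ones;
# separator=" " puts a single space between the remaining pieces.
lines = [li.get_text(separator=" ", strip=True) for li in soup.select("li")]

for line in lines:
    print(line)
```

Whitespace inside a single text node (such as the double space after the dash) is left untouched; only the edges of each node are trimmed.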

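This example from the documentation gives a very nice one-liner (written here with the bs4 spelling find_all(string=True); older releases used findAll(text=True)):

''.join(BeautifulSoup(source, "html.parser").find_all(string=True))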
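For simple markup, joining every text node this way should give the same result as calling get_text() on the soup. The sketch below checks that on a made-up snippet; the input string and the "html.parser" backend are assumptions, and the two approaches can differ when the document contains comments or script/style content.

```python
from bs4 import BeautifulSoup

# Hypothetical input without comments, scripts, or stylesheets.
source = "<div><p>Hello, <b>world</b>!</p></div>"
soup = BeautifulSoup(source, "html.parser")

# Collect every text node and concatenate them.
joined = "".join(soup.find_all(string=True))

# get_text() with no arguments concatenates the text nodes as well.
text = soup.get_text()

print(joined)
print(text)
```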