How to find all text inside <p> elements in an HTML page using BeautifulSoup

Submitted by 杀马特。学长 韩版系。学妹 on 2020-01-01 19:38:34

Question


I need to find all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.

P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?


Answer 1:


Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the removed tags.

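from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

VALID_TAGS = ['div', 'p']

# value holds the HTML to clean; here, the example from the question
value = '<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'

soup = BeautifulSoup(value, 'html.parser')

# find_all(True) matches every tag in the document
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        # unwrap() drops the tag itself but keeps its children in place
        tag.unwrap()

print(soup.decode())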

Reference
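If only the plain text is wanted rather than cleaned-up markup, a shorter route may be to call get_text() on each paragraph. The following is a minimal sketch, assuming BeautifulSoup 4 (the bs4 package) on Python 3; the helper name paragraph_texts and the Hindi sample sentence are made up for illustration. Note that get_text() returns every string inside the element, so it does not by itself filter out text hidden by CSS.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the text of every <p> element, with nested tags stripped.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    # get_text() joins all strings inside the tag and ignores the markup
    return [p.get_text() for p in soup.find_all("p")]


# Sample input: the paragraph from the question plus a Hindi paragraph
html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक फल है।</p>'
)

for text in paragraph_texts(html):
    print(text)
```

In Python 3 the strings are Unicode throughout, so the Hindi paragraph needs no special handling once the HTML has been decoded correctly.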




Answer 2:


soup.findAll('p')

here is a reference
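To show how the call above might be applied to a file that contains Hindi text, here is a rough sketch, assuming BeautifulSoup 4 on Python 3 and a UTF-8 encoded file. The function name paragraph_texts_from_file and the temporary sample file are hypothetical and exist only to make the example runnable.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def paragraph_texts_from_file(path, encoding="utf-8"):
    """Parse an HTML file and return the text of each <p> element.

    Hypothetical helper; assumes the file is encoded as `encoding`.
    """
    with open(path, encoding=encoding) as f:
        # BeautifulSoup accepts an open file handle as well as a string
        soup = BeautifulSoup(f, "html.parser")
    return [p.get_text() for p in soup.find_all("p")]


# Write a small sample file so the sketch can be run as-is
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    sample.write_text(
        "<html><body><p>नमस्ते <b>दुनिया</b></p></body></html>",
        encoding="utf-8",
    )
    texts = paragraph_texts_from_file(sample)

print(texts)
```

If a file is not UTF-8, pass its actual encoding instead; opening it with the wrong one is a common cause of garbled Hindi characters.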



Source: https://stackoverflow.com/questions/10113702/how-to-find-all-text-inside-p-elements-in-an-html-page-using-beautifulsoup
