Question
I need to find all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example, <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return: Many hundreds of named mango cultivars exist.
P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?
Answer 1:
Here's how you can do it with BeautifulSoup (shown here with bs4 on Python 3). This removes any tag not in VALID_TAGS but keeps the content of the removed tags.
from bs4 import BeautifulSoup

VALID_TAGS = ['div', 'p']
# value is the HTML source as a string
soup = BeautifulSoup(value, 'html.parser')
# find_all(True) visits every tag, not just <p>, so the check below can match
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        tag.unwrap()  # drop the tag itself, keep its contents
print(soup)
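As a rough sketch of the same idea applied to the question's sample, the helper below unwraps every tag outside the whitelist. The name strip_unlisted_tags is hypothetical, and bs4 with the built-in html.parser is assumed. Since bs4 works on Unicode strings, Hindi text should pass through unchanged.

```python
from bs4 import BeautifulSoup

VALID_TAGS = ['div', 'p']

def strip_unlisted_tags(html, valid_tags=VALID_TAGS):
    """Hypothetical helper: drop tags not in valid_tags, keep their contents."""
    soup = BeautifulSoup(html, 'html.parser')
    # find_all(True) matches every tag; the result is a plain list,
    # so unwrapping tags while looping over it is safe.
    for tag in soup.find_all(True):
        if tag.name not in valid_tags:
            tag.unwrap()
    return str(soup)

sample = ('<p>Many hundreds of named mango '
          '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>')
print(strip_unlisted_tags(sample))
print(strip_unlisted_tags('<p>आम एक <b>फल</b> है।</p>'))
```

Note that the allowed tags themselves (here the p wrapper) stay in the output; only the tags outside VALID_TAGS are removed.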
Answer 2:
soup.findAll('p') returns every p element in the document (in bs4 the method is named find_all).
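If only the text is needed, one possible way to finish the job is to call get_text() on each p element. This is a minimal sketch assuming bs4 and its built-in html.parser; paragraph_texts is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

def paragraph_texts(html):
    """Hypothetical helper: return the text of each <p>, with nested tags removed."""
    soup = BeautifulSoup(html, 'html.parser')
    # get_text() joins all text inside the element, including text in child tags
    return [p.get_text() for p in soup.find_all('p')]

html = ('<div><p>Many hundreds of named mango '
        '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
        '<p>आम एक फल है।</p></div>')
for text in paragraph_texts(html):
    print(text)
```

For a file on disk, passing an explicit encoding when opening it, for example open(path, encoding='utf-8'), is one way to make sure Hindi text is decoded correctly; UTF-8 is an assumption about how the files were saved.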
Source: https://stackoverflow.com/questions/10113702/how-to-find-all-text-inside-p-elements-in-an-html-page-using-beautifulsoup