Question
I'm using BeautifulSoup4 and I'm wondering whether there is a function that returns the structure (the ordered tags) of an HTML document.
Here is an example:
<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>
I would like something like print(page.structure()) to output:
<html>
<body>
<h1></h1>
<p></p>
</body>
</html>
I tried to find a solution but had no success.
Thanks
Answer 1:
As far as I know there is no built-in function for this, but a little recursion does the job:
import bs4

def taggify(soup):
    # Skip text nodes; emit each tag with its child tags rendered recursively.
    for tag in soup:
        if isinstance(tag, bs4.Tag):
            yield '<{}>{}</{}>'.format(tag.name, ''.join(taggify(tag)), tag.name)
Demo:
html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
''.join(taggify(soup))
Out[34]: '<html><body><h1></h1><p></p></body></html>'
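If the one-tag-per-line layout from the question matters, the same recursion can also track nesting depth. This is only a minimal sketch, assuming the `html.parser` backend. The name `structure_lines` and its `indent` parameter are made up for illustration and are not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup, Tag


def structure_lines(node, depth=0, indent="  "):
    """Return one line per tag, indented by nesting depth, with text dropped.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    lines = []
    for child in node.children:
        if not isinstance(child, Tag):
            continue  # skip text nodes, comments, etc.
        pad = indent * depth
        inner = structure_lines(child, depth + 1, indent)
        if inner:
            # Tag with nested tags: open, nested lines, close.
            lines.append(f"{pad}<{child.name}>")
            lines.extend(inner)
            lines.append(f"{pad}</{child.name}>")
        else:
            # Tag without nested tags: keep it on a single line.
            lines.append(f"{pad}<{child.name}></{child.name}>")
    return lines


html = """<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
print("\n".join(structure_lines(soup)))
```

Note that this sketch writes every tag as an open/close pair, so void elements such as `<br>` would come out as `<br></br>`.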
Answer 2:
A simple Python regular expression can do what you want:
import re
html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''
structure = ''.join(re.findall(r'(</?.+?>|\n)', html))
This method preserves newline characters, so the result keeps the line layout of the input.
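One caveat with the regex route: it copies each tag verbatim, so attributes (and anything else between `<` and `>`, such as comments) stay in the result. If only bare tag names are wanted and no third-party parser is available, the standard library's `html.parser` can be used instead. A rough sketch, where `TagSkeleton` and `skeleton` are illustrative names and not taken from either answer:

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collect start and end tags, dropping text, comments and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose: only the tag name is kept.
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def skeleton(html):
    """Return the tag-only skeleton of an HTML string."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = '''<html>
<body>
<h1>Simple example</h1>
<p class="intro">This is a simple example of html page</p>
</body>
</html>'''

print(skeleton(html))
```

Unlike the regex, this version does not keep newlines; they arrive as text data, which the sketch ignores.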
Source: https://stackoverflow.com/questions/24640959/get-a-structure-of-html-code