Get a structure of HTML code

Submitted by 对着背影说爱祢 on 2019-12-21 02:42:22

Question


I'm using BeautifulSoup4 and I'm curious whether there is a function that returns the structure (the ordered tags, without their text) of the HTML code.

Here is an example:

<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>

I would like a call such as print(page.structure()) to give:

>>
<html>
<body>
<h1></h1>
<p></p>
</body>
</html>

I tried to find a solution, but without success.

Thanks


Answer 1:


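There is no such function, to my knowledge, but a little recursion should work:

import bs4

def taggify(soup):
    # Walk the direct children; text nodes (NavigableString) are skipped.
    for tag in soup:
        if isinstance(tag, bs4.Tag):
            # Recurse so nested tags are emitted inside their parent.
            yield '<{}>{}</{}>'.format(tag.name, ''.join(taggify(tag)), tag.name)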

demo:

html = '''<html>
 <body>
 <h1>Simple example</h1>
 <p>This is a simple example of html page</p>
 </body>
 </html>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

''.join(taggify(soup))
Out[34]: '<html><body><h1></h1><p></p></body></html>'
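
The result above is a single line, whereas the question shows one tag per line. A minimal sketch of a variant that produces that layout follows. The function name structure_lines and its indent parameter are made up for this illustration and are not part of BeautifulSoup; the sketch assumes the built-in html.parser backend.

```python
import bs4
from bs4 import BeautifulSoup


def structure_lines(node, depth=0, indent=""):
    """Yield one line per tag, skipping all text nodes.

    `indent` is a hypothetical knob: pass "  " to indent by nesting depth.
    """
    for child in node.children:
        if not isinstance(child, bs4.Tag):
            continue  # NavigableString (text, whitespace): not part of the structure
        pad = indent * depth
        has_child_tags = any(isinstance(c, bs4.Tag) for c in child.children)
        if has_child_tags:
            # Container tag: open, recurse into children, then close.
            yield "{}<{}>".format(pad, child.name)
            yield from structure_lines(child, depth + 1, indent)
            yield "{}</{}>".format(pad, child.name)
        else:
            # Leaf tag: keep the open/close pair on one line.
            yield "{}<{}></{}>".format(pad, child.name, child.name)


html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''

soup = BeautifulSoup(html, "html.parser")
outline = "\n".join(structure_lines(soup))
print(outline)
```

Calling structure_lines(soup, indent="  ") instead indents each tag by its nesting depth, which can make deeper documents easier to read.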



Answer 2:


A simple Python regular expression can do what you want:

import re

html = '''<html>
<body>
<h1>Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''

structure = ''.join(re.findall(r'(</?.+?>|\n+?)', html))

This method preserves newline characters.
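
One caveat with the regex approach is that each tag is copied verbatim, so any attributes come along with it. The sketch below shows one possible way to strip them with a second substitution. The class="title" attribute was added to the sample purely for illustration, and the names tokens and bare are arbitrary. It assumes the text itself contains no stray < or > characters, which a regex like this cannot handle reliably.

```python
import re

html = '''<html>
<body>
<h1 class="title">Simple example</h1>
<p>This is a simple example of html page</p>
</body>
</html>'''

# Keep every tag and every newline; everything else (the text) is dropped.
tokens = re.findall(r'</?.+?>|\n', html)
structure = ''.join(tokens)

# Reduce each tag to its name, discarding attributes such as class="title".
bare = re.sub(r'<(/?)(\w+)[^>]*>', r'<\1\2>', structure)
print(bare)
```

For real-world pages, a proper parser is usually the safer choice; the regex is mainly convenient for small, well-formed snippets like this one.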



Source: https://stackoverflow.com/questions/24640959/get-a-structure-of-html-code
