Clean Up HTML in Python

有刺的猬 2020-12-08 16:22

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or mal

5 Answers

萌比男神i (OP)

2020-12-08 17:17
I am using lxml to convert HTML to proper (well-formed) XML:
```
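from lxml import etree

# etree.HTML uses lxml's recovering HTML parser, so unclosed tags get repaired
tree = etree.HTML(input_text.replace('\r', ''))
# encoding="unicode" makes tostring return str; the default is bytes,
# which '\n'.join() rejects on Python 3
output_text = '\n'.join(etree.tostring(stree, pretty_print=True, method="xml", encoding="unicode")
                        for stree in tree)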
```
... and doing a lot of removing of 'dangerous elements' in the middle.
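
The removal step is not shown above, so here is a minimal sketch of what it might look like. The tag blocklist and the `clean_html` helper name are my own assumptions for illustration, not what the answer actually uses; the only lxml calls involved are `etree.HTML`, `etree.strip_elements` and `etree.tostring`.

```python
from lxml import etree

# Hypothetical blocklist: the answer does not say which elements it removes.
DANGEROUS_TAGS = ("script", "style", "iframe", "object", "embed")


def clean_html(input_text):
    """Parse possibly broken HTML, drop blocklisted elements, return XML text."""
    # The recovering HTML parser closes any tags left open in the input.
    tree = etree.HTML(input_text.replace("\r", ""))
    # Remove the listed elements together with their content;
    # with_tail=False keeps the text that follows a removed element.
    etree.strip_elements(tree, *DANGEROUS_TAGS, with_tail=False)
    return etree.tostring(tree, method="xml", encoding="unicode")


# Input with unclosed <p> and <b> tags and an inline script.
cleaned = clean_html("<p>Hello <b>world<script>alert(1)</script> again")
print(cleaned)
```

Note that a tag blocklist like this only drops whole elements. Attributes such as `onclick` are left untouched, so it should not be treated as a complete sanitizer for untrusted input.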