Clean Up HTML in Python

有刺的猬 2020-12-08 16:22

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or mal

5 answers
  •  萌比男神i
    2020-12-08 17:17

    I am using lxml to convert HTML to proper (well-formed) XML:

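    from lxml import etree

    # lxml's lenient HTML parser repairs unclosed or misnested tags while parsing.
    tree = etree.HTML(input_text.replace('\r', ''))
    # encoding="unicode" makes tostring() return str, not bytes, so the join works on Python 3.
    output_text = '\n'.join(etree.tostring(stree, pretty_print=True, method="xml", encoding="unicode")
                            for stree in tree)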
    

    ... and removing a lot of 'dangerous elements' in the middle....
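
The answer does not show how the 'dangerous elements' are removed, so the following is only a minimal sketch of one way that step might look. It uses `etree.strip_elements` from lxml; the `DANGEROUS_TAGS` blocklist and the `clean_html` helper are hypothetical names chosen for illustration, not part of the original answer.

```python
from lxml import etree

# Hypothetical blocklist: the answer does not say which elements it treats as dangerous.
DANGEROUS_TAGS = ("script", "style", "iframe", "object", "embed")


def clean_html(input_text):
    """Parse possibly broken HTML, drop blocklisted elements, return well-formed XML."""
    # The lenient HTML parser closes unclosed tags and builds a complete tree.
    tree = etree.HTML(input_text.replace('\r', ''))
    # strip_elements removes the named elements together with their content.
    # with_tail=False keeps the text that follows each removed element.
    etree.strip_elements(tree, *DANGEROUS_TAGS, with_tail=False)
    # Serialize each top-level child of <html> as XML text (str, not bytes).
    return '\n'.join(
        etree.tostring(stree, pretty_print=True, method="xml", encoding="unicode")
        for stree in tree
    )


if __name__ == "__main__":
    # The <p> and <b> tags are never closed and a <script> sits in the middle.
    print(clean_html('<p>Hello<script>alert(1)</script> world<b>bold'))
```

A blocklist like this is easy to read but easy to leave incomplete (event-handler attributes such as `onclick` are not touched, for instance), so for untrusted input an allowlist-based sanitizer is likely the safer design.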
