I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or mal
I am using lxml to convert HTML to proper (well-formed) XML:
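from lxml import etree
# lxml's HTML parser is lenient and repairs the markup while parsing
tree = etree.HTML(input_text.replace('\r', ''))
# encoding="unicode" yields str, so the join also works on Python 3
output_text = '\n'.join(etree.tostring(stree, pretty_print=True, method="xml", encoding="unicode")
                        for stree in tree)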
... and removing a lot of 'dangerous elements' in between ...