I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or mal
This can be done using the tidy_document function from the tidylib module (provided by the pytidylib package, which wraps the HTML Tidy library).
import tidylib

html = '...'
inputEncoding = 'utf8'

# Option names are HTML Tidy configuration keys.
options = {
    "output-xhtml": True,  # or "output-xml": True
    "quiet": True,
    "show-errors": 0,
    "force-output": True,  # produce output even if errors were found
    "numeric-entities": True,
    "show-warnings": False,
    "input-encoding": inputEncoding,
    "output-encoding": "utf8",
    "indent": False,
    "tidy-mark": False,  # do not add a generator meta tag
    "wrap": 0,  # disable line wrapping
}

# Returns the repaired markup and the messages reported by Tidy.
document, errors = tidylib.tidy_document(html, options=options)
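
As a rough usage sketch, the call above could be wrapped in a small helper that is fed a snippet with unclosed tags. The helper name `repair_html`, the sample markup, and the fallback behaviour (returning the input unchanged when tidy cannot be loaded) are assumptions made for illustration; only `tidylib.tidy_document` comes from the answer itself, and it needs both pytidylib and the native HTML Tidy library installed.

```python
def repair_html(html, options=None):
    """Return (document, errors). Hypothetical helper, not part of tidylib."""
    try:
        import tidylib  # pytidylib; also needs the native HTML Tidy library
        return tidylib.tidy_document(html, options=options)
    except (ImportError, OSError):
        # Assumption: if tidy is unavailable, pass the input through unchanged.
        return html, "tidylib not available"


# Sample input with unclosed <p> and <b> tags (made up for this example).
broken = "<p>An unclosed paragraph <b>with unclosed bold"
document, errors = repair_html(broken, {"output-xhtml": True, "force-output": True})
print(document)
print(errors)
```

Passing the unrepaired input through on failure keeps an aggregation job running, but depending on how the content is used downstream it may be safer to skip or flag such items instead.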