Clean Up HTML in Python

有刺的猬 2020-12-08 16:22

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or mal

5 Answers

醉酒成梦 (OP)

2020-12-08 17:07

This can be done with the tidy_document function in the tidylib module (the PyTidyLib package, which wraps the HTML Tidy library).

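import tidylib

html = '...'  # the markup to clean up
input_encoding = 'utf8'

# HTML Tidy options, passed through to the underlying library.
options = {
    "output-xhtml": True,        # or "output-xml": True
    "quiet": True,
    "show-errors": 0,            # do not list individual errors
    "force-output": True,        # produce output even if errors are found
    "numeric-entities": True,    # write entities as numeric references
    "show-warnings": False,
    "input-encoding": input_encoding,
    "output-encoding": "utf8",
    "indent": False,
    "tidy-mark": False,          # do not add a "generator" meta tag
    "wrap": 0,                   # disable line wrapping
}

# Returns the cleaned document and a string of error/warning messages.
document, errors = tidylib.tidy_document(html, options=options)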
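
If installing HTML Tidy is not an option, the missing-closing-tag case from the question can be roughly handled with the standard library alone. The sketch below is not tidylib and is far less thorough: it only re-emits the markup with html.parser, drops stray closing tags, and closes whatever is still open. The names VOID_TAGS, TagBalancer and close_open_tags are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def format_start(tag, attrs, self_closing=False):
    """Rebuild a start tag from the (name, value) pairs the parser reports."""
    parts = [tag]
    for name, value in attrs:
        if value is None:
            parts.append(name)  # bare attribute, e.g. "disabled"
        else:
            # The parser unescapes attribute values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
    return "<%s%s>" % (" ".join(parts), " /" if self_closing else "")


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emit HTML and track which tags are open."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # output fragments
        self.stack = []  # tags opened but not yet closed

    def handle_starttag(self, tag, attrs):
        self.out.append(format_start(tag, attrs))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.out.append(format_start(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened after the matching start tag, then the tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def close_open_tags(html):
    """Return html with unclosed tags closed, innermost first."""
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    while parser.stack:
        parser.out.append("</%s>" % parser.stack.pop())
    return "".join(parser.out)
```

For example, close_open_tags("<div><p>Hello <b>world</p>") returns "<div><p>Hello <b>world</b></p></div>". Anything beyond unbalanced tags (bad nesting rules, encoding problems, invalid attributes) is better left to Tidy as shown above.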
