Clean Up HTML in Python

有刺的猬 2020-12-08 16:22

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or mal

5 answers
  •  醉酒成梦
    2020-12-08 17:07

    This can be done using the tidy_document function in the tidylib module.

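    import tidylib

    html = '...'
    input_encoding = 'utf8'
    options = {
        "output-xhtml": True,       # emit XHTML; "output-xml" is the XML alternative
        "quiet": True,
        "show-errors": 0,           # maximum number of errors to report
        "force-output": True,       # produce output even when errors are found
        "numeric-entities": True,   # write entities in numeric form
        "show-warnings": False,
        "input-encoding": input_encoding,
        "output-encoding": "utf8",
        "indent": False,
        "tidy-mark": False,         # do not add a Tidy generator meta tag
        "wrap": 0,                  # disable line wrapping
    }
    # tidy_document returns the cleaned markup and a string of diagnostics.
    document, errors = tidylib.tidy_document(html, options=options)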
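
Since the question is about content aggregated from several sources, the call above could be wrapped in a small helper that tidies each source and records which ones produced diagnostics. This is only a sketch: the names `clean_all` and `TIDY_OPTIONS` and the `tidy` parameter are made up for illustration, and the option set is a subset of the one in the answer. The `tidy` parameter defaults to `tidylib.tidy_document` and exists only so the helper can be tried without libtidy installed.

```python
from typing import Callable, Dict, Optional, Tuple

# A subset of the options from the answer above (illustrative choice).
TIDY_OPTIONS = {
    "output-xhtml": True,
    "force-output": True,
    "show-warnings": False,
    "tidy-mark": False,
    "wrap": 0,
}

# Any callable with the shape of tidylib.tidy_document: (html, options=...) -> (document, errors)
TidyFunc = Callable[..., Tuple[str, str]]


def clean_all(
    sources: Dict[str, str], tidy: Optional[TidyFunc] = None
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Tidy each HTML string; return (cleaned, problems), both keyed by source name."""
    if tidy is None:
        # Imported lazily so the module loads even where tidylib is missing.
        import tidylib
        tidy = tidylib.tidy_document

    cleaned: Dict[str, str] = {}
    problems: Dict[str, str] = {}
    for name, html in sources.items():
        document, errors = tidy(html, options=TIDY_OPTIONS)
        cleaned[name] = document
        if errors:  # keep only sources that produced diagnostics
            problems[name] = errors
    return cleaned, problems
```

Calling `clean_all({"feed_a": html_a, "feed_b": html_b})` would then return the tidied markup per source plus a dictionary of whatever diagnostics Tidy reported, which may help spot the sources that most often send broken HTML.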
    
