Escape unescaped characters in XML with Python

前端 未结 4 1416
灰色年华
灰色年华 2020-12-17 00:06

I need to escape special characters in an invalid XML file which is about 5000 lines long. Here\'s an example of the XML that I have to deal with:



        
4条回答
  •  渐次进展
    2020-12-17 00:28

    This answer provides XML sanitizer functions, although they don't escape the unescaped characters, but simply drop them instead.

    Using bs4 with lxml

    The question wondered how to do it with Beautiful Soup. Here is a function which will sanitize a small XML bytes object with it. It was tested with the package requirements beautifulsoup4==4.8.0 and lxml==4.4.0. Note that lxml is required here by bs4.

    import xml.etree.ElementTree
    
    import bs4
    
    
    def sanitize_xml(content: bytes) -> bytes:
        # Ref: https://stackoverflow.com/a/57450722/
        try:
            xml.etree.ElementTree.fromstring(content)
        except xml.etree.ElementTree.ParseError:
            return bs4.BeautifulSoup(content, features='lxml-xml').encode()
        return content  # already valid XML
    

    Using only lxml

    Obviously there is not much of a point in using both bs4 and lxml when this can be done with lxml alone. This lxml==4.4.0 using sanitizer function is essentially derived from the answer by jfs.

    import lxml.etree
    
    
    def sanitize_xml(content: bytes) -> bytes:
        # Ref: https://stackoverflow.com/a/57450722/
        try:
            lxml.etree.fromstring(content)
        except lxml.etree.XMLSyntaxError:
            root = lxml.etree.fromstring(content, parser=lxml.etree.XMLParser(recover=True))
            return lxml.etree.tostring(root)
        return content  # already valid XML
    

提交回复
热议问题