I need to escape special characters in an invalid XML file which is about 5000 lines long. Here\'s an example of the XML that I have to deal with:
This answer provides XML sanitizer functions, although they don't escape the unescaped characters, but simply drop them instead.
The question wondered how to do it with Beautiful Soup. Here is a function which will sanitize a small XML bytes object with it. It was tested with the package requirements beautifulsoup4==4.8.0 and lxml==4.4.0. Note that lxml is required here by bs4.
import xml.etree.ElementTree
import bs4
def sanitize_xml(content: bytes) -> bytes:
# Ref: https://stackoverflow.com/a/57450722/
try:
xml.etree.ElementTree.fromstring(content)
except xml.etree.ElementTree.ParseError:
return bs4.BeautifulSoup(content, features='lxml-xml').encode()
return content # already valid XML
Obviously there is not much of a point in using both bs4 and lxml when this can be done with lxml alone. This lxml==4.4.0 using sanitizer function is essentially derived from the answer by jfs.
import lxml.etree
def sanitize_xml(content: bytes) -> bytes:
# Ref: https://stackoverflow.com/a/57450722/
try:
lxml.etree.fromstring(content)
except lxml.etree.XMLSyntaxError:
root = lxml.etree.fromstring(content, parser=lxml.etree.XMLParser(recover=True))
return lxml.etree.tostring(root)
return content # already valid XML