I receive xml strings from an external source that can contains unsanitized user contributed content.
The following xml string gave a ParseError in cElementTre
I have been in stuck with similar problem. Finally figured out the what was the root cause in my particular case. If you read the data from multiple XML files that lie in same folder you will parse also .DS_Store file. Before parsing add this condition
for file in files:
if file.endswith('.xml'):
run_your_code...
This trick helped me as well
After lots of searching through the entire WWW, I only found out that you have to escape certain characters if you want your XML parser to work! Here's how I did it and worked for me:
escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)
And use it like you'd normally do:
ET.XML(escape_illegal_xml_characters(my_xml_string)) #instead of ET.XML(my_xml_string)
What helped me with that error was Juan's answer - https://stackoverflow.com/a/20204635/4433222 But wasn't enough - after struggling I found out that an XML file needs to be saved with UTF-8 without BOM encoding.
The solution wasn't working for "normal" UTF-8.
I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)
import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)
EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
lxml solved the issue, in my case
from lxml import etree
for _, elein etree.iterparse(xml_file, tag='tag_i_wanted', unicode='utf-8'):
print(ele.tag, ele.text)
in another case,
parser = etree.XMLParser(recover=True)
tree = etree.parse(xml_file, parser=parser)
tags_needed = tree.iter('TAG NAME')
Thanks to theeastcoastwest
Python 2.7
See this answer to another question and the according part of the XML spec.
The backspace U+0008 is an invalid character in XML documents. It must be represented as escaped entity 
and cannot occur plainly.
If you need to process this XML snippet, you must replace \x08
in s
before feeding it into an XML parser.