ParseError: not well-formed (invalid token) using cElementTree

日久生厌 2020-12-16 11:10

I receive XML strings from an external source that can contain unsanitized user-contributed content.

The following XML string gave a ParseError in cElementTree

13 answers
  • 2020-12-16 11:35

    I was stuck on a similar problem and finally figured out the root cause in my particular case. If you read data from multiple XML files that live in the same folder, you will also end up parsing the .DS_Store file. Add this condition before parsing:

    for file in files:
        if file.endswith('.xml'):  # skips .DS_Store and other non-XML files
            ...  # run your parsing code here
    

    This trick helped me as well
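
    A slightly fuller sketch of the same idea, in case it helps to see which file actually triggers the error. It assumes the files sit directly in one folder; the function name parse_xml_folder and the (parsed, failed) return shape are made up for illustration.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def parse_xml_folder(folder):
    """Parse every *.xml file in `folder`.

    Returns two dicts keyed by file name: parsed roots and error messages.
    """
    parsed, failed = {}, {}
    # Only names ending in .xml match, so .DS_Store is never handed to the parser.
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            parsed[path.name] = ET.parse(str(path)).getroot()
        except ET.ParseError as exc:
            # Keep the message so the malformed file can be identified later.
            failed[path.name] = str(exc)
    return parsed, failed
```

    Collecting the failures instead of stopping at the first one makes it easier to tell a stray non-XML file apart from an XML file that really is malformed.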

  • 2020-12-16 11:36

    After a lot of searching, I found out that certain characters are illegal in XML and have to be stripped before your XML parser will accept the input. Here's how I did it, and it worked for me:

    import re

    # Removes control characters, lone surrogates, and U+FFFE/U+FFFF
    escape_illegal_xml_characters = lambda x: re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', '', x)
    

    And use it like you'd normally do:

    ET.XML(escape_illegal_xml_characters(my_xml_string)) #instead of ET.XML(my_xml_string)
    
  • 2020-12-16 11:39

    Juan's answer (https://stackoverflow.com/a/20204635/4433222) helped me with that error, but it wasn't enough. After some struggling, I found out that the XML file needs to be saved as UTF-8 without a BOM.

    The solution wasn't working for "normal" UTF-8.
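
    If re-saving the file is not an option, one possible workaround (my assumption, not something tested against the original file) is to decode the raw bytes with Python's "utf-8-sig" codec, which drops a leading BOM when there is one. The helper name parse_xml_bytes is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_xml_bytes(raw: bytes) -> ET.Element:
    """Decode UTF-8 bytes, dropping a leading BOM if present, then parse."""
    # "utf-8-sig" behaves like "utf-8" but strips the BOM (EF BB BF) on decode.
    text = raw.decode("utf-8-sig")
    return ET.fromstring(text)
```

    The same codec can be passed to open(), e.g. open(path, encoding="utf-8-sig"), when reading from a file instead of from bytes.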

  • 2020-12-16 11:40

    I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

    import xml.etree.ElementTree as ET
    parser = ET.XMLParser(encoding="utf-8")
    tree = ET.fromstring(xmlstring, parser=parser)
    

    EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...

  • 2020-12-16 11:43

    In my case, lxml solved the issue:

    from lxml import etree
    
    # iterparse takes the input encoding via the encoding keyword
    for _, ele in etree.iterparse(xml_file, tag='tag_i_wanted', encoding='utf-8'):
        print(ele.tag, ele.text)
    

    In another case, a parser created with recover=True worked:

    parser = etree.XMLParser(recover=True)
    tree = etree.parse(xml_file, parser=parser)
    tags_needed = tree.iter('TAG NAME')
    

    Thanks to theeastcoastwest

    Python 2.7

  • 2020-12-16 11:45

    See this answer to another question and the corresponding part of the XML spec.

    The backspace U+0008 is an invalid character in XML documents and cannot occur plainly. (An escaped character reference such as &#8; is accepted only under XML 1.1, not XML 1.0.)

    If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
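
    A minimal sketch of that replacement. The string from the question is not shown here, so s below is a made-up sample with a raw backspace in it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element text containing a raw backspace (U+0008).
s = "<item>foo\x08bar</item>"

try:
    ET.fromstring(s)
    parsed_raw = True
except ET.ParseError:
    # The parser rejects the control character in the unmodified string.
    parsed_raw = False

# Drop the backspace before handing the string to the parser.
cleaned = s.replace("\x08", "")
root = ET.fromstring(cleaned)
```

    Whether to delete the character or swap it for a placeholder depends on whether the backspace carries any meaning in your data; deleting it is just the simplest choice.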
