ParseError: not well-formed (invalid token) using cElementTree

日久生厌 2020-12-16 11:10

I receive XML strings from an external source that can contain unsanitized, user-contributed content.

The following XML string raised a ParseError in cElementTree

13 answers
  • 2020-12-16 11:27

    The parser seems to be complaining about \x08 (a backspace control character); you will need to escape or strip it.

    Edit:

    Alternatively, you can have the lxml parser ignore such errors by passing recover=True:

    from lxml import etree
    parser = etree.XMLParser(recover=True)
    etree.fromstring(xmlstring, parser=parser)
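
    If lxml is not an option, one possibility is to strip the characters that XML 1.0 forbids before handing the string to the standard-library parser. The following is a minimal sketch, assuming the input is already a decoded str. The helper name strip_invalid_xml_chars is hypothetical, and note that it silently drops the offending characters rather than escaping them:

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production is removed.
# Tab (U+0009), line feed (U+000A) and carriage return (U+000D) are allowed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 does not allow removed."""
    return _INVALID_XML_CHARS.sub("", text)

# Made-up sample input containing a backspace character.
raw = "<root><msg>bad\x08char</msg></root>"
root = ET.fromstring(strip_invalid_xml_chars(raw))
print(root.find("msg").text)  # prints badchar
```

    Whether dropping the characters is acceptable depends on the data; if they carry meaning, replacing them with a visible placeholder may be the safer choice.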
    
  • 2020-12-16 11:28

    This is most probably an encoding error. For example, I had an XML file encoded as UTF-8 with a BOM (checked from the Notepad++ Encoding menu) and got a similar error message.

    The workaround (Python 3.6):

    import io
    from xml.etree import ElementTree as ET
    
    # 'file' is the path to your XML file.
    # The 'utf-8-sig' codec drops a leading BOM while decoding.
    with io.open(file, 'r', encoding='utf-8-sig') as f:
        contents = f.read()
        tree = ET.fromstring(contents)
    

    Check the encoding of your XML file. If it uses a different encoding, change 'utf-8-sig' accordingly.
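
    To see what the 'utf-8-sig' codec does without needing a file on disk, here is a small sketch that works on in-memory bytes. The function name decode_xml_bytes and the sample data are made up for illustration:

```python
import codecs
import xml.etree.ElementTree as ET

def decode_xml_bytes(data):
    """Decode UTF-8 bytes to str, dropping a leading BOM if one is present."""
    return data.decode("utf-8-sig")

# Sample document that starts with the UTF-8 byte order mark (EF BB BF).
data = codecs.BOM_UTF8 + b"<root><item>1</item></root>"

text = decode_xml_bytes(data)
print(text.startswith("\ufeff"))  # prints False

root = ET.fromstring(text)
print(root.find("item").text)  # prints 1
```

    The same codec also decodes input that has no BOM, so it is a reasonable default when you are not sure whether one is present.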

  • 2020-12-16 11:29

    None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:

    from bs4 import BeautifulSoup
    
    with open("data/myfile.xml") as fp:
        soup = BeautifulSoup(fp, 'xml')
    

    Then you can search the tree as:

    soup.find_all('mytag')
    
  • 2020-12-16 11:29

    I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value of a single XML node, I gave in and wrote my own function to do so:

    import re
    
    def ParseXmlTagContents(source, tag, tagContentsRegex):
        # Build the literal open/close tags; the tag must carry no attributes.
        openTagString = "<"+tag+">"
        closeTagString = "</"+tag+">"
        found = re.search(openTagString + tagContentsRegex + closeTagString, source)
        if found:
            # Strip the surrounding tags from the matched text.
            return found.group(0)[len(openTagString):-len(closeTagString)]
        return ""
    

    Example usage would be:

    <?xml version="1.0" encoding="utf-16"?>
    <parentNode>
        <childNode>123</childNode>
    </parentNode>
    
    ParseXmlTagContents(xmlString, "childNode", "[0-9]+")
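
    Putting the pieces together, a self-contained variant of the same idea might look like the sketch below. It uses a capture group instead of slicing; the name extract_tag_text is hypothetical, and like the function above it assumes the tag has no attributes and is not nested inside itself:

```python
import re

def extract_tag_text(source, tag, contents_regex):
    """Return the text between <tag> and </tag>, or an empty string."""
    # Group 1 wraps the caller's regex, so it is always the tag contents.
    pattern = "<{0}>({1})</{0}>".format(re.escape(tag), contents_regex)
    found = re.search(pattern, source)
    return found.group(1) if found else ""

xml_string = """<?xml version="1.0" encoding="utf-16"?>
<parentNode>
    <childNode>123</childNode>
</parentNode>"""

print(extract_tag_text(xml_string, "childNode", "[0-9]+"))  # prints 123
```

    This is a workaround for pulling out one value, not a parser: it never validates the document, so it keeps working on input that a real XML parser rejects.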
    
  • 2020-12-16 11:30

    The only thing that worked for me was passing the mode and encoding explicitly when opening the file, like this:

    with open(filenames[0], mode='r',encoding='utf-8') as f:
         readFile()
    

    Otherwise it failed every time with the invalid token error, for example when I simply did this:

     f = open(filenames[0], 'r')
     readFile()
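
    Without an explicit encoding, open() in text mode falls back to a platform-dependent default, which may not match the file. The sketch below writes a small made-up sample file and reads it back two ways: with an explicit encoding, and with ET.parse(), which reads the file itself and lets the parser use the encoding named in the XML declaration. The file name and contents are illustrative only:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_text = "<?xml version='1.0' encoding='utf-8'?>\n<root><name>Zoë</name></root>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(xml_text)

    # Option 1: decode explicitly, then parse the resulting str.
    with open(path, mode="r", encoding="utf-8") as f:
        root_from_text = ET.fromstring(f.read())

    # Option 2: hand the path to ElementTree and let the parser decode.
    root_from_file = ET.parse(path).getroot()

print(root_from_text.find("name").text)  # prints Zoë
print(root_from_file.find("name").text)  # prints Zoë
```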
    
  • 2020-12-16 11:31

    Here is a gotcha that caught me when using Python's ElementTree. The following raises the invalid token error:

    # -*- coding: utf-8 -*-
    import xml.etree.ElementTree as ET
    
    xml = u"""<?xml version='1.0' encoding='utf8'?>
    <osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="Съешь же ещё этих мягких французских булок да выпей чаю" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""
    
    xmltest = ET.fromstring(xml.encode("utf-8"))
    

    However, it works with the addition of a hyphen in the encoding type:

    <?xml version='1.0' encoding='utf-8'?>
    

    Most odd. Someone found this footnote in the Python docs:

    The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.
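
    When the declaration comes from an external source and cannot be fixed there, one option is to rewrite the encoding name before parsing. This is only a sketch under stated assumptions: the helper normalize_declared_encoding and its alias table are hypothetical, the table covers just the "utf8" spelling discussed here, and the input is assumed to be bytes that begin with the XML declaration:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical alias table: nonstandard spelling -> standard name.
_ENCODING_ALIASES = {b"utf8": b"utf-8"}

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(rb"""^(<\?xml[^>]*?encoding=)(['"])([^'"]+)\2""")

def normalize_declared_encoding(data):
    """Rewrite a nonstandard encoding name in the XML declaration, if any."""
    def fix(match):
        name = match.group(3)
        canonical = _ENCODING_ALIASES.get(name.lower(), name)
        return match.group(1) + match.group(2) + canonical + match.group(2)
    return _DECL_ENCODING.sub(fix, data, count=1)

raw = "<?xml version='1.0' encoding='utf8'?><tag v='булок' />".encode("utf-8")
fixed = normalize_declared_encoding(raw)
root = ET.fromstring(fixed)
print(root.get("v"))  # prints булок
```

    Input without a declaration, or with a name that is not in the table, is returned unchanged.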
