ParseError: not well-formed (invalid token) using cElementTree

匆匆过客 提交于 2019-11-28 11:55:21

It seems to complain about \x08 you will need to escape that.

Edit:

Or you can have the parser ignore the errors using recover

from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)

I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)

EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...

Boldewyn

See this answer to another question and the according part of the XML spec.

The backspace U+0008 is an invalid character in XML documents. It must be represented as escaped entity  and cannot occur plainly.

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.

A solution for gottcha for me, using Python's ElementTree... this has the invalid token error:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

xml = u"""<?xml version='1.0' encoding='utf8'?>
<osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="Съешь же ещё этих мягких французских булок да выпей чаю" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""

xmltest = ET.fromstring(xml.encode("utf-8"))

However, it works with the addition of a hyphen in the encoding type:

<?xml version='1.0' encoding='utf-8'?>

Most odd. Someone found this footnote in the python docs:

The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.

Yura Vasiliuk

I have been in stuck with similar problem. Finally figured out the what was the root cause in my particular case. If you read the data from multiple XML files that lie in same folder you will parse also .DS_Store file. Before parsing add this condition

for file in files:
    if file.endswith('.xml'):
       run_your_code...

This trick helped me as well

None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:

from bs4 import BeautifulSoup

with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')

Then you can search the tree as:

soup.find_all('mytag')
Konrad

What helped me with that error was Juan's answer - https://stackoverflow.com/a/20204635/4433222 But wasn't enough - after struggling I found out that an XML file needs to be saved with UTF-8 without BOM encoding.

The solution wasn't working for "normal" UTF-8.

This is most probably an encoding error. For example I had an xml file encoded in UTF-8-BOM (checked from the Notepad++ Encoding menu) and got similar error message.

The workaround (Python 3.6)

import io
from xml.etree import ElementTree as ET

with io.open(file, 'r', encoding='utf-8-sig') as f:
    contents = f.read()
    tree = ET.fromstring(contents)

Check the encoding of your xml file. If it is using different encoding, change the 'utf-8-sig' accordingly.

I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single xml node I gave in and wrote my function to do so:

def ParseXmlTagContents(source, tag, tagContentsRegex):
    openTagString = "<"+tag+">"
    closeTagString = "</"+tag+">"
    found = re.search(openTagString + tagContentsRegex + closeTagString, source)
    if found:   
        start = found.regs[0][0]
        end = found.regs[0][1]
        return source[start+len(openTagString):end-len(closeTagString)]
    return ""

Example usage would be:

<?xml version="1.0" encoding="utf-16"?>
<parentNode>
    <childNode>123</childNode>
</parentNode>

ParseXmlTagContents(xmlString, "childNode", "[0-9]+")
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!