ElementTree will not parse special characters with Python 2.7

问题

I had to rewrite my python script from python 3 to python2 and after that I got problem parsing special characters with ElementTree.

This is a piece of my xml:

<account number="89890000" type="Kostnad" taxCode="597" vatCode="">Avsättning egenavgifter</account>

This is the ouput when I parse this row:

('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avs\xc3\xa4ttning egenavgifter')

So it seems to be a problem with the character "ä".

This is how i do it in the code:

sys.setdefaultencoding( "UTF-8" )
xmltree = ET()

xmltree.parse("xxxx.xml")

printAccountPlan(xmltree)

def printAccountPlan(xmltree):
    print("account:",str(i.attrib['number']),      "AccountType:",str(i.attrib['type']),"Name:",str(i.text))

Anyone have an ide to get the ElementTree parse the charracter "ä", so the result will be like this:

('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')

回答1:

You're running into two separate differences between Python 2 and Python 3 at the same time, which is why you're getting unexpected results.

The first difference is one you're probably already aware of: Python's print statement in version 2 became a print function in version 3. That change is creating a special circumstance in your case, which I'll get to a little later. But briefly, this is the difference in how 'print' works:

In Python 3:

>>> # Two arguments 'Hi' and 'there' get passed to the function 'print'.
>>> # They are concatenated with a space separator and printed.
>>> print('Hi', 'there') 
>>> Hi there

In Python 2:

>>> # 'print' is a statement which doesn't need parenthesis.
>>> # The parenthesis instead create a tuple containing two elements 
>>> # 'Hi' and 'there'. This tuple is then printed.
>>> print('Hi', 'there')
>>> ('Hi', 'there')

The second problem in your case is that tuples print themselves by calling repr() on each of their elements. In Python 3, repr() displays unicode as you want. But in Python 2, repr() uses escape characters for any byte values which fall outside the printable ASCII range (e.g., larger than 127). This is why you're seeing them.

You may decide to resolve this issue, or not, depending on what you're goal is with your code. The representation of a tuple in Python 2 uses escape characters because it's not designed to be displayed to an end-user. It's more for your internal convenience as a developer, for troubleshooting and similar tasks. If you're simply printing it for yourself, then you may not need to change a thing because Python is showing you that the encoded bytes for that non-ASCII character are correctly there in your string. If you do want to display something to the end-user which has the format of how tuples look, then one way to do it (which retains correct printing of unicode) is to manually create the formatting, like this:

def printAccountPlan(xmltree):
    data = (i.attrib['number'], i.attrib['type'], i.text)
    print "('account:', '%s', 'AccountType:', '%s', 'Name:', '%s')" % data
# Produces this:
# ('account:', '89890000', 'AccountType:', 'Kostnad', 'Name:', 'Avsättning egenavgifter')

来源：https://stackoverflow.com/questions/19367590/elementtree-will-not-parse-special-characters-with-python-2-7

标签

python

python-2.7

elementtree