What is the right way to use Cyrillic in the Python lxml library?

风流意气都作罢 submitted on 2019-12-10 15:58:13

Question


I am trying to generate .xml files that contain Cyrillic characters, but the result is unexpected. What is the simplest way to avoid this result? Example:

from lxml import etree

root = etree.Element('пример')

print(etree.tostring(root))

What I get is:

b'<&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;/>'

Instead of:

b'<пример/>'

Answer 1:


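Without additional arguments, etree.tostring() returns a bytes object containing only ASCII; non-ASCII characters are written as numeric character references such as &#1087;. To get text instead, you could use etree.tounicode():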

>>> from lxml import etree
>>> root = etree.Element('пример')
>>> print(etree.tostring(root))
b'<&#1087;&#1088;&#1080;&#1084;&#1077;&#1088;/>'
>>> print(etree.tounicode(root))
<пример/>

or specify a codec with the encoding argument; you'd still get bytes however, so the output would need to be decoded again:

>>> print(etree.tostring(root, encoding='utf8'))
b'<\xd0\xbf\xd1\x80\xd0\xb8\xd0\xbc\xd0\xb5\xd1\x80/>'
>>> print(etree.tostring(root, encoding='utf8').decode('utf8'))
<пример/>

Setting encoding='unicode' returns a str with the same output tounicode() produces, and is the preferred spelling:

>>> print(etree.tostring(root, encoding='unicode'))
<пример/>


Source: https://stackoverflow.com/questions/29750592/what-is-right-way-to-use-cyrillic-in-python-lxml-library
