lxml unicode entity parse problems

Submitted by 我是研究僧i on 2019-12-20 03:33:11

Question


I'm using lxml as follows to parse an exported XML file from another system:

from lxml import etree

xmldoc = open(filename)  # filename: path to the exported XML file
etree.parse(xmldoc)

But I'm getting:

lxml.etree.XMLSyntaxError: Entity 'eacute' not defined, line 4495, column 46

Obviously it's having problems with named character entities, but how would I get around this? Via open() or parse()?

Edit: I had forgotten to include my DTD in the same folder - it's there now and has the following declaration:

<!ENTITY eacute "&#233;">

and is referenced (and always was) in xmldoc like so:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">

Yet I still get the same problem ... does the DTD need to be declared in Python too?


Answer 1:


eacute is not a predefined entity in XML. To include an &eacute; entity reference in an XML file, it must have a <!DOCTYPE> declaration pointing to a DTD (such as an XHTML 1.0 DTD) that defines the entity.

If the XML uses &eacute; but doesn't have a <!DOCTYPE>, it is not well-formed and the system that exported it needs to be fixed.
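For the case described in the question's edit, where the DOCTYPE and foo.dtd are both in place, it may be worth checking whether the parser is reading the DTD at all: lxml's XMLParser leaves load_dtd off by default. The following is a minimal sketch, not a confirmed fix for the asker's file. The helper name parse_with_dtd, the file name export.xml, and the sample element content are assumptions made up for the demo; only foo.dtd, the DOCTYPE line, and the entity declaration come from the question.

```python
import os
import tempfile

from lxml import etree


def parse_with_dtd(path):
    """Parse an XML file, letting lxml load the DTD named in its DOCTYPE.

    Hypothetical helper for illustration only.
    """
    # load_dtd=True makes the parser read the external DTD;
    # resolve_entities=True replaces entity references with their text;
    # no_network=True (lxml's default) keeps DTD lookups to local files.
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True, no_network=True)
    return etree.parse(path, parser)


# Build throwaway demo files (assumed layout: DTD next to the XML file).
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "foo.dtd"), "w", encoding="ascii") as f:
    f.write('<!ENTITY eacute "&#233;">\n')

xml_path = os.path.join(workdir, "export.xml")
with open(xml_path, "w", encoding="iso-8859-1") as f:
    f.write(
        '<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
        '<!DOCTYPE DScribeDatabase SYSTEM "foo.dtd">\n'
        "<DScribeDatabase><name>caf&eacute;</name></DScribeDatabase>\n"
    )

tree = parse_with_dtd(xml_path)
name_text = tree.findtext("name")
print(name_text)
```

Passing the file path (rather than an anonymous stream) gives the parser a base location from which the relative system identifier "foo.dtd" can be resolved.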

(There isn't a good reason to use an entity reference to represent é in an XML file. The character reference &#233; is understood everywhere without entity definitions, if the file can't simply include a raw UTF-8 é for some reason.)
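If the exporting system cannot be changed, one possible workaround is to rewrite named entity references as numeric character references before parsing. This is a sketch under assumptions, not something the answer proposes: it assumes the entity names in the file match the HTML names listed in Python's html.entities table, and the function name named_to_numeric and the sample document are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines; these are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(text):
    """Replace HTML-style named entities such as &eacute; with &#233;."""

    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined or unknown names untouched
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


raw = "<doc><name>caf&eacute; &amp; bar</name></doc>"
fixed = named_to_numeric(raw)
root = ET.fromstring(fixed)
print(fixed)
print(root.findtext("name"))
```

The substitution is purely textual, so it would also rewrite matching text inside CDATA sections or comments; for files that contain those, loading the DTD is the safer route.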



Source: https://stackoverflow.com/questions/2835077/lxml-unicode-entity-parse-problems
