Using an XML catalog with Python's lxml?

一曲冷凌霜 submitted on 2021-01-22 06:02:52

Question


Is there a way, when I parse an XML document using lxml, to validate that document against its DTD using an external catalog file? I need to be able to work with the fixed attributes defined in the document's DTD.


Answer 1:


You can add the catalog to the XML_CATALOG_FILES environment variable:

import os
os.environ['XML_CATALOG_FILES'] = 'file:///to/my/catalog.xml'

See this thread. Note that entries in XML_CATALOG_FILES are space-separated URLs. You can use Python's pathname2url and urljoin (with file:) to generate the URL from a pathname.
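As a rough sketch of that last step, the snippet below builds file: URLs with pathname2url and urljoin and joins them with spaces. The helper name catalog_url and both catalog paths are made up for illustration, and it assumes the variable is set before lxml parses its first document.

```python
import os
from urllib.parse import urljoin
from urllib.request import pathname2url


def catalog_url(path):
    """Turn a filesystem path into a file: URL, percent-encoding special characters."""
    return urljoin("file:", pathname2url(os.path.abspath(path)))


# Hypothetical catalog locations, for illustration only.
catalog_paths = ["/to/my/catalog.xml", "/opt/my catalogs/docbook.xml"]

# Entries are separated by spaces, so a space inside a path must be encoded.
os.environ["XML_CATALOG_FILES"] = " ".join(catalog_url(p) for p in catalog_paths)
```

Encoding matters here because a raw space inside a path would otherwise be read as a separator between two catalog entries.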




Answer 2:


Can you give an example? According to the lxml validation docs, lxml can handle DTD validation (specified in the XML doc or externally in code) and system catalogs, which covers most cases I can think of.

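from io import StringIO
from lxml import etree
f = StringIO("<!ELEMENT b EMPTY>")
dtd = etree.DTD(f)  # a DTD supplied directly in code
dtd = etree.DTD(external_id = "-//OASIS//DTD DocBook XML V4.2//EN")  # or one looked up by public ID through the catalog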
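For a lookup by public ID to succeed, some catalog has to map that ID to a DTD location. The following is a minimal sketch of generating such a catalog in the OASIS XML catalog format using only the standard library. The helper name build_catalog and the DTD path are assumptions for illustration, not something taken from the lxml docs.

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(public_id, dtd_uri):
    """Return an OASIS XML catalog mapping one public ID to a DTD location."""
    # Serialize the catalog namespace as the default namespace (no prefix).
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element(f"{{{CATALOG_NS}}}catalog")
    ET.SubElement(
        root,
        f"{{{CATALOG_NS}}}public",
        {"publicId": public_id, "uri": dtd_uri},
    )
    return ET.tostring(root, encoding="unicode")


# The DTD location is a hypothetical path, for illustration only.
catalog_xml = build_catalog(
    "-//OASIS//DTD DocBook XML V4.2//EN",
    "file:///to/my/docbookx.dtd",
)
print(catalog_xml)
```

Saved to a file, the result could then be listed in XML_CATALOG_FILES or referenced from the system catalog.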



Answer 3:


It seems that lxml does not expose this libxml2 feature; grepping the source only turns up some #defines for the error handling:

C:\Dev>grep -ir --include=*.px[id] catalog lxml-2.1.1/src | sed -r "s/\s+/ /g"
lxml-2.1.1/src/lxml/dtd.pxi: catalog.
lxml-2.1.1/src/lxml/xmlerror.pxd: XML_FROM_CATALOG = 20 # The Catalog module
lxml-2.1.1/src/lxml/xmlerror.pxd: XML_WAR_CATALOG_PI = 93 # 93
lxml-2.1.1/src/lxml/xmlerror.pxd: XML_CATALOG_MISSING_ATTR = 1650
lxml-2.1.1/src/lxml/xmlerror.pxd: XML_CATALOG_ENTRY_BROKEN = 1651 # 1651
lxml-2.1.1/src/lxml/xmlerror.pxd: XML_CATALOG_PREFER_VALUE = 1652 # 1652
lxml-2.1.1/src/lxml/xmlerror.pxd: XML_CATALOG_NOT_CATALOG = 1653 # 1653
lxml-2.1.1/src/lxml/xmlerror.pxd: XML_CATALOG_RECURSION = 1654 # 1654
lxml-2.1.1/src/lxml/xmlerror.pxi:CATALOG=20
lxml-2.1.1/src/lxml/xmlerror.pxi:WAR_CATALOG_PI=93
lxml-2.1.1/src/lxml/xmlerror.pxi:CATALOG_MISSING_ATTR=1650
lxml-2.1.1/src/lxml/xmlerror.pxi:CATALOG_ENTRY_BROKEN=1651
lxml-2.1.1/src/lxml/xmlerror.pxi:CATALOG_PREFER_VALUE=1652
lxml-2.1.1/src/lxml/xmlerror.pxi:CATALOG_NOT_CATALOG=1653
lxml-2.1.1/src/lxml/xmlerror.pxi:CATALOG_RECURSION=1654

Judging from the libxml2 page on its catalog implementation, the 'transparent' handling through a catalog installed at /etc/xml/catalog may still work in lxml. If you need more than that, you can abandon lxml and use the default Python bindings for libxml2, which do expose the catalog functions.



Source: https://stackoverflow.com/questions/12591/using-an-xml-catalog-with-pythons-lxml
