lxml incremental XML serialisation repeats namespaces

被刻印的时光 ゝ submitted on 2021-01-27 07:41:44

Question


I am currently serializing some largish XML files in Python with lxml. I want to use the incremental writer for that. My XML format heavily relies on namespaces and attributes. When I run the following code

from io import BytesIO

from lxml import etree

sink = BytesIO()

nsmap = {
    'test': 'http://test.org',
    'foo': 'http://foo.org',
    'bar': 'http://bar.org',
}

with etree.xmlfile(sink) as xf:
    with xf.element("test:testElement", nsmap=nsmap):
        name = etree.QName(nsmap["foo"], "fooElement")
        elem = etree.Element(name)

        xf.write(elem)

print(sink.getvalue().decode('utf-8'))

then I get the following output:

<test:testElement xmlns:bar="http://bar.org" 
 xmlns:foo="http://foo.org" 
 xmlns:test="http://test.org">
    <ns0:fooElement xmlns:ns0="http://foo.org"/>
</test:testElement>

As you can see, the namespace for foo is declared a second time, and under the generated prefix ns0 rather than my own prefix:

<ns0:fooElement xmlns:ns0="http://foo.org"/>

How do I make it so lxml only adds the namespace in the root and children use the correct prefix from there? I think I need to use etree.Element, as I need to add some attributes to the node.

What did not work:

1) Using register_namespace

for prefix, uri in nsmap.items():
    etree.register_namespace(prefix, uri)

That still repeats the declaration, although the prefix is then correct. I would rather avoid it anyway, as it changes global state.

2) Specifying the nsmap in the element:

elem = etree.Element(name, nsmap=nsmap)

yields

<foo:fooElement xmlns:bar="http://bar.org" 
 xmlns:foo="http://foo.org" 
 xmlns:test="http://test.org"/>

for the fooElement.

I also looked in the documentation and source code of lxml, but it is written in Cython, so it is really hard to read and search. The context manager of xf.element does not return the element, e.g.

with xf.element('foo:fooElement') as e:
    print(e)

prints None.


Answer 1:


It is possible to produce something close to what you are looking for:

from io import BytesIO

from lxml import etree

sink = BytesIO()

nsmap = {
    'test': 'http://test.org',
    'foo': 'http://foo.org',
    'bar': 'http://bar.org',
}

with etree.xmlfile(sink) as xf:
    with xf.element("test:testElement", nsmap=nsmap):
        with xf.element("foo:fooElement"):
            pass

print(sink.getvalue().decode('utf-8'))

This produces the XML:

<test:testElement xmlns:bar="http://bar.org" xmlns:foo="http://foo.org" xmlns:test="http://test.org"><foo:fooElement></foo:fooElement></test:testElement>

The extra namespace declaration is gone, but instead of a self-closing element, you get a pair of opening and closing tags for foo:fooElement.
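Since the question mentions needing attributes on the child, here is a minimal sketch of the same approach with attributes attached. It assumes that xf.element accepts tags in Clark notation ({uri}local) and an attrib mapping, which is my reading of the lxml API; the attribute name id, its value, and the text content are made up for illustration.

```python
from io import BytesIO

from lxml import etree

NSMAP = {
    "test": "http://test.org",
    "foo": "http://foo.org",
    "bar": "http://bar.org",
}

sink = BytesIO()
with etree.xmlfile(sink) as xf:
    # The root element declares all three prefixes once.
    with xf.element("{http://test.org}testElement", nsmap=NSMAP):
        # Clark notation leaves prefix lookup to the writer;
        # attrib carries the (hypothetical) attributes for this element.
        with xf.element("{http://foo.org}fooElement", attrib={"id": "42"}):
            xf.write("some text")  # text content, escaped by the writer

xml_bytes = sink.getvalue()
print(xml_bytes.decode("utf-8"))
```

If this behaves as assumed, the foo namespace should be declared only on the root, and the child should still be written as an explicit start/end tag pair rather than a self-closing element.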

I looked at the source code of lxml.etree.xmlfile and do not see it maintaining state that it could examine to know which namespaces are already declared and so avoid declaring them again. It is possible I missed something, but I really don't think I did.

The point of an incremental XML serializer is to operate without using gobs of memory. When memory is not an issue, you can just create a tree of objects representing the XML document and serialize that, but you pay a significant memory cost because the whole tree has to stay in memory until it is serialized. An incremental serializer dodges that cost, and to maximize the savings it must minimize the amount of state it maintains. If it took the parents of an element into account when serializing that element, it would have to "remember" what those parents were. In the worst case it would maintain so much state that it would provide no benefit over creating a tree of XML objects and serializing that.
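To make "remembering the parents" concrete, here is a toy writer in plain Python. It is only a sketch of the idea and says nothing about how lxml is implemented; TinyWriter and its start/end methods are hypothetical names. The state it keeps is one prefix map per currently open element.

```python
from io import StringIO
from xml.sax.saxutils import quoteattr


class TinyWriter:
    """Toy incremental XML writer that tracks in-scope namespace prefixes."""

    def __init__(self, out):
        self.out = out
        # One entry per open element: (qualified name, {uri: prefix} in scope).
        self.stack = []

    def start(self, uri, local, nsmap=None):
        # Inherit the parent's scope, then emit only declarations that are new.
        scope = dict(self.stack[-1][1]) if self.stack else {}
        decls = []
        for prefix, ns_uri in (nsmap or {}).items():
            if scope.get(ns_uri) != prefix:
                scope[ns_uri] = prefix
                decls.append(f" xmlns:{prefix}={quoteattr(ns_uri)}")
        if uri not in scope:
            raise ValueError(f"no prefix in scope for {uri!r}")
        qname = f"{scope[uri]}:{local}"
        self.out.write(f"<{qname}{''.join(decls)}>")
        self.stack.append((qname, scope))

    def end(self):
        # Closing an element discards its scope along with its name.
        qname, _ = self.stack.pop()
        self.out.write(f"</{qname}>")


out = StringIO()
w = TinyWriter(out)
w.start("http://test.org", "testElement",
        nsmap={"test": "http://test.org", "foo": "http://foo.org"})
w.start("http://foo.org", "fooElement")  # prefix found on the stack, nothing redeclared
w.end()
w.end()
result = out.getvalue()
print(result)
```

In this sketch the remembered state grows with nesting depth and with the number of prefixes in scope, so the cost described above depends on how deep and how namespace-heavy the document is.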




Answer 2:


You need to create a SubElement:

from lxml import etree

_nsmap = {
    'test': 'http://test.org',
    'foo': 'http://foo.org',
    'bar': 'http://bar.org',
}

# The root element declares every prefix once.
root = etree.Element(
    "{http://bar.org}test",
    creator='SO',
    nsmap=_nsmap,
)

doc = etree.ElementTree(root)
name = etree.QName(_nsmap["foo"], "fooElement")
# A SubElement lives in the root's tree, so it reuses the foo prefix declared there.
elem = etree.SubElement(root, name)

doc.write('/tmp/foo.xml', xml_declaration=True, encoding='utf-8', pretty_print=True)
with open('/tmp/foo.xml') as f:
    print(f.read())

Returns:

<?xml version='1.0' encoding='UTF-8'?>
<bar:test xmlns:bar="http://bar.org" xmlns:foo="http://foo.org" xmlns:test="http://test.org" creator="SO">
  <foo:fooElement/>
</bar:test>


Source: https://stackoverflow.com/questions/53083828/lmxl-incremental-xml-serialisation-repeats-namespaces
