Parsing XML with Python - accessing elements

Submitted by 心已入冬 on 2020-01-02 19:14:56

Question


I'm using lxml to parse some XML, but for some reason I can't find a specific element.

I'm trying to access the <Constant> elements.

Here's an xml snippet:

  </rdf:Description>
</rdf:RDF>
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>

The code I'm using is like this:

    >>> from lxml import etree as ET
    >>> parsed = ET.parse('ct.cps')
    >>> root = parsed.getroot()
    >>> for a in root.findall(".//Constant"):
    ...     print(a.attrib['key'])
    ... 
    >>> for a in root.findall('Constant'):
    ...     print(a.get('key'))
    ... 
    >>> for a in root.findall('Constant'):
    ...     print(a.attrib['key'])
    ... 

As you can see, none of these loops prints anything.

What am I doing wrong?


EDIT: I'm wondering if it has something to do with the fact that <Constant> elements are empty?


EDIT2: Source xml here: https://www.dropbox.com/s/i6hga7nvmcd6rxx/ct.cps?dl=0


Answer 1:


Here is how you can get the values you are looking for:

from lxml import etree

parsed = etree.parse('ct.cps')

for a in parsed.findall("//{http://www.copasi.org/static/schema}Constant"):
    print(a.attrib["key"])

Output:

Parameter_4344
Parameter_4343
Parameter_4342
Parameter_4341
Parameter_4340
Parameter_4339
Parameter_4338
Parameter_4337
Parameter_4336
Parameter_4335
Parameter_4334
Parameter_4333
Parameter_4332
Parameter_4331
Parameter_4330
Parameter_4329
Parameter_4328
Parameter_4327
Parameter_4326
Parameter_4325
Parameter_4324
Parameter_4323
Parameter_4322
Parameter_4321
Parameter_4320
Parameter_4319

The important thing here is that the COPASI root element in your XML file (the real one at the Dropbox URL) declares a default namespace (http://www.copasi.org/static/schema). This means that the element and all its descendants, including Constant, belong to that namespace.

So instead of Constant elements, you need to look for {http://www.copasi.org/static/schema}Constant elements.

See http://lxml.de/tutorial.html#namespaces.
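
To make the effect of the default namespace concrete, here is a minimal, self-contained sketch. The inline XML is a made-up stand-in for ct.cps (only the root element name, the namespace URI and the two Constant elements come from this thread), and the fallback import to the standard library's xml.etree.ElementTree is an assumption so that the snippet also runs where lxml is not installed. Both libraries accept the findall paths used here.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: fall back to the stdlib, which accepts the same paths.
    import xml.etree.ElementTree as etree

# Hypothetical miniature of ct.cps: the root declares a default namespace,
# so every unprefixed descendant inherits it.
XML = b"""<COPASI xmlns="http://www.copasi.org/static/schema">
  <ListOfConstants>
    <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
    <Constant key="Parameter_4343" name="km" value="479.617"/>
  </ListOfConstants>
</COPASI>"""

root = etree.fromstring(XML)

# The parsed tag name carries the namespace in {uri}local form.
print(root.tag)

# An unqualified search matches nothing, which is what the question observed.
plain = root.findall(".//Constant")

# A namespace-qualified search finds both elements.
qualified = root.findall(".//{http://www.copasi.org/static/schema}Constant")

print(len(plain), [c.get("key") for c in qualified])
```

Printing root.tag on the real file is a quick way to check whether a namespace is in play before writing any search path.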


Here is how you could do it using XPath instead of findall:

from lxml import etree

NSMAP = {"c": "http://www.copasi.org/static/schema"}

parsed = etree.parse('ct.cps')

for a in parsed.xpath("//c:Constant", namespaces=NSMAP):
    print(a.attrib["key"])

See http://lxml.de/xpathxslt.html#namespaces-and-prefixes.
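
Both snippets in this answer hardcode the namespace URI. If you would rather read it from the document, one option is to take it from the root element's tag. The sketch below does that with plain string handling; the helper name namespace_of and the inline XML are illustrative assumptions, and the stdlib fallback import is there only so the example runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib stores tags in the same '{uri}local' form.
    import xml.etree.ElementTree as etree


def namespace_of(element):
    """Return the namespace URI of an element, or '' if it has none.

    Hypothetical helper: relies on tags being stored as '{uri}local'.
    """
    tag = element.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].partition("}")[0]
    return ""


# Made-up stand-in for the real file.
XML = b"""<COPASI xmlns="http://www.copasi.org/static/schema">
  <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
</COPASI>"""

root = etree.fromstring(XML)
ns = namespace_of(root)

# Build the qualified path from the discovered URI instead of hardcoding it.
keys = [c.get("key") for c in root.findall(".//{%s}Constant" % ns)]
print(ns, keys)
```

This only helps when the elements you want share the root's namespace, which is the case described in this answer.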




Answer 2:


First, please disregard my comment. It turns out that lxml.etree is much better than the standard xml.etree.ElementTree in that it takes care of the namespace. The problem is that you want to search for '//Constant', which matches nodes at any level, but an element such as the root does not accept an absolute path:

>>> root.findall('//Constant')
SyntaxError: cannot use absolute path on element

However, you can run that search one level higher, on the tree object returned by parse():

>>> parsed.findall('//Constant')
[<Element Constant at 0x10a7ce128>, <Element Constant at 0x10a7ce170>]
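
The behaviour on an element can be reproduced on a small document. This is a sketch with made-up, namespace-free XML (so it says nothing about namespaces in the real ct.cps), and the stdlib fallback import is an assumption for environments without lxml. On an element, a path starting with // is rejected, whereas the relative form .// searches all descendants of that element.

```python
try:
    from lxml import etree as ET
except ImportError:
    # Assumption: the stdlib raises the same error for absolute paths.
    import xml.etree.ElementTree as ET

# Made-up document without any namespace.
XML = b"""<root>
  <ListOfConstants>
    <Constant key="Parameter_4344"/>
    <Constant key="Parameter_4343"/>
  </ListOfConstants>
</root>"""

root = ET.fromstring(XML)

# Absolute paths are not allowed when searching from an element.
try:
    root.findall("//Constant")
    absolute_error = None
except SyntaxError as exc:
    absolute_error = str(exc)

# The relative form './/' searches every descendant of the element.
keys = [c.get("key") for c in root.findall(".//Constant")]

print(absolute_error)
print(keys)
```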

Update

Here is the full code. Since I don't have your full XML file, I made something up to fill in the blanks.

from io import BytesIO

from lxml import etree as ET

xml_text = """<?xml version='1.0' encoding='utf-8' ?>

<rdf:root  xmlns:rdf='http://foo.bar.com/rdf'>
<rdf:RDF>
  <rdf:Description>
    DescriptionX
  </rdf:Description>
</rdf:RDF>
<rdf:foo>
        <MiriamAnnotation>
          bar
        </MiriamAnnotation>
        <ListOfSubstrates>
          <Substrate metabolite="Metabolite_5" stoichiometry="1"/>
        </ListOfSubstrates>
        <ListOfModifiers>
          <Modifier metabolite="Metabolite_9" stoichiometry="1"/>
        </ListOfModifiers>
        <ListOfConstants>
          <Constant key="Parameter_4344" name="Kcat" value="433.724"/>
          <Constant key="Parameter_4343" name="km" value="479.617"/>
        </ListOfConstants>
</rdf:foo>
</rdf:root>
"""

# Parse from bytes so the encoding declaration in the XML is honoured.
buffer = BytesIO(xml_text.encode('utf-8'))
tree = ET.parse(buffer)
# Search from the tree object, not from an element.
for constant_node in tree.findall('//Constant'):
    print(constant_node.attrib['key'])



Answer 3:


Don't use findall. It has a limited feature set and is designed to be compatible with ElementTree.

Instead, use xpath, which supports namespaces. From the above, it appears that you probably want to say something like

# possibilities, you need to get these right...
ns_dict = {"atom": "http://www.w3.org/2005/Atom",
           "rdf": "http://www.w3.org/2000/01/rdf-schema#"}

root = parsed.getroot()
for a in root.xpath('.//rdf:Constant', namespaces=ns_dict):
    print(a.attrib['key'])

Note that you must include a namespace prefix in your XPath expression whenever an element has a non-blank namespace, and each prefix must map to the namespace URI that the document itself uses for that element.
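
Put differently, the prefix in the expression is only a local alias, and what has to match the document is the URI it maps to. The sketch below illustrates this with findall and a namespaces mapping, which both lxml and the standard library accept (lxml's xpath takes the same kind of mapping). The inline XML, the helper keys and the prefixes c and anything are made up for illustration; the namespace URI is the one quoted in Answer 1.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib accepts the same namespaces mapping in findall.
    import xml.etree.ElementTree as etree

COPASI_NS = "http://www.copasi.org/static/schema"

# Made-up stand-in document using a default namespace.
XML = b"""<COPASI xmlns="http://www.copasi.org/static/schema">
  <Constant key="Parameter_4344"/>
  <Constant key="Parameter_4343"/>
</COPASI>"""

root = etree.fromstring(XML)


def keys(path, nsmap):
    """Hypothetical helper: collect 'key' attributes for a prefixed path."""
    return [c.get("key") for c in root.findall(path, namespaces=nsmap)]


# Any prefix works as long as it maps to the URI used in the document.
a = keys(".//c:Constant", {"c": COPASI_NS})
b = keys(".//anything:Constant", {"anything": COPASI_NS})

# A prefix mapped to a different URI matches nothing.
c_wrong = keys(".//c:Constant", {"c": "http://www.w3.org/2005/Atom"})

print(a, b, c_wrong)
```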

Update

Since you posted your original document, I see that there is no namespace assigned to the elements you are looking for. This will work, I just tried it with your source document:

for a in tree.xpath("//Constant"):
    print(a.attrib['key'])

You don't need a namespace because there is no default namespace specified in the document itself.



Source: https://stackoverflow.com/questions/31192887/parsing-xml-with-python-accessing-elements
