Question
Suppose I want to parse the following XML with an lxml XPath expression:
<pack xmlns="http://ns.qubic.tv/2010/item">
  <packitem>
    <duration>520</duration>
    <max_count>14</max_count>
  </packitem>
  <packitem>
    <duration>12</duration>
  </packitem>
</pack>
This is a variation of what can be found at http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html
How can I extract the different elements so that, once zipped (in the sense of Python's zip or izip function), they give me
[(520, 14), (12, None)]
?
The missing max_count tag in the second packitem keeps me from getting what I want.
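To make the problem concrete, here is a minimal sketch of what happens when the two values are collected with two separate document-wide queries and then zipped. It uses the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml installed (that substitution, and the names content, ns, durations, max_counts and pairs, are assumptions made for illustration); lxml's xpath should behave the same way for these paths.

```python
import xml.etree.ElementTree as ET

content = '''<pack xmlns="http://ns.qubic.tv/2010/item">
  <packitem>
    <duration>520</duration>
    <max_count>14</max_count>
  </packitem>
  <packitem>
    <duration>12</duration>
  </packitem>
</pack>'''

ns = {'i': 'http://ns.qubic.tv/2010/item'}
root = ET.fromstring(content)

# Two independent queries: each returns a flat list with no record of
# which packitem a value came from.
durations = [e.text for e in root.findall('.//i:packitem/i:duration', ns)]
max_counts = [e.text for e in root.findall('.//i:packitem/i:max_count', ns)]

# zip stops at the shorter list, so the second packitem is dropped
# instead of being paired with None.
pairs = list(zip(durations, max_counts))
print(durations)   # ['520', '12']
print(max_counts)  # ['14']
print(pairs)       # [('520', '14')]
```

Because the flat lists carry no information about their parent packitem, the alignment has to be recovered by querying each packitem separately.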
Answer 1:
from lxml import etree

def lxml_empty_str(context, nodes):
    # Replace a missing text node (None) with an empty string so that
    # text() returns '' for the element instead of skipping it.
    for node in nodes:
        node.text = node.text or ""
    return nodes

# Register the function in a custom XPath function namespace.
ns = etree.FunctionNamespace('http://ns.qubic.tv/lxmlfunctions')
ns['lxml_empty_str'] = lxml_empty_str

namespaces = {'i': "http://ns.qubic.tv/2010/item",
              'f': "http://ns.qubic.tv/lxmlfunctions"}

# root is the already parsed document.
packitems_duration = root.xpath(
    'f:lxml_empty_str(//i:pack/i:packitem/i:duration)/text()',
    namespaces=namespaces)
packitems_max_count = root.xpath(
    'f:lxml_empty_str(//i:pack/i:packitem/i:max_count)/text()',
    namespaces=namespaces)
packitems = zip(packitems_duration, packitems_max_count)
>>> packitems
[('520','14'), ('','23')]
http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html
Answer 2:
You could use xpath to find the packitem elements, then call xpath again (or findtext, as I do below) on each one to find its duration and max_count. Having to call xpath more than once may not be terribly speedy, but it works.
import lxml.etree as ET

content = '''<pack xmlns="http://ns.qubic.tv/2010/item">
<packitem>
<duration>520</duration>
<max_count>14</max_count>
</packitem>
<packitem>
<duration>12</duration>
</packitem>
</pack>
'''

def make_int(text):
    # findtext returns None for a missing element; int(None) raises TypeError.
    try:
        return int(text)
    except TypeError:
        return None

namespaces = {'ns': 'http://ns.qubic.tv/2010/item'}
doc = ET.fromstring(content)
result = [tuple([make_int(elt.findtext(path, namespaces=namespaces))
                 for path in ('ns:duration', 'ns:max_count')])
          for elt in doc.xpath('//ns:packitem', namespaces=namespaces)]
print(result)
# [(520, 14), (12, None)]
An alternative approach would be to use a SAX parser. That might be a little faster, but it takes a bit more code and the speed difference may not be important if the XML is not huge.
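For reference, a rough sketch of what the SAX route might look like, using the standard library's xml.sax rather than lxml. The names PackHandler, parse_pack, ITEM_NS and FIELDS are made up for this illustration, and treating an empty element as None is an assumption, not something the question specifies.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

ITEM_NS = 'http://ns.qubic.tv/2010/item'
FIELDS = ('duration', 'max_count')

class PackHandler(xml.sax.ContentHandler):
    """Collects one (duration, max_count) tuple per packitem."""

    def __init__(self):
        super().__init__()
        self.result = []
        self._item = None    # values of the packitem being read
        self._field = None   # name of the field being read, if any
        self._chars = []     # text chunks of the current field

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri != ITEM_NS:
            return
        if local == 'packitem':
            # Every field starts out as None, so a missing tag stays None.
            self._item = dict.fromkeys(FIELDS)
        elif local in FIELDS and self._item is not None:
            self._field = local
            self._chars = []

    def characters(self, data):
        # Text may arrive in several chunks; collect them all.
        if self._field is not None:
            self._chars.append(data)

    def endElementNS(self, name, qname):
        uri, local = name
        if uri != ITEM_NS:
            return
        if local == self._field:
            text = ''.join(self._chars).strip()
            self._item[local] = int(text) if text else None
            self._field = None
        elif local == 'packitem':
            self.result.append(tuple(self._item[f] for f in FIELDS))
            self._item = None

def parse_pack(xml_text):
    handler = PackHandler()
    parser = xml.sax.make_parser()
    # Namespace mode makes the parser call the *NS callbacks.
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode('utf-8')))
    return handler.result

content = '''<pack xmlns="http://ns.qubic.tv/2010/item">
<packitem><duration>520</duration><max_count>14</max_count></packitem>
<packitem><duration>12</duration></packitem>
</pack>'''

print(parse_pack(content))
# [(520, 14), (12, None)]
```

The handler never builds a tree, which is where any speed or memory advantage would come from, at the cost of tracking the parse state by hand.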
Source: https://stackoverflow.com/questions/11022292/lxml-xpath-in-python-how-to-handle-missing-tags