Parsing blank XML tags with LXML and Python

跟風遠走 posted on 2019-12-20 06:12:47

Question


When parsing XML documents in the format of:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model>Camaro</Model>
</Car>

I use the following code:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Color'])  # Blue

This code does not work if a tag is empty, such as:

<Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model/>
</Car>

Using the same code as above:

carData = element.xpath('//Root/Foo/Bar/Car/node()[text()]')
parsedCarData = [{field.tag: field.text for field in carData} for action in carData]
print(parsedCarData[0]['Model'])  # KeyError

How would I parse this blank tag?


Answer 1:


You're putting in a [text()] filter which explicitly asks only for elements which have text nodes in them... and then you're unhappy when it doesn't give you elements without text nodes?

Leave that filter out, and you'll get your Model element. (Selecting child elements with * rather than node() also keeps the whitespace text nodes between the tags out of the result.)

>>> import lxml.etree
>>> s='''
... <root>
...   <Car>
...     <Color>Blue</Color>
...     <Make>Chevy</Make>
...     <Model/>
...   </Car>
... </root>'''
>>> e = lxml.etree.fromstring(s)
>>> carData = e.xpath('Car/*')
>>> carData
[<Element Color at 0x23a5460>, <Element Make at 0x23a54b0>, <Element Model at 0x23a5500>]
>>> dict((el.tag, el.text) for el in carData)
{'Color': 'Blue', 'Make': 'Chevy', 'Model': None}
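
If there are several Car elements, the same idea can be sketched as one dict per car. This is only a rough sketch: parse_cars and its default parameter are names made up here, and the fallback to the standard library's xml.etree.ElementTree is an assumption so that the snippet also runs where lxml isn't installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers a similar element API.
    import xml.etree.ElementTree as etree


def parse_cars(xml_text, default=None):
    """Return one dict per <Car>, keeping empty child tags as `default`."""
    root = etree.fromstring(xml_text)
    cars = []
    for car in root.iter('Car'):
        record = {}
        for child in car:
            # Skip comments and processing instructions, whose tag is not a string.
            if not isinstance(child.tag, str):
                continue
            record[child.tag] = child.text if child.text is not None else default
        cars.append(record)
    return cars


SAMPLE = """
<root>
  <Car>
    <Color>Blue</Color>
    <Make>Chevy</Make>
    <Model/>
  </Car>
</root>"""

cars = parse_cars(SAMPLE, default='')
print(cars[0]['Model'] == '')
```

Passing default='' gives you an empty string for the blank tag instead of None, so later lookups don't need a special case.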

That said, if your immediate goal is to iterate over the nodes in the tree, you might consider using lxml.etree.iterparse() instead. It hands you elements while the document is still being parsed, and if you clear each element once you're done with it, you avoid holding a full tree in memory, which can be much more efficient than building the whole tree and then iterating over it with XPath. (Think SAX, but without the insane and painful API.)

Implementing with iterparse could look like this:

import io
import lxml.etree

def get_cars(infile):
    in_car = False
    current_car = {}
    for (event, element) in lxml.etree.iterparse(infile, events=('start', 'end')):
        if event == 'start':
            if element.tag == 'Car':
                in_car = True
                current_car = {}
            continue
        if not in_car:
            continue
        if element.tag == 'Car':
            # End of a Car element: emit the record, then free the parsed subtree.
            in_car = False
            yield current_car
            element.clear()
            continue
        # End of a child of Car: its text is complete at this point.
        current_car[element.tag] = element.text

for car in get_cars(infile=io.BytesIO(b'<root><Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car></root>')):
    print(car)

...it's more code, but (if we weren't using an in-memory buffer for the example) it could process a file much larger than could fit in memory.
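
A shorter variant is possible if you only listen for 'end' events and read the children of each finished Car element. Treat this as a sketch under assumptions: iter_cars is a made-up name, the input is assumed to have no nested Car elements, and the import falls back to the standard library's xml.etree.ElementTree when lxml isn't available.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's iterparse accepts the same arguments used below.
    import xml.etree.ElementTree as etree


def iter_cars(source):
    """Yield one dict per <Car> as soon as its closing tag has been parsed."""
    for _event, element in etree.iterparse(source, events=('end',)):
        if element.tag != 'Car':
            continue
        # All children are complete by the time the parent's 'end' event fires.
        yield {child.tag: child.text
               for child in element
               if isinstance(child.tag, str)}
        # Drop the children so already-processed cars don't pile up in memory.
        element.clear()


DATA = (b'<root>'
        b'<Car><Color>Blue</Color><Make>Chevy</Make><Model/></Car>'
        b'<Car><Color>Red</Color></Car>'
        b'</root>')

for car in iter_cars(io.BytesIO(DATA)):
    print(car)
```

For a real file you would pass the filename or an open binary file object instead of the BytesIO buffer.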




Answer 2:


I don't know if there's a better solution built into lxml, but you could just use .get():

print(parsedCarData[0].get('Model', ''))
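
One thing to watch for: .get() only falls back to the default when the key is missing. If the dict was built without the [text()] filter, 'Model' is present with the value None, and .get('Model', '') returns None. A small helper can cover both cases; the name field and the sample dict below are made up for illustration.

```python
# Hypothetical record, as produced when empty tags are kept with text None.
car = {'Color': 'Blue', 'Make': 'Chevy', 'Model': None}


def field(record, key, default=''):
    """Return record[key], or `default` if the key is missing or its value is None."""
    value = record.get(key)
    return default if value is None else value


print(field(car, 'Color'))
print(repr(field(car, 'Model')))
```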



Answer 3:


I would catch the exception:

try:
    print(parsedCarData[0]['Model'])
except KeyError:
    print('No model specified')

Exceptions in Python aren't exceptional in the same sense as in other languages, where they are more strictly linked to error conditions. Instead they are frequently part of the normal usage of modules, by design. An iterator raises StopIteration to signal it has reached the end of the iteration, for example.
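
As a rough illustration of both points (the function names here are made up), the first helper wraps the KeyError handling, and the second shows StopIteration being used as an ordinary end-of-data signal:

```python
def describe_model(record):
    """Look up 'Model', treating a missing key as a normal, expected case."""
    try:
        return record['Model']
    except KeyError:
        return 'No model specified'


def drain(iterator):
    """Collect items by hand, stopping when the iterator raises StopIteration."""
    items = []
    while True:
        try:
            items.append(next(iterator))
        except StopIteration:
            # Not an error: this is how an iterator says it is exhausted.
            break
    return items


print(describe_model({'Color': 'Blue'}))
print(drain(iter(['Color', 'Make'])))
```

A for loop does the same StopIteration handling for you behind the scenes.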

Edit: If you're sure only this item can be empty, @CharlesDuffy has it right in that using get() is probably better. But in general I'd consider using exceptions for handling diverse exceptional output easily.




Answer 4:


The solution: use a try/except block to catch the key error.



Source: https://stackoverflow.com/questions/9620164/parsing-blank-xml-tags-with-lxml-and-python
