How to use lxml to grab specific parts of an XML document?

Submitted by 房东的猫 on 2019-12-08 06:58:37

Question


I am using Amazon's API to receive information about books. I am trying to use lxml to extract the specific parts of the XML document that my application needs, but I am not really sure how to use lxml. This is as far as I have gotten:

root = etree.XML(response)

This creates an etree Element for the XML document.

Here is what the XML document looks like: http://pastebin.com/GziDkf1a There are actually multiple "Items", but I only pasted one of them to give you a specific example. For each item, I want to extract the title and ISBN. How do I do that with the etree object that I have?

<ItemSearchResponse><Items><Item><ItemAttributes><Title>I want this info</Title></ItemAttributes></Item></Items></ItemSearchResponse>

<ItemSearchResponse><Items><Item><ItemAttributes><ISBN>And I want this info</ISBN></ItemAttributes></Item></Items></ItemSearchResponse>

Basically, I do not know how to traverse the tree using my etree object, and I want to learn how.

Edit 1: I am trying the following code:

tree = etree.fromstring(response)
for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    print(item)
    print(item.items()) # Apparently, there is nothing in item.items()
    for key, value in item.items():
        print(key)
        print(value)

But I get the following output: http://dpaste.com/287496/

I added the print(item.items()) call, and it just prints an empty list. Each item is an Element, but for some reason it has no items.

Edit 2: I can use the following code to get the information I want, but it seems like lxml must have an easier way... (this way doesn't seem very efficient):

for item in tree.iterfind(".//"+AMAZON_NS+"ItemAttributes"):
    title_text = ""
    author_text = ""
    isbn_text = ""
    for isbn in item.iterfind(".//"+AMAZON_NS+"ISBN"):
        isbn_text = isbn.text
    for title in item.iterfind(".//"+AMAZON_NS+"Title"):
        title_text = title.text
    for author in item.iterfind(".//"+AMAZON_NS+"Author"):
        author_text = author.text
    print(title_text + " by " + author_text + " has ISBN: " + isbn_text)

Answer 1:


Since you get the entire response as one large XML string, you can use lxml's fromstring function to parse it into a tree (it returns the root Element). You can then search it with findall or, since you want to iterate over the results, iterfind. There is a catch: Amazon's XML responses are namespaced, so every tag you search for must be qualified with the namespace for lxml to find it. Something like this ought to do the trick:

from lxml import etree

root = etree.fromstring(responseFromAmazon)

# this creates a constant with the namespace in the form that lxml can use it
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# this searches the tree and iterates over results, taking the namespace into account
for eachitem in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    # ISBN and Title are child elements, not XML attributes, so iterate
    # over the children instead of calling eachitem.items()
    for child in eachitem:
        if child.tag == AMAZON_NS + 'ISBN':
            pass  # Do your stuff with child.text
        if child.tag == AMAZON_NS + 'Title':
            pass  # Do your stuff with child.text

EDIT 1

See if this works better:

root = etree.fromstring(responseFromAmazon)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
item = {}
for attr in root.iterfind(".//" + AMAZON_NS + "ItemAttributes"):
    # store every child (Title, ISBN, ...) under its tag name without the namespace
    for child in attr:
        item[child.tag.replace(AMAZON_NS, "")] = child.text

Then, you can access item["Title"], item["ISBN"], etc. as needed.




Answer 2:


This is tested to work with both lxml.etree and xml.etree.cElementTree running Python 2.7.1.

import lxml.etree as ET
# Also works with cElementTree (included in the standard CPython 2.x library).
# Use this import:
# import xml.etree.cElementTree as ET
# On Python 3 use xml.etree.ElementTree instead (cElementTree was removed in 3.9).
t = ET.fromstring(xmlstring) # your data -- with 2 missing tags added at the end :-)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    # An ItemAttributes element has *children* named ISBN, Title, Author, etc.
    # NOTE WELL: *children* not *attributes*
    for tag in ('ISBN', 'Title'):
        # Find the first child with that name ...
        elem = ia.find(AMAZON_NS+tag)
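        # find() returns None when the child is missing, so guard before reading .text
        print("%s: %r" % (tag, elem.text if elem is not None else None))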

Output:

ISBN: '0534950973'
Title: 'Introduction to the Theory of Computation'
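
To see why item.items() came back empty in the question, it may help to contrast an element's attributes with its child elements. The following is only a minimal sketch: the sample XML and its lang attribute are made up for illustration (the real Amazon response is namespaced and has no such attribute), and the import fallback simply assumes lxml might not be installed.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as ET

# Hypothetical sample, not the real Amazon response.
SAMPLE = (
    '<ItemAttributes lang="en">'
    '<ISBN>0534950973</ISBN>'
    '<Title>Introduction to the Theory of Computation</Title>'
    '</ItemAttributes>'
)

ia = ET.fromstring(SAMPLE)

# items() returns the XML attributes of the start tag as (name, value) pairs.
attributes = dict(ia.items())

# Iterating over the element itself yields its child elements.
children = {child.tag: child.text for child in ia}

print(attributes)
print(children)
```

Here items() only sees lang, while ISBN and Title show up when iterating over the element.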

If you want to produce a dictionary of all the children of the ItemAttributes node, it takes only a minor variation:

import lxml.etree as ET
# Also works with cElementTree (included in the standard CPython 2.x library).
# Use this import:
# import xml.etree.cElementTree as ET
# On Python 3 use xml.etree.ElementTree instead (cElementTree was removed in 3.9).
from pprint import pprint as pp
t = ET.fromstring(xmlstring)
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
TAGPOS = len(AMAZON_NS)
# Find all ItemAttributes elements.
for ia in t.iter(AMAZON_NS+'ItemAttributes'):
    item = {}
    # Iterate over all the children of the ItemAttributes node
    for elem in ia:
        # remove namespace stuff from key, remove extraneous whitespace from value
        # (elem.text is None for an empty element, hence the "or ''")
        item[elem.tag[TAGPOS:]] = (elem.text or '').strip()
    pp(item)

and the output is:

{'Author': 'Michael Sipser',
 'Binding': 'Hardcover',
 'DeweyDecimalNumber': '511.35',
 'EAN': '9780534950972',
 'Edition': '2',
 'ISBN': '0534950973',
 'IsEligibleForTradeIn': '1',
 'Label': 'Course Technology',
 'Languages': '',
 'ListPrice': '',
 'Manufacturer': 'Course Technology',
 'NumberOfItems': '1',
 'NumberOfPages': '400',
 'PackageDimensions': '',
 'ProductGroup': 'Book',
 'ProductTypeName': 'ABIS_BOOK',
 'PublicationDate': '2005-02-15',
 'Publisher': 'Course Technology',
 'Studio': 'Course Technology',
 'Title': 'Introduction to the Theory of Computation',
 'TradeInValue': ''}
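
If only a few fields are needed per item, a somewhat shorter variant is to pass a prefix-to-namespace map and use findtext, which returns a default instead of failing when a child is missing. This is a sketch under assumptions, not the method used above: the aws prefix, the extract_books helper, and the trimmed two-item sample response are all hypothetical names and data chosen for illustration.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as ET

# The prefix "aws" is arbitrary; only the namespace URI has to match the response.
NS = {"aws": "http://webservices.amazon.com/AWSECommerceService/2009-10-01"}

# Hypothetical, trimmed-down response with two items (the second has no ISBN).
SAMPLE = (
    '<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    '<Items>'
    '<Item><ItemAttributes><ISBN>0534950973</ISBN>'
    '<Title>Introduction to the Theory of Computation</Title></ItemAttributes></Item>'
    '<Item><ItemAttributes><Title>Untitled draft</Title></ItemAttributes></Item>'
    '</Items></ItemSearchResponse>'
)


def extract_books(xml_text):
    """Return one dict with title and ISBN per ItemAttributes element."""
    root = ET.fromstring(xml_text)
    books = []
    for ia in root.findall(".//aws:ItemAttributes", NS):
        books.append({
            # findtext returns the default when the child element is absent
            "title": ia.findtext("aws:Title", default="", namespaces=NS),
            "isbn": ia.findtext("aws:ISBN", default="", namespaces=NS),
        })
    return books


books = extract_books(SAMPLE)
for book in books:
    print(book["title"] + " has ISBN: " + book["isbn"])
```

This avoids the nested loops from the question's second edit while still producing one record per item.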



Answer 3:


I would recommend using pyaws first. Then you wouldn't have to worry about XML parsing. If not you can use something to the effect of:

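from lxml import etree

# etree.parse() expects a filename or file object; use fromstring() for a string
root = etree.fromstring(xmlResponse)
# the response is namespaced, so a bare '//ISBN' matches nothing; bind a prefix
ns = {'a': 'http://webservices.amazon.com/AWSECommerceService/2009-10-01'}
root.xpath('//a:ISBN', namespaces=ns)[0].text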



Answer 4:


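from lxml import etree
root = etree.XML("YourXMLData")
# The response is namespaced, so the tag must be qualified with the namespace URI
AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"
items = root.findall('.//' + AMAZON_NS + 'ItemAttributes')
for eachitem in items:
    # ISBN and Title are child elements; eachitem.items() would only list XML attributes
    for child in eachitem:
        if child.tag == AMAZON_NS + 'ISBN':
            pass  # Do your stuff with child.text
        if child.tag == AMAZON_NS + 'Title':
            pass  # Do your stuff with child.text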

This is one way of doing it. Instead of loading the XML from a string, you may also use the parse method. The key thing is to use the find method and its friends to get to the specific node and then iterate over that node's child elements.
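
For the parse route, a rough sketch might look like the following. It assumes the response has been saved somewhere readable as a file; an in-memory io.BytesIO object stands in for that file here, and the one-item sample is made up for illustration.

```python
import io

try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as ET

AMAZON_NS = "{http://webservices.amazon.com/AWSECommerceService/2009-10-01}"

# Hypothetical stand-in for a saved response; a real file opened in
# binary mode (or a filename) can be passed to parse() the same way.
xml_file = io.BytesIO(
    b'<ItemSearchResponse xmlns="http://webservices.amazon.com/AWSECommerceService/2009-10-01">'
    b'<Items><Item><ItemAttributes><ISBN>0534950973</ISBN></ItemAttributes></Item></Items>'
    b'</ItemSearchResponse>'
)

tree = ET.parse(xml_file)  # parse() returns an ElementTree, not an Element
root = tree.getroot()      # the root Element, as fromstring() would return

# Collect the text of every ISBN element anywhere below the root.
isbns = [elem.text for elem in root.iter(AMAZON_NS + "ISBN")]
print(isbns)
```

The main difference from fromstring is the extra getroot() step; the searching afterwards is the same.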



Source: https://stackoverflow.com/questions/4456421/how-to-use-lxml-to-grab-specific-parts-of-an-xml-document
