Why doesn't XPath work when processing an XHTML document with lxml (in Python)?

后悔当初 2020-12-03 07:35

I am testing against the following test document:




        
3 Answers
  • 2020-12-03 07:42

    The problem is namespaces. When the document is parsed as XML, the img element is in the http://www.w3.org/1999/xhtml namespace, because that is the default namespace declared for it. Your XPath expression asks for an img element in no namespace, so nothing matches.

    Try this:

    >>> tree.getroot().xpath(
    ...     "//xhtml:img", 
    ...     namespaces={'xhtml':'http://www.w3.org/1999/xhtml'}
    ...     )
    [<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]
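
The difference between the two queries can be seen side by side in a minimal sketch. The test document from the question is not shown, so XHTML_DOC below is a hypothetical stand-in, and the variable names are only illustrative:

```python
from lxml import etree

# Hypothetical stand-in for the test document (the original is not shown).
XHTML_DOC = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Test</title></head>
  <body><img src="logo.png" alt="logo"/></body>
</html>"""

XHTML_NS = "http://www.w3.org/1999/xhtml"
root = etree.fromstring(XHTML_DOC)

# Unprefixed name: XPath looks for img in no namespace, so nothing matches.
no_namespace_hits = root.xpath("//img")

# Prefixed name: the prefix is bound to the XHTML namespace URI for this call.
namespaced_hits = root.xpath("//xhtml:img", namespaces={"xhtml": XHTML_NS})

print(len(no_namespace_hits), len(namespaced_hits))
print(namespaced_hits[0].tag)
```

The prefix in the namespaces mapping only has to match the one used in the expression; it does not need to appear in the document itself.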
    
  • 2020-12-03 07:56

    If you are only going to use tags from a single namespace, as appears to be the case above, you may be better off using lxml.objectify.

    In your case it would look like this:

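    from lxml import objectify
    # parse() returns an ElementTree; getroot() gives the html root element
    root = objectify.parse(url).getroot()  # also available: objectify.fromstring
    

    You can access the nodes as

    body = root.body  # root is already the html element
    for img in body.img:  # Assuming all images are direct children of the body tag
        print(img.get('src'))
    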

    While it might not be of great help with HTML, it can be highly useful with well-structured XML.

    For more info, check out http://lxml.de/objectify.html
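
As a rough, self-contained illustration of the objectify approach, the sketch below parses a small inline document instead of a URL. The document content and the names XHTML_DOC and image_sources are assumptions made for the example:

```python
from lxml import objectify

# Hypothetical XHTML snippet; every element lives in the default XHTML namespace.
XHTML_DOC = b"""<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Test</title></head>
  <body>
    <img src="a.png"/>
    <p>Some text</p>
    <img src="b.png"/>
  </body>
</html>"""

# fromstring() returns the root element directly, here the html element.
root = objectify.fromstring(XHTML_DOC)

# Attribute-style lookup searches for children in the parent's namespace,
# so no prefix or namespace mapping is needed.
body = root.body

# Iterating over body.img yields every img child of body with that tag.
image_sources = [img.get("src") for img in body.img]
print(image_sources)
```

This only reaches img elements that are direct children of body; images nested deeper would need a longer attribute path or an XPath query with a namespace prefix.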

  • 2020-12-03 07:58

    XPath considers all unprefixed names to be in "no namespace".

    In particular, the XPath 1.0 spec says:

    "A QName in the node test is expanded into an expanded-name using the namespace declarations from the expression context. This is the same way expansion is done for element type names in start and end-tags except that the default namespace declared with xmlns is not used: if the QName does not have a prefix, then the namespace URI is null (this is the same way attribute names are expanded)."

    See these two detailed explanations of the problem and its solution: here and here. The solution is to bind a prefix to the namespace URI (through the API that's being used) and to use that prefix on every otherwise unprefixed element name in the XPath expression.
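
With lxml as the API, binding a prefix might look like the sketch below. The document, the prefix h, and the variable names are assumptions chosen for illustration. It also shows the attribute half of the quoted rule: the unprefixed src attribute is in no namespace, so it stays unprefixed in the expression.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical document using the XHTML default namespace.
doc = etree.fromstring(
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<body><img src="a.png"/><img src="b.png"/></body>'
    b"</html>"
)

# Compile the expression once, binding the arbitrary prefix "h" to the URI.
find_sources = etree.XPath("//h:img/@src", namespaces={"h": XHTML_NS})
sources = find_sources(doc)

# Any other prefix bound to the same URI selects the same nodes.
same_sources = doc.xpath("//pfx:img/@src", namespaces={"pfx": XHTML_NS})

# Prefixing the attribute asks for a namespaced attribute, which does not exist here.
prefixed_attribute = doc.xpath("//h:img/@h:src", namespaces={"h": XHTML_NS})

print(list(sources), list(same_sources), prefixed_attribute)
```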

    Hope this helped.

    Cheers,

    Dimitre Novatchev
