XPath predicate with sub-paths with lxml?

Question

I'm trying to understand an XPath that was sent to me for use with ACORD XML forms (a common format in insurance). The XPath they sent me is (truncated for brevity):

./PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo

Where I'm running into trouble is that Python's lxml library is telling me that [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"] is an invalid predicate. I can't find anything in the XPath spec's section on predicates that describes this syntax, so I don't know how to modify the predicate to make it work.

Is there any documentation on what exactly this predicate is selecting? Also, is this even a valid predicate, or has something been mangled somewhere?

Possibly related:

I believe the company I am working with is an MS shop, so this XPath may be valid in C# or some other language in that stack? I'm not entirely sure.

Updates:

As requested in the comments, here is some additional info.

XML sample:

<ACORD>
  <InsuranceSvcRq>
    <HomePolicyQuoteInqRq>
      <PersPolicy>
        <PersApplicationInfo>
            <InsuredOrPrincipal>
                <InsuredOrPrincipalInfo>
                    <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
                </InsuredOrPrincipalInfo>
                <GeneralPartyInfo>
                    <Addr>
                        <Addr1></Addr1>
                    </Addr>
                </GeneralPartyInfo>
            </InsuredOrPrincipal>
        </PersApplicationInfo>
      </PersPolicy>
    </HomePolicyQuoteInqRq>
  </InsuranceSvcRq>
</ACORD>

Code sample (with full XPath instead of snippet):

>>> from lxml import etree
>>> tree = etree.fromstring(raw)
>>> tree.find('./InsuranceSvcRq/HomePolicyQuoteInqRq/PersPolicy/PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo/Addr/Addr1')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "lxml.etree.pyx", line 1409, in lxml.etree._Element.find (src/lxml/lxml.etree.c:39972)
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 271, in find
    it = iterfind(elem, path, namespaces)
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 261, in iterfind
    selector = _build_path_iterator(path, namespaces)
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 245, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))
  File "/Library/Python/2.5/site-packages/lxml-2.3-py2.5-macosx-10.3-i386.egg/lxml/_elementpath.py", line 207, in prepare_predicate
    raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate

Answer 1:

Change tree.find to tree.xpath. find and findall are present in lxml to provide compatibility with other implementations of ElementTree. These methods do not implement the entire XPath language. To employ XPath expressions containing more advanced features, use the xpath method, the XPath class, or XPathEvaluator.

For example:

import io
import lxml.etree as ET

content='''\
<ACORD>
  <InsuranceSvcRq>
    <HomePolicyQuoteInqRq>
      <PersPolicy>
        <PersApplicationInfo>
            <InsuredOrPrincipal>
                <InsuredOrPrincipalInfo>
                    <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
                </InsuredOrPrincipalInfo>
                <GeneralPartyInfo>
                    <Addr>
                        <Addr1></Addr1>
                    </Addr>
                </GeneralPartyInfo>
            </InsuredOrPrincipal>
        </PersApplicationInfo>
      </PersPolicy>
    </HomePolicyQuoteInqRq>
  </InsuranceSvcRq>
</ACORD>
'''
# BytesIO needs bytes, so encode the text first (required on Python 3)
tree=ET.parse(io.BytesIO(content.encode('utf-8')))
path='//PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo'
result=tree.xpath(path)
print(result)

yields

[<Element GeneralPartyInfo at b75a8194>]

while tree.find yields

SyntaxError: invalid node predicate
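
The other two entry points mentioned above, the XPath class and XPathEvaluator, can be sketched roughly as follows. This is a minimal illustration that assumes a trimmed-down version of the sample document; the variable names (find_party, evaluator and so on) are made up for the example and are not part of lxml.

```python
import lxml.etree as ET

# Trimmed-down sample document (illustrative only); passing bytes avoids
# any str/bytes ambiguity between Python versions.
xml = b"""<ACORD>
  <PersApplicationInfo>
    <InsuredOrPrincipal>
      <InsuredOrPrincipalInfo>
        <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
      </InsuredOrPrincipalInfo>
      <GeneralPartyInfo><Addr><Addr1/></Addr></GeneralPartyInfo>
    </InsuredOrPrincipal>
  </PersApplicationInfo>
</ACORD>"""
root = ET.fromstring(xml)

path = ('./PersApplicationInfo/InsuredOrPrincipal'
        '[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]'
        '/GeneralPartyInfo')

# ET.XPath compiles the expression once; call it with a context node.
find_party = ET.XPath(path)
compiled_result = find_party(root)

# ET.XPathEvaluator is bound to a node; call it with an expression string.
evaluator = ET.XPathEvaluator(root)
evaluator_result = evaluator(path)

# find() only understands the limited ElementPath subset, so the same
# expression is expected to be rejected with a SyntaxError there.
try:
    root.find(path)
    find_error = None
except SyntaxError as exc:
    find_error = exc

print(compiled_result, evaluator_result, find_error)
```

A compiled XPath object is probably the better fit when the same expression is applied to many documents, while an evaluator is convenient when many expressions are run against one document.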

Answer 2:

Your example is perfectly fine in my opinion. I would check whether lxml's XPath implementation has any documented limitations.

Answer 3:

./PersApplicationInfo/InsuredOrPrincipal
                 [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]
                     /GeneralPartyInfo/

A few problems with this expression:

The ending / character makes it syntactically invalid. It marks the start of a new location step, but nothing follows.
As Dr. Michael Kay noticed, you may have problems with nested quotes in Python.

Suggested solution:

./PersApplicationInfo/InsuredOrPrincipal
                 [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd='AN']
                     /GeneralPartyInfo

In this expression double quotes are replaced with single quotes. The second change is the removal of the ending / character.

Update: Now that the OP has provided a more complete code sample, I am able to verify that there is nothing wrong with the actual XPath expression used. Below is its verification with XSLT:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <xsl:template match="/*">
  <xsl:copy-of select=
  './InsuranceSvcRq/HomePolicyQuoteInqRq/PersPolicy
                 /PersApplicationInfo/InsuredOrPrincipal
                     [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]
                                                   /GeneralPartyInfo/Addr/Addr1'/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<ACORD>
    <InsuranceSvcRq>
        <HomePolicyQuoteInqRq>
            <PersPolicy>
                <PersApplicationInfo>
                    <InsuredOrPrincipal>
                        <InsuredOrPrincipalInfo>
                            <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
                        </InsuredOrPrincipalInfo>
                        <GeneralPartyInfo>
                            <Addr>
                                <Addr1></Addr1>
                            </Addr>
                        </GeneralPartyInfo>
                    </InsuredOrPrincipal>
                </PersApplicationInfo>
            </PersPolicy>
        </HomePolicyQuoteInqRq>
    </InsuranceSvcRq>
</ACORD>

the wanted, correct result is produced:

<Addr1 />

Conclusion: The problem is either in how the expression is used from the Python code or (less likely) a bug in the XPath engine being used.

Answer 4:

The XPath you were given is perfectly correct. Perhaps the problem arose with embedding it in Python, where you will need to use Python escape conventions to escape the double-quotes in a character string?
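
To make the quoting point concrete, here is a small sketch of three ways the predicate could be written in Python source. The variable names are hypothetical and the path is shortened for readability. Note that in the question's code the expression is already wrapped in single quotes, so its inner double quotes need no escaping.

```python
# Outer single quotes: the inner double quotes need no escaping.
single_outer = './InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]'

# Outer double quotes: the inner double quotes must be backslash-escaped.
escaped = "./InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd=\"AN\"]"

# XPath string literals may use either quote character, so single quotes
# inside the expression are another way to avoid escaping.
xpath_single = "./InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd='AN']"

# The first two spellings produce exactly the same Python string.
print(single_outer == escaped)
```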

Source: https://stackoverflow.com/questions/6218126/xpath-predicate-with-sub-paths-with-lxml

Tags

python

xml

xpath

lxml