XPath predicate with sub-paths with lxml?

天大地大妈咪最大 posted on 2019-12-04 04:37:12

Change tree.find to tree.xpath. find and findall are present in lxml to provide compatibility with other implementations of ElementTree. These methods do not implement the entire XPath language. To employ XPath expressions containing more advanced features, use the xpath method, the XPath class, or XPathEvaluator.

For example:

import io
import lxml.etree as ET

content='''\
<ACORD>
  <InsuranceSvcRq>
    <HomePolicyQuoteInqRq>
      <PersPolicy>
        <PersApplicationInfo>
            <InsuredOrPrincipal>
                <InsuredOrPrincipalInfo>
                    <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
                </InsuredOrPrincipalInfo>
                <GeneralPartyInfo>
                    <Addr>
                        <Addr1></Addr1>
                    </Addr>
                </GeneralPartyInfo>
            </InsuredOrPrincipal>
        </PersApplicationInfo>
      </PersPolicy>
    </HomePolicyQuoteInqRq>
  </InsuranceSvcRq>
</ACORD>
'''
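# BytesIO needs bytes, so encode the str first (required on Python 3)
tree = ET.parse(io.BytesIO(content.encode('utf-8')))
# The predicate contains a sub-path, which only the full XPath engine accepts
path = '//PersApplicationInfo/InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo'
result = tree.xpath(path)
print(result)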

yields

[<Element GeneralPartyInfo at b75a8194>]

while tree.find yields

SyntaxError: invalid node predicate
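
If switching to xpath is not an option, the same filter can be approximated with the limited path syntax that find and findall do support, by moving the predicate test into Python. The sketch below is only an illustration under a few assumptions: it uses the standard library's xml.etree.ElementTree (lxml.etree exposes the same find, findall and findtext methods), a trimmed-down sample document with a second, made-up role code XX, and a hypothetical helper named party_info_for_role.

```python
import xml.etree.ElementTree as ET  # stdlib; lxml.etree offers the same find/findall API

# Trimmed sample; the second InsuredOrPrincipal (role "XX") is invented for contrast.
XML = """<PersApplicationInfo>
  <InsuredOrPrincipal>
    <InsuredOrPrincipalInfo>
      <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
    </InsuredOrPrincipalInfo>
    <GeneralPartyInfo><Addr><Addr1/></Addr></GeneralPartyInfo>
  </InsuredOrPrincipal>
  <InsuredOrPrincipal>
    <InsuredOrPrincipalInfo>
      <InsuredOrPrincipalRoleCd>XX</InsuredOrPrincipalRoleCd>
    </InsuredOrPrincipalInfo>
    <GeneralPartyInfo/>
  </InsuredOrPrincipal>
</PersApplicationInfo>"""


def party_info_for_role(root, role):
    """Hypothetical helper: GeneralPartyInfo elements whose parent has the given role code."""
    matches = []
    for principal in root.iter("InsuredOrPrincipal"):
        # A plain child path (no predicate) is within the supported subset.
        code = principal.findtext("InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd")
        if code == role:
            matches.extend(principal.findall("GeneralPartyInfo"))
    return matches


root = ET.fromstring(XML)
result = party_info_for_role(root, "AN")
print([el.tag for el in result])
```

This trades one XPath expression for a short loop, so tree.xpath remains the simpler choice whenever lxml is available.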

Your example looks perfectly fine to me. I would check whether lxml's XPath implementation has any documented limitations.

./PersApplicationInfo/InsuredOrPrincipal
                 [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]
                     /GeneralPartyInfo/

A few problems with this expression:

  1. The ending / character makes it syntactically invalid. It marks the start of a new location step, but nothing follows.

  2. As Dr. Michael Kay noticed, you may have problems with nested quotes in Python.

Suggested solution:

./PersApplicationInfo/InsuredOrPrincipal
                 [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd='AN']
                     /GeneralPartyInfo

In this expression double quotes are replaced with single quotes. The second change is the removal of the ending / character.
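
To make the quoting point concrete, here is a small sketch of how the same expression can be written as a Python string literal. The variable names are illustrative only. XPath accepts either quote character around a string literal, so the simplest approach is usually to use one kind for Python and the other kind inside the expression.

```python
# Python single quotes outside, XPath double quotes inside: no escaping needed.
single_outside = '//InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]/GeneralPartyInfo'

# Python double quotes outside: the inner double quotes must be backslash-escaped.
escaped = "//InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd=\"AN\"]/GeneralPartyInfo"

# Python double quotes outside, XPath single quotes inside: again no escaping.
xpath_single_quotes = "//InsuredOrPrincipal[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd='AN']/GeneralPartyInfo"

# The first two literals produce the identical string.
print(single_outside == escaped)

# The third differs only in which quote character XPath sees.
print(xpath_single_quotes == single_outside.replace('"', "'"))
```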

Update: Now that the OP has provided a more complete code sample, I am able to verify that there is nothing wrong with the actual XPath expression used. Below is its verification with XSLT:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <xsl:template match="/*">
  <xsl:copy-of select=
  './InsuranceSvcRq/HomePolicyQuoteInqRq/PersPolicy
                 /PersApplicationInfo/InsuredOrPrincipal
                     [InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd="AN"]
                                                   /GeneralPartyInfo/Addr/Addr1'/>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied to the provided XML document:

<ACORD>
    <InsuranceSvcRq>
        <HomePolicyQuoteInqRq>
            <PersPolicy>
                <PersApplicationInfo>
                    <InsuredOrPrincipal>
                        <InsuredOrPrincipalInfo>
                            <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
                        </InsuredOrPrincipalInfo>
                        <GeneralPartyInfo>
                            <Addr>
                                <Addr1></Addr1>
                            </Addr>
                        </GeneralPartyInfo>
                    </InsuredOrPrincipal>
                </PersApplicationInfo>
            </PersPolicy>
        </HomePolicyQuoteInqRq>
    </InsuranceSvcRq>
</ACORD>

the wanted, correct result is produced:

<Addr1 />

Conclusion: The problem is either in how the Python code uses the expression or (less likely) a bug in the XPath engine being used.

The XPath you were given is perfectly correct. Perhaps the problem arose with embedding it in Python, where you will need to use Python escape conventions to escape the double-quotes in a character string?
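
One way to sidestep quote escaping entirely is to pass the value as an XPath variable rather than embedding it in the expression. The sketch below assumes lxml is installed and uses its XPath class, which accepts variables as keyword arguments. The shortened sample document and the names find_party and role are illustrative choices, not part of the original question.

```python
import lxml.etree as ET

# Shortened version of the document from the question.
content = b"""<ACORD>
  <PersApplicationInfo>
    <InsuredOrPrincipal>
      <InsuredOrPrincipalInfo>
        <InsuredOrPrincipalRoleCd>AN</InsuredOrPrincipalRoleCd>
      </InsuredOrPrincipalInfo>
      <GeneralPartyInfo><Addr><Addr1/></Addr></GeneralPartyInfo>
    </InsuredOrPrincipal>
  </PersApplicationInfo>
</ACORD>"""

root = ET.fromstring(content)

# $role is an XPath variable, so the expression itself contains no quotes.
find_party = ET.XPath(
    "//InsuredOrPrincipal"
    "[InsuredOrPrincipalInfo/InsuredOrPrincipalRoleCd = $role]"
    "/GeneralPartyInfo"
)

# The variable is bound by keyword argument at call time.
result = find_party(root, role="AN")
print([el.tag for el in result])
```

Compiling the expression once with the XPath class also lets it be reused with different role values.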
