lxml etree xmlparser remove unwanted namespace

Posted by 丶灬走出姿态 on 2019-11-27 00:20:57
unutbu
import io
import lxml.etree as ET

# Bytes literal: io.BytesIO requires bytes on Python 3.
content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''    
dom = ET.parse(io.BytesIO(content))

You can find namespace-aware nodes using the xpath method:

body = dom.xpath('//ns:Body', namespaces={'ns': 'http://www.example.com/zzz/yyy'})
print(body)
# [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
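
The same namespace-aware lookup also works with find(). As a minimal sketch (the prefix 'ns' and the variable names are arbitrary choices for this example, not part of the original answer), either spell the tag in Clark notation or pass a prefix mapping:

```python
import lxml.etree as ET

NS = 'http://www.example.com/zzz/yyy'
content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.fromstring(content)

# Clark notation: the namespace URI in braces, followed by the local name.
body_clark = root.find('{%s}Body' % NS)

# Prefix mapping: 'ns' is an arbitrary prefix chosen for this example.
body_prefixed = root.find('ns:Body', namespaces={'ns': NS})

print(body_clark.tag)           # {http://www.example.com/zzz/yyy}Body
print(body_clark.text.strip())  # some stuff
```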

If you really want to remove namespaces, you could use an XSL transformation:

# http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
xslt = b'''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<xsl:template match="/|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
</xsl:template>

<xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
</xsl:template>

<xsl:template match="@*">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="."/>
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

xslt_doc = ET.parse(io.BytesIO(xslt))
transform = ET.XSLT(xslt_doc)
dom = transform(dom)

Here we see the namespace has been removed:

print(ET.tostring(dom).decode())  # decode so Python 3 prints text, not a bytes repr
# <Envelope>
#   <Header>
#     <Version>1</Version>
#   </Header>
#   <Body>
#     some stuff
#   </Body>
# </Envelope>

So you can now find the Body node this way:

print(dom.find("Body"))
# <Element Body at 8506cd4>
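
If pulling in XSLT feels heavy, a commonly used alternative is to rewrite the tags in Python. This is only a sketch, not the method from the answer above: it swaps the stylesheet for a loop over the tree, and unlike the stylesheet it leaves attribute names untouched.

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.fromstring(content)

for el in root.iter():
    # Comments and processing instructions have a non-string tag; skip them.
    if isinstance(el.tag, str):
        # Replace '{uri}name' with just 'name'.
        el.tag = ET.QName(el).localname

# Drop the now-unused xmlns declaration from the serialized output.
ET.cleanup_namespaces(root)

print(root.tag)                         # Envelope
print(root.find('Body').text.strip())   # some stuff
```
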
dusan

Try using XPath:

dom.xpath("//*[local-name() = 'Body']")

Taken (and simplified) from this page, under "The xpath() method" section
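
For reference, a self-contained run of that expression against the Envelope document from the question might look like this (a sketch; the variable names are made up for the example). Note that the matched element keeps its namespace; local-name() only ignores it during matching:

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.fromstring(content)

# Match on the local part of the name, whatever namespace the element is in.
matches = root.xpath("//*[local-name() = 'Body']")

print(len(matches))     # 1
print(matches[0].tag)   # {http://www.example.com/zzz/yyy}Body
```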

The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 can help you avoid namespaces with little effort:

Apply xml.replace(' xmlns:', ' xmlnamespace:') to your XML before using pyquery, so that lxml will ignore namespaces.

In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, you might need something more complex if that string can also occur in element text, since a plain replace would rewrite it there as well.
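
A rough sketch of that replacement applied to the Envelope document, parsed directly with lxml rather than pyquery (the variable names are illustrative). The default-namespace declaration turns into an ordinary attribute, so the elements end up with plain tags:

```python
import lxml.etree as ET

xml = '''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''

# Turn the namespace declaration into a harmless, ordinary attribute.
patched = xml.replace(' xmlns="', ' xmlnamespace="')
root = ET.fromstring(patched)

print(root.tag)                         # Envelope
print(root.get('xmlnamespace'))         # http://www.example.com/zzz/yyy
print(root.find('Body').text.strip())   # some stuff
```

This is a text-level hack: it works only as long as the replaced string appears nowhere except in the declaration.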

You're showing the result of the repr() call. When you programmatically move through the tree, you can simply choose to ignore the namespace.
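
One way to do that, sketched here with a hypothetical helper named local_name (not an lxml function), is to compare only the local part of each tag while walking the tree, leaving the document itself unchanged:

```python
import lxml.etree as ET

content = b'''\
<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header>
    <Version>1</Version>
  </Header>
  <Body>
    some stuff
  </Body>
</Envelope>
'''
root = ET.fromstring(content)


def local_name(el):
    """Return the tag without its '{uri}' part (hypothetical helper)."""
    return ET.QName(el).localname


# Walk the tree and compare local names only; skip comments and PIs,
# whose .tag is not a string.
bodies = [el for el in root.iter()
          if isinstance(el.tag, str) and local_name(el) == 'Body']

print(local_name(bodies[0]))     # Body
print(bodies[0].text.strip())    # some stuff
```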
