lxml etree xmlparser remove unwanted namespace

感动是毒 2020-11-29 00:04

I have an XML document that I am trying to parse using lxml.etree.


  
4 Answers
  • 2020-11-29 00:08
    import io
    import lxml.etree as ET
    
    content='''\
    <Envelope xmlns="http://www.example.com/zzz/yyy">
      <Header>
        <Version>1</Version>
      </Header>
      <Body>
        some stuff
      </Body>
    </Envelope>
    '''    
    dom = ET.parse(io.BytesIO(content.encode("utf-8")))  # BytesIO needs bytes, not str
    

    You can find namespace-aware nodes using the xpath method:

    body=dom.xpath('//ns:Body',namespaces={'ns':'http://www.example.com/zzz/yyy'})
    print(body)
    # [<Element {http://www.example.com/zzz/yyy}Body at 90b2d4c>]
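
    The same lookup also works with find(), either with the namespace URI written in braces (Clark notation) or with a prefix mapping. The following is a minimal sketch against a condensed copy of the sample document; the prefix "ns" and the variable names are arbitrary choices, not something the question prescribes.

```python
import lxml.etree as ET

NS = "http://www.example.com/zzz/yyy"  # namespace URI from the sample document

root = ET.fromstring(
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b"<Header><Version>1</Version></Header>"
    b"<Body>some stuff</Body>"
    b"</Envelope>"
)

# Clark notation: the namespace URI in braces, followed by the local name.
body = root.find("{%s}Body" % NS)

# Equivalent lookup with a prefix mapping; the prefix "ns" is arbitrary.
same_body = root.find("ns:Body", namespaces={"ns": NS})

print(body.tag)
print(body.text)
```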
    

    If you really want to remove namespaces, you could use an XSL transformation:

    # http://wiki.tei-c.org/index.php/Remove-Namespaces.xsl
    xslt='''<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="no"/>
    
    <xsl:template match="/|comment()|processing-instruction()">
        <xsl:copy>
          <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="*">
        <xsl:element name="{local-name()}">
          <xsl:apply-templates select="@*|node()"/>
        </xsl:element>
    </xsl:template>
    
    <xsl:template match="@*">
        <xsl:attribute name="{local-name()}">
          <xsl:value-of select="."/>
        </xsl:attribute>
    </xsl:template>
    </xsl:stylesheet>
    '''
    
    xslt_doc = ET.parse(io.BytesIO(xslt.encode("utf-8")))  # BytesIO needs bytes, not str
    transform=ET.XSLT(xslt_doc)
    dom=transform(dom)
    

    Here we see the namespace has been removed:

    print(ET.tostring(dom, encoding="unicode"))  # str output instead of a bytes repr
    # <Envelope>
    #   <Header>
    #     <Version>1</Version>
    #   </Header>
    #   <Body>
    #     some stuff
    #   </Body>
    # </Envelope>
    

    So you can now find the Body node this way:

    print(dom.find("Body"))
    # <Element Body at 8506cd4>
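
    If XSLT feels heavy for this, a similar result can probably be reached in plain Python by rewriting each tag to its local name. This is only a sketch: strip_namespaces is a hypothetical helper name, not part of lxml, and it assumes that collapsing names from different namespaces onto the same local name is acceptable for your document.

```python
import lxml.etree as ET


def strip_namespaces(root):
    """Rewrite element and attribute names to their local names, in place."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if not isinstance(el.tag, str):
            continue
        el.tag = ET.QName(el).localname
        # Copy the items first, because the attribute mapping is modified below.
        for name, value in list(el.attrib.items()):
            local = ET.QName(name).localname
            if local != name:
                del el.attrib[name]
                el.attrib[local] = value
    # Drop namespace declarations that are no longer referenced.
    ET.cleanup_namespaces(root)
    return root


root = ET.fromstring(
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b"<Header><Version>1</Version></Header>"
    b"<Body>some stuff</Body>"
    b"</Envelope>"
)
strip_namespaces(root)
print(ET.tostring(root, encoding="unicode"))
print(root.find("Body"))
```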
    
  • 2020-11-29 00:12

    Try using XPath:

    dom.xpath("//*[local-name() = 'Body']")
    

    Taken (and simplified) from this page, under the "The xpath() method" section.
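
    One thing to keep in mind is that local-name() matches an element of that name in any namespace. The sketch below uses a made-up document with two hypothetical namespace URIs (http://example.com/a and http://example.com/b) to show this, and how namespace-uri() can narrow the match again.

```python
import lxml.etree as ET

# Hypothetical document: two Body elements in two different namespaces.
doc = ET.fromstring(
    b'<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">'
    b"<a:Body>first</a:Body>"
    b"<b:Body>second</b:Body>"
    b"</root>"
)

# local-name() ignores the namespace, so both elements match.
matches = doc.xpath("//*[local-name() = 'Body']")
texts = [el.text for el in matches]

# Adding namespace-uri() restricts the match to one namespace.
narrowed = doc.xpath(
    "//*[local-name() = 'Body' and namespace-uri() = 'http://example.com/a']"
)

print(texts)
print([el.text for el in narrowed])
```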

  • 2020-11-29 00:14

    What you're showing is the result of the repr() call, which prints the namespace as part of the tag. When you move through the tree programmatically, you can simply choose to ignore the namespace.
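
    One way to do that, sketched here under the assumption that only local names matter to you, is to compare against ET.QName(el).localname while iterating. The sample document is a condensed copy of the one used in the first answer.

```python
import lxml.etree as ET

root = ET.fromstring(
    b'<Envelope xmlns="http://www.example.com/zzz/yyy">'
    b"<Header><Version>1</Version></Header>"
    b"<Body>some stuff</Body>"
    b"</Envelope>"
)

# Collect local names in document order, skipping comments and
# processing instructions (their tag is not a string).
local_names = [
    ET.QName(el).localname for el in root.iter() if isinstance(el.tag, str)
]

# Pick the first element whose local name is "Body", whatever its namespace.
body = next(
    el
    for el in root.iter()
    if isinstance(el.tag, str) and ET.QName(el).localname == "Body"
)

print(local_names)
print(body.text)
```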

  • 2020-11-29 00:15

    The last solution from https://bitbucket.org/olauzanne/pyquery/issue/17 can help you avoid namespaces with little effort:

    Apply xml.replace(' xmlns:', ' xmlnamespace:') to your XML before using pyquery, so that lxml will ignore namespaces.

    In your case, try xml.replace(' xmlns="', ' xmlnamespace="'). However, you might need something more complex if that string can also appear in element content.
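
    For the default-namespace case, the idea can be sketched as follows. This is a crude text-level workaround rather than a namespace-aware one: it assumes the string ' xmlns="' occurs only in the declaration, and the variable names are purely illustrative.

```python
import lxml.etree as ET

xml = (
    '<Envelope xmlns="http://www.example.com/zzz/yyy">'
    "<Header><Version>1</Version></Header>"
    "<Body>some stuff</Body>"
    "</Envelope>"
)

# Rename the default namespace declaration so the parser treats it as an
# ordinary attribute; the elements then end up in no namespace at all.
patched = xml.replace(' xmlns="', ' xmlnamespace="')

root = ET.fromstring(patched)
body = root.find("Body")  # plain name, no namespace needed

print(root.tag)
print(body.text)
```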
