Remove namespace and prefix from xml in python using lxml

旧街凉风 提交于 2019-11-26 09:01:20

问题


I have an xml file I need to open and make some changes to, one of those changes is to remove the namespace and prefix and then save to another file. Here is the xml:

<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<package xmlns=\"http://apple.com/itunes/importer\">
  <provider>some data</provider>
  <language>en-GB</language>
</package>

I can make the other changes I need, but can\'t find out how to remove the namespace and prefix. This is the reusklt xml I need:

<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<package>
  <provider>some data</provider>
  <language>en-GB</language>
</package>

And here is my script which will open and parse the xml and save it:

metadata = \'/Users/user1/Desktop/Python/metadata.xml\'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
tree.write(\'/Users/user1/Desktop/Python/done.xml\', pretty_print = True, xml_declaration = True, encoding = \'UTF-8\')

So how would I add code in my script which will remove the namespace and prefix?


回答1:


Replace tag as Uku Loskit suggests. In addition to that, use lxml.objectify.deannotate.

from lxml import etree, objectify

metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()

####    
for elem in root.getiterator():
    if not hasattr(elem.tag, 'find'): continue  # (1)
    i = elem.tag.find('}')
    if i >= 0:
        elem.tag = elem.tag[i+1:]
objectify.deannotate(root, cleanup_namespaces=True)
####

tree.write('/Users/user1/Desktop/Python/done.xml',
           pretty_print=True, xml_declaration=True, encoding='UTF-8')

UPDATE

Some tags like Comment return a function when accessing tag attribute. added a guard for that. (1)




回答2:


>>> root.tag
'{http://latest/nmc-omc/cmNrm.doc#measCollec}measCollecFile'
>>> etree.QName(root.tag).localname
'measCollecFile'

source

Addendum: lxml.etree.QName also accepts elements on construction. So etree.QName(root.tag).localnameis equivalent to:

etree.QName(root).localname



回答3:


import xml.etree.ElementTree as ET
def remove_namespace(doc, namespace):
    """Remove namespace in the passed document in place."""
    ns = u'{%s}' % namespace
    nsl = len(ns)
    for elem in doc.getiterator():
        if elem.tag.startswith(ns):
            elem.tag = elem.tag[nsl:]

metadata = '/Users/user1/Desktop/Python/metadata.xml'
tree = ET.parse(metadata)
root = tree.getroot()

remove_namespace(root, u'http://apple.com/itunes/importer')
tree.write('/Users/user1/Desktop/Python/done.xml',
       pretty_print=True, xml_declaration=True, encoding='UTF-8')

Used a snippet of code from here This method could be easily extended to delete any namespace attributes by searching for tags that begin with "xmlns"




回答4:


all you need to do is:

objectify.deannotate(root, cleanup_namespaces=True)

after you have get the root, by using root = tree.getroot()




回答5:


You could also use XSLT to strip the namespaces...

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*" priority="1">
    <xsl:element name="{local-name()}" namespace="">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}" namespace="">
      <xsl:value-of select="."/>
    </xsl:attribute>
  </xsl:template>

</xsl:stylesheet>

Python

from lxml import etree

tree = etree.parse("metadata.xml")
xslt = etree.parse("test.xsl")

new_tree = tree.xslt(xslt)

print(etree.tostring(new_tree, pretty_print=True, xml_declaration=True, 
                     encoding="UTF-8").decode("UTF-8"))

Output

<?xml version='1.0' encoding='UTF-8'?>
<package>
  <provider>some data</provider>
  <language>en-GB</language>
</package>



回答6:


Here are two other ways of removing namespaces. The first uses the lxml.etree.QName helper while the second uses regexes. Both functions allow an optional list of namespaces to match against. If no namespace list is supplied then all namespaces are removed. Attribute keys are also cleaned.

from lxml import etree
import re

def remove_namespaces_qname(doc, namespaces=None):

    for el in doc.getiterator():

        # clean tag
        q = etree.QName(el.tag)
        if q is not None:
            if namespaces is not None:
                if q.namespace in namespaces:
                    el.tag = q.localname
            else:
                el.tag = q.localname

            # clean attributes
            for a, v in el.items():
                q = etree.QName(a)
                if q is not None:
                    if namespaces is not None:
                        if q.namespace in namespaces:
                            del el.attrib[a]
                            el.attrib[q.localname] = v
                    else:
                        del el.attrib[a]
                        el.attrib[q.localname] = v
    return doc


def remove_namespace_re(doc, namespaces=None):

    if namespaces is not None:
        ns = list(map(lambda n: u'{%s}' % n, namespaces))

    for el in doc.getiterator():

        # clean tag
        m = re.match(r'({.+})(.+)', el.tag)
        if m is not None:
            if namespaces is not None:
                if m.group(1) in ns:
                    el.tag = m.group(2)
            else:
                el.tag = m.group(2)

            # clean attributes
            for a, v in el.items():
                m = re.match(r'({.+})(.+)', a)
                if m is not None:
                    if namespaces is not None:
                        if m.group(1) in ns:
                            del el.attrib[a]
                            el.attrib[m.group(2)] = v
                    else:
                        del el.attrib[a]
                        el.attrib[m.group(2)] = v
    return doc


来源:https://stackoverflow.com/questions/18159221/remove-namespace-and-prefix-from-xml-in-python-using-lxml

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!