问题
I have an xml file I need to open and make some changes to, one of those changes is to remove the namespace and prefix and then save to another file. Here is the xml:
<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<package xmlns=\"http://apple.com/itunes/importer\">
<provider>some data</provider>
<language>en-GB</language>
</package>
I can make the other changes I need, but can\'t find out how to remove the namespace and prefix. This is the reusklt xml I need:
<?xml version=\'1.0\' encoding=\'UTF-8\'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>
And here is my script which will open and parse the xml and save it:
metadata = \'/Users/user1/Desktop/Python/metadata.xml\'
from lxml import etree
parser = etree.XMLParser(remove_blank_text=True)
open(metadata)
tree = etree.parse(metadata, parser)
root = tree.getroot()
tree.write(\'/Users/user1/Desktop/Python/done.xml\', pretty_print = True, xml_declaration = True, encoding = \'UTF-8\')
So how would I add code in my script which will remove the namespace and prefix?
回答1:
Replace tag as Uku Loskit suggests. In addition to that, use lxml.objectify.deannotate.
from lxml import etree, objectify
metadata = '/Users/user1/Desktop/Python/metadata.xml'
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(metadata, parser)
root = tree.getroot()
####
for elem in root.getiterator():
if not hasattr(elem.tag, 'find'): continue # (1)
i = elem.tag.find('}')
if i >= 0:
elem.tag = elem.tag[i+1:]
objectify.deannotate(root, cleanup_namespaces=True)
####
tree.write('/Users/user1/Desktop/Python/done.xml',
pretty_print=True, xml_declaration=True, encoding='UTF-8')
UPDATE
Some tags like Comment
return a function when accessing tag
attribute. added a guard for that. (1)
回答2:
>>> root.tag
'{http://latest/nmc-omc/cmNrm.doc#measCollec}measCollecFile'
>>> etree.QName(root.tag).localname
'measCollecFile'
source
Addendum: lxml.etree.QName
also accepts elements on construction. So etree.QName(root.tag).localname
is equivalent to:
etree.QName(root).localname
回答3:
import xml.etree.ElementTree as ET
def remove_namespace(doc, namespace):
"""Remove namespace in the passed document in place."""
ns = u'{%s}' % namespace
nsl = len(ns)
for elem in doc.getiterator():
if elem.tag.startswith(ns):
elem.tag = elem.tag[nsl:]
metadata = '/Users/user1/Desktop/Python/metadata.xml'
tree = ET.parse(metadata)
root = tree.getroot()
remove_namespace(root, u'http://apple.com/itunes/importer')
tree.write('/Users/user1/Desktop/Python/done.xml',
pretty_print=True, xml_declaration=True, encoding='UTF-8')
Used a snippet of code from here This method could be easily extended to delete any namespace attributes by searching for tags that begin with "xmlns"
回答4:
all you need to do is:
objectify.deannotate(root, cleanup_namespaces=True)
after you have get the root, by using root = tree.getroot()
回答5:
You could also use XSLT to strip the namespaces...
XSLT 1.0 (test.xsl)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="*" priority="1">
<xsl:element name="{local-name()}" namespace="">
<xsl:apply-templates select="@*|node()"/>
</xsl:element>
</xsl:template>
<xsl:template match="@*">
<xsl:attribute name="{local-name()}" namespace="">
<xsl:value-of select="."/>
</xsl:attribute>
</xsl:template>
</xsl:stylesheet>
Python
from lxml import etree
tree = etree.parse("metadata.xml")
xslt = etree.parse("test.xsl")
new_tree = tree.xslt(xslt)
print(etree.tostring(new_tree, pretty_print=True, xml_declaration=True,
encoding="UTF-8").decode("UTF-8"))
Output
<?xml version='1.0' encoding='UTF-8'?>
<package>
<provider>some data</provider>
<language>en-GB</language>
</package>
回答6:
Here are two other ways of removing namespaces. The first uses the lxml.etree.QName helper while the second uses regexes. Both functions allow an optional list of namespaces to match against. If no namespace list is supplied then all namespaces are removed. Attribute keys are also cleaned.
from lxml import etree
import re
def remove_namespaces_qname(doc, namespaces=None):
for el in doc.getiterator():
# clean tag
q = etree.QName(el.tag)
if q is not None:
if namespaces is not None:
if q.namespace in namespaces:
el.tag = q.localname
else:
el.tag = q.localname
# clean attributes
for a, v in el.items():
q = etree.QName(a)
if q is not None:
if namespaces is not None:
if q.namespace in namespaces:
del el.attrib[a]
el.attrib[q.localname] = v
else:
del el.attrib[a]
el.attrib[q.localname] = v
return doc
def remove_namespace_re(doc, namespaces=None):
if namespaces is not None:
ns = list(map(lambda n: u'{%s}' % n, namespaces))
for el in doc.getiterator():
# clean tag
m = re.match(r'({.+})(.+)', el.tag)
if m is not None:
if namespaces is not None:
if m.group(1) in ns:
el.tag = m.group(2)
else:
el.tag = m.group(2)
# clean attributes
for a, v in el.items():
m = re.match(r'({.+})(.+)', a)
if m is not None:
if namespaces is not None:
if m.group(1) in ns:
del el.attrib[a]
el.attrib[m.group(2)] = v
else:
del el.attrib[a]
el.attrib[m.group(2)] = v
return doc
来源:https://stackoverflow.com/questions/18159221/remove-namespace-and-prefix-from-xml-in-python-using-lxml