How to create <!DOCTYPE> with Python's cElementTree

Submitted by 落爺英雄遲暮 on 2019-11-26 12:48:26

Question


I tried to use the answer to this question, but can't make it work: How to create "virtual root" with Python's ElementTree?

Here's my code:

import xml.etree.cElementTree as ElementTree
from StringIO import StringIO
s = '<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE tmx SYSTEM "tmx14a.dtd" ><tmx version="1.4a" />'
tree = ElementTree.parse(StringIO(s)).getroot()
header = ElementTree.SubElement(tree,'header',{'adminlang': 'EN',})
body = ElementTree.SubElement(tree,'body')
ElementTree.ElementTree(tree).write('myfile.tmx','UTF-8')

When I open the resulting 'myfile.tmx' file, it contains this:

<?xml version='1.0' encoding='UTF-8'?>
<tmx version="1.4a"><header adminlang="EN" /><body /></tmx>

What am I missing? Or is there a better tool?
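
The behaviour described above can be reproduced on Python 3 with a minimal sketch, assuming only the standard library's xml.etree.ElementTree. The parser accepts the DOCTYPE but does not keep it in the element tree, so the serializer has nothing to write back. The variable names here are illustrative and not taken from the question.

```python
import xml.etree.ElementTree as ET

# Same document as in the question, passed as bytes.
source = (b'<?xml version="1.0" encoding="UTF-8" ?>'
          b'<!DOCTYPE tmx SYSTEM "tmx14a.dtd" >'
          b'<tmx version="1.4a" />')

root = ET.fromstring(source)
ET.SubElement(root, "header", {"adminlang": "EN"})
ET.SubElement(root, "body")

# encoding="unicode" makes tostring() return a str instead of bytes.
serialized = ET.tostring(root, encoding="unicode")

# The elements survive the round trip; the DOCTYPE does not.
print("DOCTYPE" in serialized)
print(serialized)
```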


Answer 1:


You could use lxml and its tostring function:

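from lxml import etree

# Bytes input: on Python 3, lxml rejects str input that carries an encoding declaration.
s = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4a"/>"""

tree = etree.fromstring(s)
header = etree.SubElement(tree, 'header', {'adminlang': 'EN'})
body = etree.SubElement(tree, 'body')

# tostring() returns bytes when an encoding is given, so decode before printing.
print(etree.tostring(tree, encoding="UTF-8",
                     xml_declaration=True,
                     pretty_print=True,
                     doctype='<!DOCTYPE tmx SYSTEM "tmx14a.dtd">').decode("UTF-8"))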

=>

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
  <header adminlang="EN"/>
  <body/>
</tmx>



Answer 2:


You could pass xml_declaration=False to the write function so that the output has no XML declaration, then write whatever header you need by hand before the tree. In fact, if you pass the encoding as 'utf-8' (lowercase), the XML declaration is not added either.

import xml.etree.cElementTree as ElementTree

tree = ElementTree.Element('tmx', {'version': '1.4a'})
ElementTree.SubElement(tree, 'header', {'adminlang': 'EN'})
ElementTree.SubElement(tree, 'body')

with open('myfile.tmx', 'wb') as f:
    f.write('<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE tmx SYSTEM "tmx14a.dtd">'.encode('utf8'))
    ElementTree.ElementTree(tree).write(f, 'utf-8')

Resulting file (newlines added manually for readability):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE tmx SYSTEM "tmx14a.dtd">
<tmx version="1.4a">
    <header adminlang="EN" />
    <body />
</tmx>
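
The same idea can be wrapped in a small helper for Python 3. This is only a sketch under a few assumptions: write_with_doctype and XML_DECLARATION are hypothetical names, the output goes to a temporary directory, and the import is xml.etree.ElementTree because xml.etree.cElementTree was removed in Python 3.9. xml_declaration=False is passed explicitly so the result does not depend on how the encoding is spelled.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'


def write_with_doctype(root, path, doctype):
    """Write the XML declaration, the DOCTYPE line, then the serialized tree."""
    with open(path, "wb") as f:
        f.write(XML_DECLARATION.encode("utf-8"))
        f.write((doctype + "\n").encode("utf-8"))
        # Suppress ElementTree's own declaration; it was written by hand above.
        ET.ElementTree(root).write(f, encoding="utf-8", xml_declaration=False)


root = ET.Element("tmx", {"version": "1.4a"})
ET.SubElement(root, "header", {"adminlang": "EN"})
ET.SubElement(root, "body")

out_path = os.path.join(tempfile.mkdtemp(), "myfile.tmx")
write_with_doctype(root, out_path, '<!DOCTYPE tmx SYSTEM "tmx14a.dtd">')

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

The helper does not validate the DOCTYPE string, so the caller has to make sure its root name matches the root element.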



Answer 3:


I used a different solution to add the DOCTYPE: very simple, very stupid.

import xml.etree.ElementTree as ET

# root is the Element to serialize; path_file is the output path.
with open(path_file, "w", encoding='UTF-8') as xf:
    doc_type = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE dlg:window ' \
               'PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "dialog.dtd">'
    # encoding='unicode' makes tostring() return a str with no XML declaration.
    body = ET.tostring(root, encoding='unicode')
    xf.write(f"{doc_type}{body}")



Answer 4:


I couldn't find a solution to this problem using vanilla ElementTree either, and the solution proposed by demalexx produced invalid XML that my application (DITA) rejected. What I propose is a workaround that uses another module, and it works perfectly for me.

import re

# Found no clean way to specify a <!DOCTYPE ...> stanza in ElementTree, so
# we replace the existing <?xml ... ?> stanza with a full <?xml ... ?> + <!DOCTYPE ...> header.
# source_xml is the serialized document as a str; filename is the output path.
new_header = '<?xml version="1.0" encoding="UTF-8" ?>\n' \
             '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">\n'

# count=1 limits the substitution to the declaration at the top of the document.
target_xml = re.sub(r"<\?xml .+?\?>", new_header, source_xml, count=1)
# Binary mode, because encode() returns bytes.
with open(filename, 'wb') as catalog_file:
    catalog_file.write(target_xml.encode('utf8'))
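
For comparison, the standard library's xml.dom.minidom can hold a DOCTYPE node in the document itself, which avoids patching the serialized text. The sketch below is not part of the answer above. It rebuilds the tmx document from the question and assumes Python 3. Note that minidom serializes the DOCTYPE with its own spacing and single quotes, so the output is not byte-for-byte identical to a hand-written header.

```python
import xml.etree.ElementTree as ET
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

# createDocumentType(qualifiedName, publicId, systemId)
doctype = impl.createDocumentType("tmx", None, "tmx14a.dtd")
# createDocument(namespaceURI, qualifiedName, doctype) also creates the root element.
doc = impl.createDocument(None, "tmx", doctype)

root = doc.documentElement
root.setAttribute("version", "1.4a")

header = doc.createElement("header")
header.setAttribute("adminlang", "EN")
root.appendChild(header)
root.appendChild(doc.createElement("body"))

# With an explicit encoding, toxml() returns bytes including the XML declaration.
xml_bytes = doc.toxml(encoding="UTF-8")
print(xml_bytes.decode("UTF-8"))

# Sanity check: the result still parses with ElementTree.
reparsed = ET.fromstring(xml_bytes)
```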


Source: https://stackoverflow.com/questions/8868248/how-to-create-doctype-with-pythons-celementtree
