Issue in reading text in XML using python

Posted by a 夏天 on 2019-12-12 05:03:47

Question


I am trying to read an XML file with the following content:

<tu creationdate="20100624T160543Z" creationid="SYSTEM" usagecount="0">
    <prop type="x-source-tags">1=A,2=B</prop>
    <prop type="x-target-tags">1=A,2=B</prop>
    <tuv xml:lang="EN">
      <seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>
    </tuv>
    <tuv xml:lang="DE">
      <seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg>
    </tuv>
  </tu>

I am using the following code:

import xml.etree.ElementTree as ET

tree = ET.parse(tmx)  # tmx is the path to the XML file
root = tree.getroot()
seg = root.findall('.//seg')
for n in seg:
    print(n.text)

It gave the following output:

Modified
Modifizierter

What I expected was:

Modified Denver Score
Modifizierter Denver -Score

Can someone explain why only part of seg is displayed?


Answer 1:


You need to be aware of the tail property, which is the text that follows an element's end tag. It is explained well here: http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.

"Denver" is the tail of the first <ut> element and " Score" is the tail of the second <ut> element. These strings are not part of the text of the <seg> element.

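To make the text/tail split concrete, here is a minimal sketch using the standard library's ElementTree on the English segment from the question. The helper name full_text is hypothetical and only illustrates how text and tail fit together; it is not part of any library.

```python
import xml.etree.ElementTree as ET

# Sample data copied from the question's English segment.
seg = ET.fromstring('<seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>')

# .text holds only the characters before the first child element.
print(repr(seg.text))

# Each child carries the characters that follow its end tag in .tail.
for ut in seg:
    print(ut.attrib["x"], repr(ut.text), repr(ut.tail))


def full_text(elem):
    """Hypothetical helper: join elem.text with each child's text and tail."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(full_text(child))   # text inside the child, recursively
        parts.append(child.tail or "")   # text after the child's end tag
    return "".join(parts)


print(full_text(seg))
```

The empty <ut/> elements have no text of their own (their .text is None), which is why everything after "Modified " lives in the tails.
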
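In addition to the solution provided by kgbplus (which works with both ElementTree and lxml), with lxml you can also use either of the following methods to get the desired output. Note that xpath() is lxml-only, while itertext() is also available in the standard library's ElementTree: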

  1. xpath()

    for n in seg:
        print("".join(n.xpath("text()")))
    
  2. itertext()

    for n in seg:
        print("".join(n.itertext()))
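
As a self-contained illustration of the itertext() approach, the sketch below uses the standard library's ElementTree rather than lxml. The XML is a trimmed copy of the question's data (the prop elements and attributes are left out), and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the question's <tu> element.
xml_doc = """<tu>
  <tuv><seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg></tuv>
  <tuv><seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg></tuv>
</tu>"""

root = ET.fromstring(xml_doc)

# itertext() yields the element's text, then the text and tail of every
# descendant in document order, so joining the pieces gives the full segment.
segments = ["".join(seg.itertext()) for seg in root.findall(".//seg")]

for s in segments:
    print(s)
```

Because the German source has no space before "-Score", the second segment comes out as "Modifizierter Denver-Score".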
    



Answer 2:


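You can use the tostring function:

import xml.etree.ElementTree as ET

tree = ET.parse(tmx)
root = tree.getroot()
seg = root.findall('.//seg')
for n in seg:
    # encoding="unicode" returns a str; without it, Python 3 returns bytes
    print(ET.tostring(n, encoding="unicode", method="text"))

The resulting string also includes the element's tail (here, the whitespace after </seg>), so you may want to strip it by changing the last line to:

print(ET.tostring(n, encoding="unicode", method="text").strip())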
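
The following sketch, assuming Python 3 and the standard library's ElementTree, shows where the extra characters come from: tostring() with method="text" appends the element's own tail, and it returns bytes unless encoding="unicode" is passed. The sample XML is a trimmed copy of the question's data.

```python
import xml.etree.ElementTree as ET

# One <tuv> from the question; the newline after </seg> becomes seg.tail.
xml_doc = """<tuv>
  <seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>
</tuv>"""

seg = ET.fromstring(xml_doc).find("seg")

# Default encoding yields bytes in Python 3.
as_bytes = ET.tostring(seg, method="text")

# encoding="unicode" yields a str; the trailing tail is still included.
raw = ET.tostring(seg, encoding="unicode", method="text")
clean = raw.strip()

print(type(as_bytes).__name__)
print(repr(raw))
print(clean)
```

Note that strip() also removes any leading or trailing whitespace that belongs to the segment itself, which may or may not be what you want for translation-memory data.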


Source: https://stackoverflow.com/questions/46221545/issue-in-reading-text-in-xml-using-python
