I'm using ElementTree to parse an XML document. I am getting the text from the u
tags. Some of them have mixed content that I need to filter out or keep.
The lost text bits, "¿Sí?" and "A mí no me suena.", are available as the tail property of each
vocal element (the text that follows the element's end tag, up to the next tag).
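To make the text/tail distinction concrete, here is a minimal sketch. The sample markup is an assumption: the u, vocal and desc names come from the code further down, but the type="other" value is made up for illustration.

```python
from xml.etree import ElementTree as ET

# Hypothetical single utterance; type="other" is an assumed attribute value.
u = ET.fromstring(
    '<u>Pues...<vocal type="other"><desc>laugh</desc></vocal>'
    'A mí no me suena.</u>'
)
vocal = u.find("vocal")

text_before = u.text                # text between <u> and <vocal>
desc_text = vocal.findtext("desc")  # text inside <desc>
text_after = vocal.tail             # text between </vocal> and </u>

print(repr(text_before))
print(repr(desc_text))
print(repr(text_after))
```

The trailing text belongs to the child element (vocal.tail), not to the parent, which is why reading only u.text appears to drop it.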
Here is a way to get the wanted output (tested with Python 2.7).
Assume that vocal.xml looks like this:
eh
¿Sí?
Pues...
laugh
A mí no me suena.
Code:
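from xml.etree import ElementTree as ET

root = ET.parse("vocal.xml")

for u in root.findall(".//u"):
    v = u.find("vocal")
    # Keep the <desc> text only for filler vocals; always keep the tail.
    if v.get("type") == "filler":
        frags = [u.text, v.findtext("desc"), v.tail]
    else:
        frags = [u.text, v.tail]
    # Encode each fragment to UTF-8 (Python 2) and trim surrounding whitespace.
    print " ".join(t.encode("utf-8").strip() for t in frags).strip()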
Output:
eh ¿Sí?
Pues... A mí no me suena.
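For Python 3, the same idea might look like the sketch below. It is not the tested code above. The embedded XML is an assumed reconstruction of vocal.xml (the root wrapper and the type="other" value are guesses), and utterance_text is a hypothetical helper name. Unlike the original, it skips fragments that are None and tolerates a u element with zero or several vocal children.

```python
from xml.etree import ElementTree as ET

# Assumed reconstruction of vocal.xml: the <root> wrapper and type="other"
# are guesses; <u>, <vocal>, type="filler" and <desc> come from the code above.
SAMPLE = """<root>
<u><vocal type="filler"><desc>eh</desc></vocal>¿Sí?</u>
<u>Pues...<vocal type="other"><desc>laugh</desc></vocal>A mí no me suena.</u>
</root>"""


def utterance_text(u):
    """Return the text of one <u>, keeping <desc> only for filler vocals."""
    frags = [u.text]
    for v in u.findall("vocal"):
        if v.get("type") == "filler":
            frags.append(v.findtext("desc"))
        frags.append(v.tail)
    # Drop None and whitespace-only fragments, then join with single spaces.
    return " ".join(t.strip() for t in frags if t and t.strip())


root = ET.fromstring(SAMPLE)
lines = [utterance_text(u) for u in root.findall(".//u")]
for line in lines:
    print(line)
```

With the assumed sample, this prints the same two lines as the output shown above.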