I am trying to get the whole content between an opening XML tag and its closing counterpart.
Getting the content in straightforward cases like title below is easy.
I like @Marcin's solution above; however, I found that when using his second option (converting a sub-node, not the root of the tree), it does not handle entities.
His code from above (modified to add an entity):
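from lxml import etree

# The <root> and <p> element names are placeholders; only <text> matters below.
t = etree.XML("""<root>
<p>Some testing stuff</p>
<text>this &amp; that.</text>
</root>""")
e = t.xpath('//text')[0]
# e.text is already decoded by the parser, so the entity is lost here
print((e.text + ''.join(etree.tostring(c, encoding='unicode') for c in e)).strip())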
returns:
this & that.
with a bare, unescaped '&' character instead of a proper entity ('&amp;').
My solution was to call etree.tostring at the node level (instead of on all children), then strip off the starting and ending tags using a regular expression:
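import re
from lxml import etree

# The <root> and <p> element names are placeholders; only <text> matters below.
t = etree.XML("""<root>
<p>Some testing stuff</p>
<text>this &amp; that.</text>
</root>""")
e = t.xpath('//text')[0]
# Serialize the node itself so that entities stay escaped
xml = etree.tostring(e, encoding='unicode')
# Drop the opening tag and the final closing tag, keeping only the inner markup
inner = re.match(r'<[^>]*>(.*)</[^>]*>\s*$', xml, flags=re.DOTALL).group(1)
print(inner)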
produces:
this &amp; that.
I used re.DOTALL to ensure this works for XML containing newlines.
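To illustrate the re.DOTALL point, here is a minimal sketch that applies a tag-stripping pattern of this kind to hand-written strings, so it runs without lxml. The helper name strip_outer_tag and both sample strings are made up for illustration; they stand in for what etree.tostring would return for a node.

```python
import re

# Skip the opening tag, capture everything up to the final closing tag.
PATTERN = r'<[^>]*>(.*)</[^>]*>\s*$'

def strip_outer_tag(xml, flags=0):
    """Return the text between the outermost tags, or None if nothing matches."""
    m = re.match(PATTERN, xml, flags=flags)
    return m.group(1) if m else None

# Hypothetical serialized nodes (stand-ins for etree.tostring output)
single_line = '<text>this &amp; that.</text>\n'
multi_line = '<text>this &amp;\nthat.</text>\n'

print(strip_outer_tag(single_line))                  # this &amp; that.
print(strip_outer_tag(multi_line))                   # None: '.' does not cross the newline
print(repr(strip_outer_tag(multi_line, re.DOTALL)))  # 'this &amp;\nthat.'
```

Without the flag, '.' stops at the first newline, so the match fails and calling .group(1) on the result would raise an AttributeError. Returning None from the helper makes that case easier to check for.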