I am trying to get the whole content between an opening XML tag and its closing counterpart.
Getting the content in straightforward cases like title below is easy.
That is quite easy with lxml*, using the parse() and tostring() functions:
from lxml.etree import parse, tostring
First you parse the doc and get your element (I am using XPath, but you can use whatever you want):
doc = parse('test.xml')
element = doc.xpath('//text')[0]
The tostring() function returns a text representation of your element:
>>> tostring(element)
'<text>Some text with data in it.</text>\n'
However, you do not want the enclosing tags, so we can remove them with a simple str.replace() call:
>>> tostring(element).replace('<%s>'%element.tag, '', 1)
'Some text with data in it.</text>\n'
Note that str.replace() received 1 as the third parameter, so it removes only the first occurrence of the opening tag. The closing tag can be removed the same way. This time, instead of 1, we pass -1 as the count, which means "no limit", so every occurrence of the closing tag is replaced:
>>> tostring(element).replace('</%s>'%element.tag, '', -1)
'<text>Some text with data in it.\n'
The solution, of course, is to do everything at once:
>>> tostring(element).replace('<%s>'%element.tag, '', 1).replace('</%s>'%element.tag, '', -1)
'Some text with data in it.\n'
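As a minimal sketch, the chained replace can be wrapped in a small function. This version assumes Python 3 and uses the standard library's xml.etree.ElementTree instead of lxml; the sample document and the name inner_xml_by_replace are hypothetical, chosen only for illustration. In Python 3, tostring() returns bytes unless encoding='unicode' is passed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the <b> child is there only for illustration.
doc = ET.fromstring('<doc><text>Some <b>text</b> with data in it.</text>\n</doc>')
element = doc.find('text')


def inner_xml_by_replace(element):
    """Serialize the element, then strip its own opening and closing tags."""
    # encoding='unicode' makes tostring() return str instead of bytes.
    serialized = ET.tostring(element, encoding='unicode')
    # Remove only the first occurrence of the opening tag.
    serialized = serialized.replace('<%s>' % element.tag, '', 1)
    # A count of -1 means "no limit": every closing tag with this name goes.
    return serialized.replace('</%s>' % element.tag, '', -1)


print(repr(inner_xml_by_replace(element)))
```

Because -1 removes every matching closing tag, this only behaves as intended when no descendant shares the element's tag name.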
EDIT: @Charles made a good point: this code is fragile since the tag can have attributes. A possible yet still limited solution is to split the string at the first >:
>>> tostring(element).split('>', 1)
['<text', 'Some text with data in it.</text>\n']
get the second resulting string:
>>> tostring(element).split('>', 1)[1]
'Some text with data in it.</text>\n'
then rsplitting it at the last </:
>>> tostring(element).split('>', 1)[1].rsplit('</', 1)
['Some text with data in it.', 'text>\n']
and finally getting the first result:
>>> tostring(element).split('>', 1)[1].rsplit('</', 1)[0]
'Some text with data in it.'
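The split/rsplit steps can be sketched as one function. As before, this assumes Python 3 and the standard library's ElementTree; the sample document, its lang attribute, and the name inner_xml_by_split are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: this time the element carries an attribute.
doc = ET.fromstring(
    '<doc><text lang="en">Some <b>text</b> with data in it.</text>\n</doc>'
)
element = doc.find('text')


def inner_xml_by_split(element):
    """Drop everything up to the first '>' and from the last '</' onwards."""
    serialized = ET.tostring(element, encoding='unicode')
    # Everything after the first '>' follows the opening tag.
    after_open = serialized.split('>', 1)[1]
    # Everything before the last '</' precedes the closing tag.
    return after_open.rsplit('</', 1)[0]


print(repr(inner_xml_by_split(element)))
```

Unlike the replace version, this also drops the trailing newline that follows the closing tag, and it does not handle an empty element that is serialized without a separate closing tag.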
Nonetheless, this code is still fragile, since > is a perfectly valid character in XML, even inside attributes.
Anyway, I have to acknowledge that MattH's solution is the real, general solution.
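For comparison, here is a sketch of one way to avoid string surgery on the tags altogether: join the element's leading text with the serialization of each child. This is not necessarily what MattH's answer does; it is only an illustration, assuming Python 3 and the standard library, and the name inner_xml and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample: the attribute value contains a literal '>'.
element = ET.fromstring('<text title="a > b">1 &lt; 2 <b>bold</b> tail</text>')


def inner_xml(element):
    """Return the markup between the element's opening and closing tags."""
    # element.text holds the text before the first child; the parser has
    # already unescaped it, so it must be escaped again.
    parts = [escape(element.text or '')]
    for child in element:
        # tostring() on a child also emits the text that follows it (its tail).
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)


print(inner_xml(element))
```

Since the element's own tags are never serialized, attributes and special characters in the opening tag cannot interfere with the result.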
* Actually, this solution works with ElementTree, too, which is great if you do not want to depend on lxml. The only difference is that you will not have full XPath support, since ElementTree implements only a limited subset of it.