How do I get the whole content between two xml tags in Python?


我寻月下人不归 2020-12-15 09:13

I am trying to get the whole content between an opening XML tag and its closing counterpart.

Getting the content in straight cases like title below is easy

5 answers

春和景丽 (original poster)

2020-12-15 09:28
This is fairly easy with lxml*, using the parse() and tostring() functions:
```
from lxml.etree import parse, tostring
```
First you parse the doc and get your element (I am using XPath, but you can use whatever you want):
```
doc = parse('test.xml')
element = doc.xpath('//text')[0]
```
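The tostring() function returns a text representation of your element, including its own tags (in Python 3, pass encoding='unicode' to get a str rather than bytes):
```
>>> tostring(element, encoding='unicode')
'<text>Some text with data in it.</text>\n'
```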
However, you do not want the enclosing tags, so we can remove the opening one with a simple str.replace() call:
```
>>> tostring(element, encoding='unicode').replace('<%s>' % element.tag, '', 1)
'Some text with data in it.</text>\n'
```
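Note that str.replace() received 1 as the third parameter, so it removes only the first occurrence of the opening tag. The closing tag can be removed the same way. This time we pass -1 instead of 1; a negative count means no limit, so every occurrence of the closing tag is removed, which is only safe when no nested element shares the tag name:
```
>>> tostring(element, encoding='unicode').replace('</%s>' % element.tag, '', -1)
'<text>Some text with data in it.\n'
```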
The solution, of course, is to do everything at once:
```
>>> tostring(element, encoding='unicode').replace('<%s>' % element.tag, '', 1).replace('</%s>' % element.tag, '', -1)
'Some text with data in it.\n'
```
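The steps above can be collected into one runnable script. This is only a sketch under a few assumptions: Python 3, the standard library's ElementTree in place of lxml (so nothing needs installing), and a made-up sample document. The `<b>` tag and the helper name `strip_outer_tags` are hypothetical and not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the <b> tag is only for illustration.
root = ET.fromstring(
    "<xml><title>Title</title><text>Some <b>text</b> in it.</text></xml>"
)
element = root.find("text")

# encoding="unicode" makes tostring() return str rather than bytes.
serialized = ET.tostring(element, encoding="unicode")


def strip_outer_tags(serialized, tag):
    """Remove the first plain opening tag and every closing tag named `tag`."""
    without_open = serialized.replace("<%s>" % tag, "", 1)
    # A count of -1 means "no limit": all closing tags with this name go.
    return without_open.replace("</%s>" % tag, "", -1)


inner = strip_outer_tags(serialized, element.tag)
print(inner)
```

The nested `<b>` markup survives because only strings matching the outer tag name are removed.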
EDIT: @Charles made a good point: this code is fragile since the tag can have attributes. A possible yet still limited solution is to split the string at the first >:
```
>>> tostring(element, encoding='unicode').split('>', 1)
['<text', 'Some text with data in it.</text>\n']
```
then take the second resulting string:
```
>>> tostring(element, encoding='unicode').split('>', 1)[1]
'Some text with data in it.</text>\n'
```
then rsplit it at the last </:
```
>>> tostring(element, encoding='unicode').split('>', 1)[1].rsplit('</', 1)
['Some text with data in it.', 'text>\n']
```
and finally take the first result:
```
>>> tostring(element, encoding='unicode').split('>', 1)[1].rsplit('</', 1)[0]
'Some text with data in it.'
```
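To see why the split-based variant helps when the tag carries attributes, here is a small sketch. It assumes Python 3 and the standard library's ElementTree; the sample element, its `id` attribute, and the helper name `between_outer_tags` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element whose opening tag carries an attribute.
element = ET.fromstring('<text id="t1">Some <b>text</b> in it.</text>')
serialized = ET.tostring(element, encoding="unicode")

# The replace-based approach looks for a plain "<text>" and finds none,
# so the string comes back unchanged.
unchanged = serialized.replace("<%s>" % element.tag, "", 1)


def between_outer_tags(serialized):
    """Keep what lies between the first '>' and the last '</'."""
    after_open = serialized.split(">", 1)[1]
    return after_open.rsplit("</", 1)[0]


inner = between_outer_tags(serialized)
print(inner)
```

Here `unchanged` still contains the opening tag, while `between_outer_tags` drops it along with the attribute.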
Nonetheless, this code is still fragile, since > is a perfectly valid char in XML, even inside attributes.

Anyway, I have to acknowledge that MattH's solution is the real, general solution.
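For reference, one common way to avoid string surgery altogether is to join the element's leading text with the serialization of each child. This is a sketch of that general idea, not necessarily the code in MattH's answer (which is not reproduced here). It assumes Python 3 and the standard library's ElementTree; the helper name `inner_xml` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def inner_xml(element):
    """Return the text and child markup between an element's own tags."""
    # element.text is already unescaped, so escape it again for output.
    parts = [escape(element.text or "")]
    for child in element:
        # tostring() also emits the child's tail text, so the text that
        # follows each child element is kept.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Hypothetical sample: an attribute containing '>' and a nested <text>.
element = ET.fromstring(
    '<text note="a>b">Some <b>text</b> with <text>nested</text> data.</text>'
)
print(inner_xml(element))
```

Because the outer tags are never serialized, attributes, `>` characters, and nested elements with the same name need no special handling.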

* Actually, this solution works with ElementTree, too, which is great if you do not want to depend on lxml. The main difference is that ElementTree supports only a limited subset of XPath, through find() and findall().