How do I get the whole content between two xml tags in Python?

我寻月下人不归 2020-12-15 09:13

I am trying to get the whole content between an opening XML tag and its closing counterpart.

Getting the content in straightforward cases like title below is easy

5 Answers
  •  醉酒成梦
    2020-12-15 09:17

    I like @Marcin's solution above; however, I found that his 2nd option (converting a sub-node, not the root of the tree) does not handle entities.

    His code from above (modified to add an entity):

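    from lxml import etree
    # Element names restored from the question and the XPath; the root name is assumed.
    t = etree.XML("""<root>
      <title>Some testing stuff</title>
      <text>this &amp; that.</text>
    </root>""")
    e = t.xpath('//text')[0]
    # Python 3: serialize each child to str so it can be joined with e.text
    print((e.text + ''.join(etree.tostring(c, encoding='unicode') for c in e)).strip())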
    

    returns:

    this & that.
    

    with a bare/unescaped '&' character instead of a proper entity ('&amp;').

    My solution was to call etree.tostring at the node level (instead of on each child), then strip off the starting and ending tags using a regular expression:

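    import re
    from lxml import etree
    # Element names restored from the question and the XPath; the root name is assumed.
    t = etree.XML("""<root>
      <title>Some testing stuff</title>
      <text>this &amp; that.</text>
    </root>""")

    e = t.xpath('//text')[0]
    # encoding='unicode' returns str, so the str pattern below also works in Python 3
    xml = etree.tostring(e, encoding='unicode')
    # Capture everything between the node's own start tag and its final end tag
    inner = re.match(r'<[^>]*?>(.*)</[^>]*>\s*$', xml, flags=re.DOTALL).group(1)
    print(inner)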
    

    produces:

    this &amp; that.
    

    I used re.DOTALL to ensure this works for XML containing newlines.
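
For anyone who wants to try this without installing lxml, here is a minimal sketch of the same idea using the standard library's xml.etree.ElementTree instead (a swapped-in parser, not the lxml code above). The sample document, the root element name, the added <b> child, and the helper names inner_xml_regex and inner_xml_escape are all made up for illustration. The second helper is a possible regex-free variant: it re-escapes the leading text with xml.sax.saxutils.escape and then appends each serialized child.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample document; the root name and the <b> child are assumptions.
DOC = """<root>
  <title>Some testing stuff</title>
  <text>this &amp; <b>that</b>.</text>
</root>"""


def inner_xml_regex(element):
    """Serialize the element, then strip its outer start and end tags."""
    xml = ET.tostring(element, encoding="unicode")
    match = re.match(r"<[^>]*?>(.*)</[^>]*>\s*$", xml, flags=re.DOTALL)
    # A self-closing element such as <text /> has no end tag, so nothing matches.
    return match.group(1) if match else ""


def inner_xml_escape(element):
    """Re-escape the leading text and append each serialized child."""
    head = escape(element.text or "")
    # tostring() on a child also emits that child's tail text, already escaped.
    return head + "".join(ET.tostring(child, encoding="unicode") for child in element)


node = ET.fromstring(DOC).find("text")
print(inner_xml_regex(node))
print(inner_xml_escape(node))
```

Both helpers should print the same inner markup for this sample, with the ampersand kept as an entity. The regex version relies on the serializer's output shape, so the escape-based version may be easier to reason about when only the leading text needs fixing.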
