Python strip XML tags from document

青春惊慌失措 2020-12-19 00:44

I am trying to strip XML tags from a document using Python, a language I am a novice in. Here is my first attempt using regex, which was really a hope-for-the-best idea.

3 Answers
  • 2020-12-19 01:12

    An alternative to Jeremiah's answer without requiring the lxml external library:

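    import xml.etree.ElementTree as ET
    ...
    # Text is assumed to hold the XML document as a string
    tree = ET.fromstring(Text)
    # method='text' emits only the text content, without tags.
    # With a byte encoding such as 'utf8' the result is bytes on Python 3;
    # pass encoding='unicode' there to get a str instead.
    notags = ET.tostring(tree, encoding='utf8', method='text')
    print(notags)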
    

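    xml.etree.ElementTree has been in the standard library since Python 2.5, but the method argument to tostring() needs Python 2.7 or later.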
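
    As a minimal sketch of the ElementTree approach on an in-memory string: the sample document and the variable names below are made up for illustration, since the question's actual input is not shown. It assumes Python 3, where encoding="unicode" makes tostring() return a str.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the question's real input is not shown.
xml_text = "<doc><title>Hello</title><p>Plain <b>bold</b> tail</p></doc>"

root = ET.fromstring(xml_text)

# method="text" drops the tags; encoding="unicode" returns str, not bytes.
notags = ET.tostring(root, encoding="unicode", method="text")

# itertext() yields the same text pieces in document order.
joined = "".join(root.itertext())

print(notags)
```

    Note that text from adjacent elements is concatenated with no separator, so a join with a space or newline over itertext() may be preferable for readable output.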

  • 2020-12-19 01:13

    Please note that regular expressions are usually not the right tool for this. See Jeremiah's answer.

    Try this:

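    import re

    # Read the whole file and delete anything that looks like a tag.
    with open("/path/to/file") as f:
        text = re.sub(r'<[^<]+>', "", f.read())
    # Overwrite the same file with the stripped text.
    with open("/path/to/file", "w") as f:
        f.write(text)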
    
  • 2020-12-19 01:23

    The most reliable way to do this is probably with lxml.

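    from lxml import etree
    ...
    # Parse the file into an element tree
    tree = etree.parse('somefile.xml')
    # method='text' serializes only the text content (bytes, UTF-8 encoded)
    notags = etree.tostring(tree, encoding='utf8', method='text')
    print(notags)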
    

    This avoids the problems of "parsing" XML with regular expressions, and should correctly handle escaping: entities such as &amp; are decoded rather than left in the output.
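
    To make the difference concrete, here is a small comparison sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra installs, and the sample string is invented for illustration. The regex is the one from the regex answer in this thread.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: one escaped entity and one literal '>' in text.
sample = "<note><to>Tom &amp; Ann</to><body>1 > 0</body></note>"

# Regex approach: the greedy match runs up to the last '>' before the next
# '<', so it swallows part of the body text, and entities stay escaped.
regex_text = re.sub(r"<[^<]+>", "", sample)

# Parser approach: tags are removed structurally and entities are decoded.
parsed_text = "".join(ET.fromstring(sample).itertext())

print(repr(regex_text))   # 'Tom &amp; Ann 0'
print(repr(parsed_text))  # 'Tom & Ann1 > 0'
```

    The regex output loses the "1 >" from the body and keeps &amp; undecoded, while the parser returns the text as written.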
