I am trying to strip XML tags from a document using Python, a language I am a novice in. Here is my first attempt using regex, which was really a hope-for-the-best idea.
An alternative to Jeremiah's answer without requiring the lxml external library:
import xml.etree.ElementTree as ET
...
tree = ET.fromstring(Text)
notags = ET.tostring(tree, encoding='utf8', method='text')
print(notags)
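xml.etree.ElementTree has been in the standard library since Python 2.5, but the method argument to tostring was only added in Python 2.7, so this needs Python >= 2.7. On Python 3, tostring with encoding='utf8' returns bytes; pass encoding='unicode' if you want a str.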
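For Python 3, here is a minimal sketch of the same standard-library idea that returns a str directly. It uses Element.itertext() instead of tostring; the helper name strip_tags and the sample string are made up for illustration and are not part of the original answer.

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_text):
    """Return the text content of an XML string with all tags removed.

    strip_tags is a hypothetical helper name used only for this sketch.
    """
    root = ET.fromstring(xml_text)
    # itertext() yields the text of the root and all of its descendants
    # (including text that follows child elements) in document order.
    return "".join(root.itertext())


# Illustrative input, not taken from the question
sample = "<doc><title>Hello</title> <body>a &amp; b</body></doc>"
print(strip_tags(sample))  # prints: Hello a & b
```

Because the parser decodes entities, &amp; comes out as a plain & in the result.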
Please note that regular expressions are usually not the right tool for this. See Jeremiah's answer.
Try this:
import re

# Read the whole file first so it is closed before being reopened for writing
with open("/path/to/file") as f:
    text = re.sub(r'<[^<]+>', "", f.read())
with open("/path/to/file", "w") as f:
    f.write(text)
The most reliable way to do this is probably with lxml.
from lxml import etree
...
tree = etree.parse('somefile.xml')
notags = etree.tostring(tree, encoding='utf8', method='text')
print(notags)
It avoids the problems of "parsing" XML with regular expressions, and should correctly handle escaped entities and similar details.
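To make the difference concrete, here is a small hedged comparison. It uses the standard-library xml.etree.ElementTree parser rather than lxml so that it runs without extra dependencies, and the function names and the sample string are invented for this sketch. The regex is the one from the "Try this" answer.

```python
import re
import xml.etree.ElementTree as ET


def strip_with_regex(xml_text):
    # Same pattern as the regex-based answer in this thread
    return re.sub(r'<[^<]+>', '', xml_text)


def strip_with_parser(xml_text):
    # Parse the document and join all text nodes in document order
    return "".join(ET.fromstring(xml_text).itertext())


# Illustrative input: a literal '>' in the text plus an escaped ampersand
sample = "<p>1 > 0 &amp; true</p>"

print(repr(strip_with_regex(sample)))   # prints: ' 0 &amp; true'
print(repr(strip_with_parser(sample)))  # prints: '1 > 0 & true'
```

On this input the regex swallows the text up to the stray > and leaves &amp; escaped, while the parser keeps the text intact and decodes the entity.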