Question
I have an XML file:
<body>
<entry>
I go to <hw>to</hw> to school.
</entry>
</body>
For some reason, I changed <hw> to &lt;hw&gt; and </hw> to &lt;/hw&gt; before parsing it with the lxml parser:
<body>
<entry>
I go to &lt;hw&gt;to&lt;/hw&gt; to school.
</entry>
</body>
But after modifying the parsed XML data, I want to get a real <hw> element back, not the escaped text &lt;hw&gt;. How can I do that?
Answer 1:
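First import an unescape function and pick the entry to fix (here e is lxml.etree and body is the parsed root element):
from xml.sax.saxutils import unescape
from lxml import etree as e
entry = body[0]
Then serialize the entry, unescape it, re-parse it, and replace the original element with the result:
body.replace(entry, e.fromstring(unescape(e.tostring(entry, encoding="unicode"))))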
Answer 2:
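If you know which element contains the wrongly escaped elements, you can re-parse just its text:
from lxml import etree
# parse the whole document as usual
# find the entry element
# parse the escaped text; it is mixed content (text plus elements),
# so wrap it in a temporary root element before parsing
wrapper = etree.fromstring("<wrapper>" + entry.text + "</wrapper>")
# (optionally) move the parsed content into the tree
entry.text = wrapper.text
entry.extend(list(wrapper))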
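A self-contained sketch of this approach on the question's sample document follows. Because the entry's text mixes plain text with an element, it cannot be parsed as a single element directly, so the sketch wraps it in a temporary root first. The helper name parse_escaped_text and the wrapper tag are hypothetical, chosen for illustration only.

```python
from lxml import etree

# Sample document from the question, with the <hw> tags escaped.
escaped_doc = (
    "<body>\n"
    "<entry>\n"
    "I go to &lt;hw&gt;to&lt;/hw&gt; to school.\n"
    "</entry>\n"
    "</body>"
)


def parse_escaped_text(element):
    """Re-parse element.text as XML and graft the result into element.

    Hypothetical helper: assumes element.text is well-formed XML content
    once it is wrapped in a single root element.
    """
    wrapper = etree.fromstring("<wrapper>" + element.text + "</wrapper>")
    # Text before the first child stays as the element's text.
    element.text = wrapper.text
    # lxml moves each child together with its tail text.
    for child in list(wrapper):
        element.append(child)


body = etree.fromstring(escaped_doc)
entry = body.find("entry")
parse_escaped_text(entry)

result = etree.tostring(body, encoding="unicode")
print(result)
```

Unlike unescaping the serialized element, this only touches the text of the element you pick, so escaped characters elsewhere in the document are left alone.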
Source: https://stackoverflow.com/questions/14659423/how-to-convert-lt-into-in-lxml-python