Question
I'm having a small issue with lxml. I'm converting an XML doc into an HTML doc. The original XML looks like this (it looks like HTML, but it's in the XML doc):
<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>
When I do this (item is the string above):
lxml.html.tostring(lxml.html.fromstring(item))
I get this:
<div><p>Localization - Eiffel tower? Paris or Vegas </p><p>Bayes theorem p(A|B)</p></div>
I don't have any problem with the <div>s, but the fact that the 'Bayes theorem' paragraph is no longer nested within the outer paragraph is a problem.
Anyone know why lxml is doing this and how to stop it? Thanks.
Answer 1:
lxml is doing this because its HTML parser doesn't store invalid HTML, and <p> elements can't be nested in HTML:
The P element represents a paragraph. It cannot contain block-level elements (including P itself).
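To see the repair happen, here is a minimal sketch that parses the string from the question with the HTML parser and looks for any <p> that still has another <p> below it. The variable name nested is just for illustration. Going by the output shown in the question, the list should come back empty.

```python
import lxml.html

item = "<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>"

# The HTML parser closes the open <p> as soon as it meets the next <p>,
# so the two paragraphs end up as siblings instead of parent and child.
root = lxml.html.fromstring(item)

# Collect every <p> that still contains another <p> somewhere below it.
nested = [p for p in root.iter("p") if p.find(".//p") is not None]
print(len(nested))

# Show the text of each paragraph the parser produced.
for p in root.iter("p"):
    print(repr(p.text))
```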
Answer 2:
You're using lxml's HTML parser, not an XML parser. Try this instead:
>>> from lxml import etree
>>> item = '<p>Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>'
>>> root = etree.fromstring(item)
>>> etree.tostring(root, pretty_print=True)
b'<p>Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>\n'
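If the end goal is HTML output, one possible follow-up (a sketch, not necessarily what the asker ended up doing) is to keep the XML parser for reading and ask etree.tostring for HTML serialization with method="html". The names inner and html_text are made up for this example, and it assumes the input is well-formed XML, otherwise etree.fromstring raises an error.

```python
from lxml import etree

item = "<p>Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>"

# Parse as XML so the nested <p> stays a child of the outer <p>.
root = etree.fromstring(item)

# Look for a <p> that is a direct child of the root <p>.
inner = root.find("p")
print(inner is not None)

# Serialize using HTML output rules; encoding="unicode" returns a str
# instead of bytes.
html_text = etree.tostring(root, method="html", encoding="unicode")
print(html_text)
```

Note that the result is still invalid HTML in the strict sense, so a browser or another HTML parser may split the paragraphs again when it reads it.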
Source: https://stackoverflow.com/questions/7180919/python-lxml-changes-tag-hierarchy