Python lxml changes tag hierarchy?

Question

I'm having a small issue with lxml. I'm converting an XML doc into an HTML doc. The original XML looks like this (it looks like HTML, but it's in the XML doc):

<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>

When I do this (item is the string above):

lxml.html.tostring(lxml.html.fromstring(item))

I get this:

<div><p>Localization - Eiffel tower? Paris or Vegas </p><p>Bayes theorem p(A|B)</p></div>

I don't have any problem with the <div>s, but the fact that the 'Bayes theorem' paragraph is no longer nested within the outer paragraph is a problem.

Anyone know why lxml is doing this and how to stop it? Thanks.

Answer 1:

lxml is doing this because its HTML parser repairs invalid HTML instead of storing it, and <p> elements can't be nested in HTML, so the second <p> implicitly closes the first:

The P element represents a paragraph. It cannot contain block-level elements (including P itself).

Answer 2:

You're using lxml's HTML parser, not an XML parser. Try this instead:

>>> from lxml import etree
>>> item = '<p>Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>'
>>> root = etree.fromstring(item)
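>>> # encoding='unicode' returns a str; on Python 3 the default return value is bytes
>>> etree.tostring(root, pretty_print=True, encoding='unicode')
'<p>Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>\n'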
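
To confirm that the nesting really survives XML parsing, one option is to inspect the parsed tree rather than just the serialized string. The sketch below is only an illustration under a few assumptions: it uses the full string from the question, and it falls back to the standard library's xml.etree.ElementTree if lxml is not installed, since both expose the few calls used here. The variable names (inner, serialized) are made up for the example.

```python
# Prefer lxml; fall back to the standard library, which offers the same calls used below.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

item = "<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>"

# Parse as XML: no HTML repair rules are applied, so nested <p> is accepted.
root = etree.fromstring(item)

# The inner paragraph is the first (and only) child element of the outer one.
inner = root[0]
print(root.tag, len(root))      # outer tag and number of child elements
print(inner.tag, inner.text)    # inner tag and its text

# Serialize back to a str instead of bytes.
serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```

If the result has to be served as HTML afterwards, keep in mind that a browser's HTML parser will apply the same repair rule and un-nest the paragraphs again, so switching the inner element to something that may legally be nested could be the more robust fix.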

Source: https://stackoverflow.com/questions/7180919/python-lxml-changes-tag-hierarchy

Tags

python

html

xml

lxml
