How to parse HTML with entities such as &amp;nbsp; using the built-in library ElementTree in Python 2 & Python 3?

一个人想着一个人 posted on 2019-11-29 08:55:05

Inspired by this post, we can prepend a DOCTYPE declaration that defines the needed entities to the incoming raw HTML content; ElementTree then parses it out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)
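The comment about defining more entities can be taken a step further by generating the declarations from the standard library's entity table instead of writing them by hand. The following is only a sketch under stated assumptions: the helper names `build_doctype` and `parse_html_fragment` are hypothetical, the input is assumed to be well-formed enough for an XML parser, and the five entities that XML already predefines are skipped rather than redeclared.

```python
import xml.etree.ElementTree as ET

try:
    from html.entities import name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# XML predefines these, so they are left out of the generated DOCTYPE.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def build_doctype(root_name="html"):
    """Return a DOCTYPE whose internal subset declares the HTML named entities."""
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE %s [%s]>" % (root_name, decls)


def parse_html_fragment(markup):
    """Hypothetical helper: prepend the generated DOCTYPE, then parse."""
    return ET.fromstring(build_doctype() + markup)


root = parse_html_fragment("<html><p>&copy;&nbsp;2019 caf&eacute;</p></html>")
p_text = root.find("p").text
print(repr(p_text))
```

Each entity is declared as a numeric character reference, so the parsed text contains the actual character (for example U+00A0 for nbsp) rather than a plain space.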
user9456810

As an alternative, adding the entity to the parser's "entity" attribute (a dictionary that maps entity names to replacement text) worked for me:

parser.entity["nbsp"] = ' '
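For context, here is how that one-liner might fit into a complete parse. This is a minimal sketch, not the answerer's exact code. It rests on one assumption worth stating loudly: as far as I can tell, expat only hands unknown entities over to ElementTree's lookup table when the document declares an external DTD, so the sketch prepends a DOCTYPE with a system identifier (the DTD is not fetched). The `DOCTYPE` constant and the sample markup are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: a DOCTYPE that names an external DTD is needed so that
# undefined entities are looked up in parser.entity instead of raising.
DOCTYPE = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
           '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')

html = '<html><div>It is not unusual to see &nbsp; in an HTML page.</div></html>'

parser = ET.XMLParser()
parser.entity["nbsp"] = u"\u00a0"  # map the entity name to a no-break space

parser.feed(DOCTYPE + html)
root = parser.close()  # close() returns the root element

div_text = root.find("div").text
print(repr(div_text))
```

If the input already starts with its own DOCTYPE, prepending a second one would make the document malformed, so the prefix should be added only when it is missing.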