Is it possible to hook up a more robust HTML parser to Python mechanize?

Posted by 本小妞迷上赌 on 2019-11-28 07:46:33

Reading from the big example on the first page of the mechanize website:

# Sometimes it's useful to process bad headers or bad HTML:
response = br.response()  # this is a copy of response
headers = response.info()  # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)

So it seems entirely possible to preprocess the response with another parser that regenerates well-formed HTML, then feed the result back to mechanize for further processing.
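The round trip in that example (take the response, rewrite its body, hand it back) can be sketched as a small helper. This is only a sketch: `rewrite_response`, `fix_comment_openers` and `FakeResponse` are hypothetical names, not part of mechanize. The helper assumes nothing beyond the `get_data()` and `set_data()` methods used in the example, and `FakeResponse` stands in for a real response so the snippet runs without a network connection.

```python
def rewrite_response(response, repair):
    """Replace the body of `response` with repair(body) and return the response.

    `response` only needs the get_data()/set_data() methods used in the
    mechanize example; `repair` is any callable that takes the body and
    returns a cleaned-up body of the same type.
    """
    response.set_data(repair(response.get_data()))
    return response


class FakeResponse(object):
    """Hypothetical stand-in for a mechanize response (illustration only)."""

    def __init__(self, data):
        self._data = data

    def get_data(self):
        return self._data

    def set_data(self, data):
        self._data = data


def fix_comment_openers(body):
    # Same repair as the mechanize example: "<!---" becomes "<!--".
    return body.replace("<!---", "<!--")


fixed = rewrite_response(FakeResponse("<!--- note --><p>hi</p>"), fix_comment_openers)
print(fixed.get_data())
```

With a real browser the equivalent call would be along the lines of `br.set_response(rewrite_response(br.response(), repair))`, where `repair` can be any parser-backed cleanup function.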

cerberos

I had a problem where a form field was missing from a form. I couldn't find any malformed HTML, but I figured that was the cause, so I ran the page through BeautifulSoup's prettify function and it worked.

resp = br.open(url)
soup = BeautifulSoup(resp.get_data())
resp.set_data(soup.prettify())
br.set_response(resp)

I'd love to know how to do this automatically.

Edit: found out how to do this automatically:

class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        if 'html' in response.info().get('content-type', ''):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

    # also parse https in the same way
    https_response = http_response

br = mechanize.Browser()
br.add_handler(PrettifyHandler())

br will now use BeautifulSoup to reparse every response whose content type (MIME type) contains html, e.g. text/html.
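The handler's test matches any content type that contains the substring html. A stricter check is sketched below, as an assumption about what might be wanted rather than anything mechanize requires: `is_html_content_type` and `HTML_MEDIA_TYPES` are hypothetical names, and the helper strips parameters such as charset before comparing the media type against a short allow-list.

```python
HTML_MEDIA_TYPES = ("text/html", "application/xhtml+xml")


def is_html_content_type(value):
    """Return True if a Content-Type header value names an HTML media type."""
    if not value:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = value.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


print(is_html_content_type("text/html; charset=utf-8"))
print(is_html_content_type("Text/HTML"))
print(is_html_content_type("application/json"))
print(is_html_content_type(None))
```

Inside the handler, the condition could call this helper with the content-type value it already reads from response.info(). The trade-off is that unusual HTML-like types not on the allow-list would be skipped.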

What you're looking for can be done with lxml.etree, the xml.etree.ElementTree emulator (and replacement) provided by lxml.

First, take some malformed HTML:

% cat bad.html
<html>
<HEAD>
    <TITLE>this HTML is awful</title>
</head>
<body>
    <h1>THIS IS H1</H1>
    <A HREF=MYLINK.HTML>This is a link and it is awful</a>
    <img src=yay.gif>
</body>
</html>

(Note the mismatched case between opening and closing tags and the unquoted attribute values.)

And then parse it:

>>> from lxml import etree
>>> bad = open('bad.html').read()
>>> html = etree.HTML(bad)
>>> print etree.tostring(html)
<html><head><title>this HTML is awful</title></head><body>
    <h1>THIS IS H1</h1>
    <a href="MYLINK.HTML">This is a link and it is awful</a>
    <img src="yay.gif"/></body></html>

Observe that the tag case and attribute quoting have been corrected for us.

If you were having problems parsing the HTML before, this might be the answer you're looking for. As for the details of HTTP, that is another matter entirely.
