HTMLParser misunderstands entities in href. Is it a bug or not? Should I report it?

Question

I don't want to know how to solve the problem, because I have solved it on my own. I'm just asking if it is really a bug and whether and how I should report it. You can find the code and the output below:

from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        for at in attrs:
            if at[0] == 'href':
                print(at[1])
        return super().handle_starttag(tag, attrs)

    def handle_data(self, data):
        return super().handle_data(data)

    def handle_endtag(self, tag):
        return super().handle_endtag(tag)



s = '<a href="/home?ID=123&gt3=7">nomeLink</a>'

p = MyParser()
p.feed(s)

The following is the output:

/home?ID=123>3=7

Answer 1:

No, it is not a bug. You are feeding the parser invalid HTML; the correct way to include & in a URL in an HTML attribute is to escape it as &amp;:

>>> s = '<a href="/home?ID=123&amp;gt3=7">nomeLink</a>'
>>> p = MyParser()
>>> p.feed(s)
/home?ID=123&gt3=7
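
If you generate such links yourself, the standard library can do the escaping for you. A minimal sketch, assuming the link is built from a path and a dict of query parameters; build_link and its arguments are hypothetical names used only for illustration:

```python
from html import escape
from urllib.parse import urlencode


def build_link(path, params, text):
    """Return an <a> element whose href has its ampersands escaped."""
    url = path + "?" + urlencode(params)  # here: /home?ID=123&gt3=7
    # quote=True also escapes " and ', which matters inside an attribute value
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(text))


link = build_link("/home", {"ID": 123, "gt3": 7}, "nomeLink")
print(link)  # <a href="/home?ID=123&amp;gt3=7">nomeLink</a>
```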

The parser did its best (as required by the HTML standard) and gave you 'repaired' data to the best of its ability. In this case, it tried to repair another common broken-HTML error: spelling &gt; as &gt (forgetting the semicolon).
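
A comparable lenient decoding can be observed without a parser by calling html.unescape from the standard library, which also accepts some named references that lack the trailing semicolon. A small sketch with the two strings discussed here (the variable names are mine):

```python
from html import unescape

broken = "/home?ID=123&gt3=7"     # unescaped ampersand, as in the question
valid = "/home?ID=123&amp;gt3=7"  # ampersand escaped as &amp;

# "&gt" is decoded even without its semicolon, so "&gt3" turns into ">3"
print(unescape(broken))  # /home?ID=123>3=7

# "&amp;" becomes "&", and the result is not decoded a second time
print(unescape(valid))   # /home?ID=123&gt3=7
```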

Rather than build on top of the (rather low-level) html.parser library yourself, I recommend you use BeautifulSoup instead. BeautifulSoup supports multiple parsers, and some of those can handle broken HTML better than others.

For example, the html5lib parser can handle unescaped ampersands in attributes better than html.parser can:

>>> from bs4 import BeautifulSoup
>>> s = '<a href="/home?ID=123&gt3=7">nomeLink</a>'
>>> BeautifulSoup(s, 'html.parser').find('a')['href']
'/home?ID=123>3=7'
>>> BeautifulSoup(s, 'html5lib').find('a')['href']
'/home?ID=123&gt3=7'

For completeness' sake, the third supported parser, lxml, also treats unescaped ampersands as if they were escaped:

>>> BeautifulSoup(s, 'lxml').find('a')['href']
'/home?ID=123&gt3=7'

You could use lxml and html5lib directly, but then you'd forgo the nice high-level API that BeautifulSoup offers.

Answer 2:

Python 3.3.2 (v3.3.2, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32

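Let's feed s = '<p a="&#39;">' to MyHTMLParser:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with entities already decoded
        print(attrs)

MyHTMLParser().feed('<p a="&#39;">')

This is a valid HTML tag, where &#39; stands for '. In this case MyHTMLParser gives the following attrs: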

[('a', "'")]

The reason for this result is the use of the unescape function:

Lines in source file html/parser.py, class HTMLParser
348:            if attrvalue:
349:                attrvalue = self.unescape(attrvalue)

where self.unescape is an internal helper that removes special-character quoting and is used for attribute values only. See lines 504-532 in parser.py.
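
Note that the line numbers above refer to Python 3.3. As far as I know, HTMLParser.unescape was deprecated in Python 3.4 and removed in 3.9, and the public equivalent is html.unescape. A short sketch of how that function decodes the common spellings of an apostrophe reference:

```python
from html import unescape

# Decimal, hexadecimal and named references all decode to the same character
decoded = [unescape(ref) for ref in ("&#39;", "&#x27;", "&apos;")]
print(decoded)  # ["'", "'", "'"]
```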

Source: https://stackoverflow.com/questions/26072209/htmlparser-misunderstands-entities-in-href-is-it-a-bug-or-not-should-i-report

Tags

python

html

python-3.x

html-entities

html-parser