Have HTMLParser differentiate between link-text and other data?

Question

Say I have HTML code similar to this:

<a href="http://example.org/">Stuff I do want</a>
<p>Stuff I don't want</p>

HTMLParser's handle_data doesn't differentiate between the link text (the stuff I do want; is "link text" even the right term?) and the stuff I don't want. Does HTMLParser have a built-in way to make handle_data receive only link text and nothing else?

Answer 1:

Basically you have to write a handle_starttag() method as well. Just save off every tag you see as self.lasttag or something. Then, in your handle_data() method, just check self.lasttag and see if it's 'a' (indicating that the last tag you saw was an HTML anchor tag and therefore you're in a link).

Something like this (untested) should work:

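from html.parser import HTMLParser  # Python 3; in Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    lasttag = None  # most recent start tag seen

    def handle_starttag(self, tag, attr):
        # Remember the tag so handle_data() knows what it follows.
        self.lasttag = tag.lower()

    def handle_data(self, data):
        # Print only non-blank text that follows an <a> start tag.
        if self.lasttag == "a" and data.strip():
            print(data)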
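
To see how this simple approach behaves, here is a small Python 3 sketch that collects the text in a list instead of printing it. The class name LastTagParser and the attributes last_start and found are made up for illustration and are not part of HTMLParser.

```python
from html.parser import HTMLParser  # Python 3 module name


class LastTagParser(HTMLParser):
    """Hypothetical variant of the 'remember the last start tag' idea."""

    def __init__(self):
        super().__init__()
        self.last_start = None  # most recent start tag seen
        self.found = []         # text that followed an <a> start tag

    def handle_starttag(self, tag, attrs):
        self.last_start = tag

    def handle_data(self, data):
        if self.last_start == "a" and data.strip():
            self.found.append(data)


def link_text(html):
    """Feed one document through a fresh parser and return what it kept."""
    parser = LastTagParser()
    parser.feed(html)
    parser.close()
    return parser.found


simple = link_text('<a href="http://example.org/">Stuff I do want</a>\n'
                   "<p>Stuff I don't want</p>")
nested = link_text('<a href="http://example.org/">Stuff <b>I</b> do want</a>')
trailing = link_text('<a href="http://example.org/">Stuff I do want</a>'
                     " and stuff I don't")

print(simple)    # ['Stuff I do want']
print(nested)    # ['Stuff ']
print(trailing)  # ['Stuff I do want', " and stuff I don't"]
```

The last two cases show the limits of tracking only the most recent start tag: link text after a nested tag is lost, and text that follows the closing </a> is picked up by mistake.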

In fact it's permissible in HTML to have other tags inside an <a ...> ... </a> container, and the version above loses track of the link as soon as it sees one. There can also be anchors that contain text but aren't links (no href attribute). Both cases can be handled if desired. Again, this code is untested:

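from html.parser import HTMLParser  # Python 3; in Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    inlink = False  # True while inside an <a> that has an href
    data   = []     # text fragments of the current link (rebound for each link)

    def handle_starttag(self, tag, attr):
        # attr is a list of (name, value) pairs
        if tag.lower() == "a" and "href" in (k.lower() for k, v in attr):
            self.inlink = True
            self.data   = []

    def handle_endtag(self, tag):
        # Only report anchors that were opened with an href.
        if tag.lower() == "a" and self.inlink:
            self.inlink = False
            print("".join(self.data))

    def handle_data(self, data):
        # Nested tags don't matter: everything up to </a> is link text.
        if self.inlink:
            self.data.append(data)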

HTMLParser is what you'd call a SAX-style parser, which notifies you of the tags going by but makes you keep track of the tag hierarchy yourself. You can see how complicated this can get just by the differences between the first and second versions here.
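
As a rough illustration of what "keeping track of the tag hierarchy yourself" can look like, the following Python 3 sketch maintains an explicit stack of open tags. The class name StackParser, the VOID_TAGS set, and the links attribute are assumptions made for this example. It is a sketch, not a complete HTML tree builder.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StackParser(HTMLParser):
    """Hypothetical parser that tracks the currently open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, is_link) pairs for every open element
        self.links = []  # text of each <a href=...> seen so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        is_link = tag == "a" and any(name == "href" for name, _ in attrs)
        self.stack.append((tag, is_link))
        if is_link:
            self.links.append("")  # start a new link text

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text counts as link text if any enclosing element is a link.
        if any(is_link for _, is_link in self.stack):
            self.links[-1] += data


parser = StackParser()
parser.feed('<p>intro <a href="/x">Stuff <b>I</b><br> do want</a> tail</p>'
            '<a name="n">not a link</a>')
parser.close()
print(parser.links)  # ['Stuff I do want']
```

Even this small version has to deal with void elements and unclosed tags by hand, which is the bookkeeping a DOM-style parser does for you.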

DOM-style parsers are easier to work with for these kinds of tasks because they read the whole document into memory and produce a tree that is easily navigated and searched. DOM-style parsers tend to use more memory and be slower than SAX-style parsers, but for most documents that cost matters far less on today's hardware than it once did.
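
For comparison, here is a minimal tree-based sketch using the standard library's xml.etree.ElementTree. It assumes the markup is well-formed XML, which real-world HTML often is not; ElementTree is an XML parser, so for arbitrary HTML a lenient third-party HTML parser would be needed instead. The sample markup and the wrapping <div> are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: well-formed markup with a single root element.
markup = (
    "<div>"
    '<a href="http://example.org/">Stuff I <b>do</b> want</a>'
    "<p>Stuff I don't want</p>"
    '<a name="top">Anchor without href</a>'
    "</div>"
)

root = ET.fromstring(markup)

# Walk the tree: every <a> that has an href, with the text of all its descendants.
link_texts = [
    "".join(a.itertext())
    for a in root.iter("a")
    if "href" in a.attrib
]

print(link_texts)  # ['Stuff I do want']
```

Because the whole tree is available, the query is a single expression and needs no state flags.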

Source: https://stackoverflow.com/questions/9404309/have-htmlparser-differentiate-between-link-text-and-other-data

Tags

python

html-parsing