How to use Python's HTMLParser to extract specific links

Submitted by 醉酒当歌 on 2019-12-13 17:42:53

Question


I've been working on a basic web crawler in Python using the HTMLParser class. I fetch my links with a modified handle_starttag method that looks like this:

def handle_starttag(self, tag, attrs):
    if tag == 'a':
        for (key, value) in attrs:
            if key == 'href':
                newUrl = urljoin(self.baseUrl, value)
                self.links = self.links + [newUrl]

This worked very well when I wanted to find every link on the page. Now I only want to fetch certain links.

How would I go about only fetching links that are between the <td class="title"> and </td> tags, like this:

<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a><span class="comhead"> (arstechnica.com) </span></td>

Answer 1:


HTMLParser is a SAX-style, or streaming, parser, which means that you get pieces of the document as they are parsed, but never the whole document at once. The parser calls methods you provide to handle tags and other types of data. Any context you are interested in, such as which tags are nested inside other tags, you must track yourself from the tags you see passing by.

For example, if you see a <td> tag, then you know you are in a table cell, and can set a flag to that effect. When you see </td>, you know you have left a table cell and can clear that flag. To get the links inside a table cell, then, if you see <a> and you know that you are in a table cell (because of that flag you set), you grab the value of the tag's href attribute if it has one.

from html.parser import HTMLParser  # Python 3; on Python 2 use "from HTMLParser import HTMLParser"

class LinkExtractor(HTMLParser):

    def reset(self):
        # HTMLParser.__init__ calls reset(), so this state exists on every new instance
        HTMLParser.reset(self)
        self.extracting = False
        self.links      = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" or tag == "a":
            attrs = dict(attrs)   # save us from iterating over the attrs
        if tag == "td" and attrs.get("class", "") == "title":
            self.extracting = True    # entered a <td class="title"> cell
        elif tag == "a" and "href" in attrs and self.extracting:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.extracting = False   # left the cell
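
To show the flag approach end to end, here is a minimal sketch that combines it with the urljoin call from the question and feeds it the sample row. The class name TitleLinkParser, the base_url parameter, the example.com base URL, and the second "subtext" cell are assumptions made for illustration; they are not part of the original code. Note that it matches only an exact class="title" value, not a cell with several classes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleLinkParser(HTMLParser):
    """Collect absolute URLs of <a> tags found inside <td class="title"> cells."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url      # hypothetical parameter, used to resolve relative hrefs
        self.in_title_cell = False    # the "are we inside the cell?" flag
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            # Every new <td> resets the flag; only class="title" switches it on.
            self.in_title_cell = attrs.get("class") == "title"
        elif tag == "a" and self.in_title_cell and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title_cell = False


# Sample row from the question, plus an assumed second cell that should be ignored.
html = (
    '<table><tr>'
    '<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a>'
    '<span class="comhead"> (arstechnica.com) </span></td>'
    '<td class="subtext"><a href="/item?id=1">comments</a></td>'
    '</tr></table>'
)

parser = TitleLinkParser("http://example.com/")
parser.feed(html)
parser.close()
print(parser.links)  # ['http://www.stackoverflow.com']
```

Only the link in the title cell is collected; the link in the other cell is skipped because the flag is off by the time its <a> tag arrives.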

This quickly gets to be a pain as you need more and more context to get what you want from the document, which is why people are recommending lxml and BeautifulSoup. These are DOM-style parsers that keep track of the document hierarchy for you and provide various friendly ways to navigate it, such as a DOM API, XPath, and/or CSS selectors.
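
For contrast, here is a rough sketch of the tree-based style, using only the standard library's xml.etree.ElementTree and its limited XPath support. This is a stand-in chosen so the example runs without third-party packages: it works only because the snippet happens to be well-formed XML, which real-world HTML usually is not, so lxml or BeautifulSoup remain the practical choice. The surrounding table markup and the second cell are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The question's sample row wrapped in assumed table markup; must be well-formed XML here.
snippet = (
    '<table><tr>'
    '<td class="title"><a href="http://www.stackoverflow.com">StackOverflow</a>'
    '<span class="comhead"> (arstechnica.com) </span></td>'
    '<td class="subtext"><a href="/item?id=1">comments</a></td>'
    '</tr></table>'
)

root = ET.fromstring(snippet)

# One path expression replaces the flag bookkeeping:
# every <a> that is a direct child of a <td> whose class attribute is exactly "title".
links = [
    a.get("href")
    for a in root.findall(".//td[@class='title']/a")
    if a.get("href")
]
print(links)  # ['http://www.stackoverflow.com']
```

The parser builds the whole tree first, so the nesting context is already there to query and no state has to be tracked by hand.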

BTW, I answered a similar question recently here.



Source: https://stackoverflow.com/questions/9694769/how-to-use-pythons-htmlparser-to-extract-specific-links
