Extract absolute links from a page using HTMLParser

北城余情 submitted on 2019-12-10 20:39:22

Question


I'm using the following snippet to extract all the links on a page using HTMLParser, but quite a few of them come back as relative URLs. How can I convert these to absolute URLs for a domain, e.g. www.example.com?

import urllib, htmllib, formatter

class LinksExtractor(htmllib.HTMLParser):

   def __init__(self, formatter):
      htmllib.HTMLParser.__init__(self, formatter)
      self.links = []

   def start_a(self, attrs):
      if len(attrs) > 0 :
         for attr in attrs :
            if attr[0] == "href":
                self.links.append(attr[1])

   def get_links(self):
      return self.links


format = formatter.NullFormatter()
htmlparser = LinksExtractor(format)

data = urllib.urlopen("http://cis.poly.edu/index.htm")
htmlparser.feed(data.read())
htmlparser.close()

links = htmlparser.get_links()
print links

Thanks


Answer 1:


You want

urlparse.urljoin(base, url[, allow_fragments])

http://docs.python.org/library/urlparse.html#urlparse.urljoin

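This resolves a relative URL against a base URL (typically the URL of the page you fetched) and returns the absolute result. If the URL you pass in is already a full absolute URL, it comes back unchanged, so you can run every extracted href through it without checking first.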
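
As a rough illustration, here is a minimal sketch of how the join could be applied while collecting links. It is written for Python 3, where `urljoin` lives in `urllib.parse` and `htmllib` no longer exists, so it uses `html.parser.HTMLParser` instead of the class from the question. The class name `AbsoluteLinkExtractor` and the `sample_html` string are made up for the example, and the page is parsed from a string rather than fetched over the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbsoluteLinkExtractor(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip <a> tags whose href is missing or empty.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content, used here instead of a live download.
sample_html = """
<a href="/about.html">About</a>
<a href="courses/index.htm">Courses</a>
<a href="http://www.example.com/other">Other</a>
"""

parser = AbsoluteLinkExtractor("http://cis.poly.edu/index.htm")
parser.feed(sample_html)
parser.close()
print(parser.links)
```

Passing the page's own URL as the base matters: a root-relative href such as `/about.html` is resolved against the host, while `courses/index.htm` is resolved against the directory of the base page.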



Source: https://stackoverflow.com/questions/6816138/extract-absolute-links-from-a-page-using-htmlparser
