How can I retrieve the links of a webpage and copy the URL addresses of the links using Python?
Question:
Answer 1:
Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# only parse the <a> tags; BeautifulSoup 3 tags use has_key() to test
# for the presence of an attribute
for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        print link['href']
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:
http://www.crummy.com/software/BeautifulSoup/documentation.html
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory- and speed-wise) if you know what you're parsing in advance.
Answer 2:
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
lxml.html also supports CSS3 selectors so this sort of thing is trivial.
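For instance, a minimal sketch using a CSS3 selector instead of XPath (this assumes the cssselect package is available, which newer lxml versions require for cssselect()):

import urllib
import lxml.html

# parse the page and select every <a> element carrying an href
# attribute, using a CSS3 selector rather than XPath
dom = lxml.html.fromstring(urllib.urlopen('http://www.nytimes.com').read())
for link in dom.cssselect('a[href]'):
    print link.get('href')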
An example with lxml and XPath would look like this:
import urllib
import lxml.html

connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())

# select the href attribute of every <a> tag (i.e. every link)
for link in dom.xpath('//a/@href'):
    print link
Answer 3:
For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:
from bs4 import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']
or the Python 3 version:
from bs4 import BeautifulSoup
import urllib.request

resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

resp = requests.get("http://www.gpsbasecamp.com/national-parks")

# prefer the encoding declared inside the HTML document over the
# (possibly misconfigured) charset from the HTTP Content-Type header
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding

soup = BeautifulSoup(resp.content, from_encoding=encoding)
for link in soup.find_all('a', href=True):
    print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
BeautifulSoup 3 stopped development in March 2012; new projects should always use BeautifulSoup 4.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong and conflict with the encoding declared in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
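For example, a small sketch illustrating that fallback (the URL is a placeholder; the printed value assumes the server sent a text/* Content-Type without a charset parameter):

import requests

# hypothetical server that sends "Content-Type: text/html" with no charset
resp = requests.get('http://www.somewhere.com')
if 'charset' not in resp.headers.get('content-type', '').lower():
    print(resp.encoding)  # ISO-8859-1 (Latin-1), the HTTP default for text/*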
Answer 4:
import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

for a in soup.findAll('a'):
    # a.get() avoids a KeyError for anchors without an href attribute
    if 'national-park' in a.get('href', ''):
        print 'found a url with national-park in the link'
Answer 5:
The following code retrieves all the links available on a webpage using urllib2 and BeautifulSoup 4:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(html)

for line in soup.find_all('a'):
    print(line.get('href'))
Answer 6:
Under the hood, BeautifulSoup can now use lxml as its parser. Requests, lxml and list comprehensions make a killer combo.
import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

# keep only absolute links that point outside nytimes.com
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comp, the "if '//' in x and 'nytimes.com' not in x" filter is a simple method to scrub the URL list of the site's 'internal' navigation URLs, etc.
Answer 7:
To find all the links, we will in this example use the urllib2 module together with the re module. One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list of strings, with each string representing one match.
import urllib2
import re

url = "http://www.somewhere.com"  # placeholder URL

# connect to a URL
website = urllib2.urlopen(url)

# read html code
html = website.read()

# use re.findall to get all the links; each match is a
# (full url, scheme prefix) tuple because of the nested group
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links
Answer 8:
Just for getting the links, without BeautifulSoup and regex:
import urllib2

url = "http://www.somewhere.com"
page = urllib2.urlopen(url)

# split the raw HTML on closing anchor tags, then slice the href
# value out of each chunk that opens an <a href="..."> tag
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"

for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind + len(tag):]
            end = item.index(endtag)
        except ValueError:
            pass
        else:
            print item[:end]
For more complex operations, of course, BSoup is still preferred.
Answer 9:
This script does what you're looking for, but also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    return resolve_links((link for link in get_dom(url).xpath('//a/@href')))

def guess_root(links):
    # use the first absolute link to infer the site's scheme and host
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    # materialize the generator so guess_root() doesn't consume
    # the links that precede the first absolute one
    links = list(links)
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link

for link in get_links('http://www.google.com'):
    print link
Answer 10:
Why not use regular expressions?
import urllib2
import re

url = "http://www.somewhere.com"
page = urllib2.urlopen(url)
page = page.read()

# each match is a (href value, link text) tuple
links = re.findall(r'<a\s+[^>]*?href="(.*?)"[^>]*?>(.*?)</a>', page)

for link in links:
    print('href: %s, HTML text: %s' % (link[0], link[1]))
Answer 11:
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations mentioned below).
import lxml.html

url = 'http://www.somewhere.com'  # placeholder URL
doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')

for link in links:
    print link.attrib['href']
The code above will return the links as is, and in most cases they will be relative links or absolute links from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in relative paths, but so far I haven't needed that. If you need to parse URL fragments containing ../ or ./ then urlparse.urljoin might come in handy, as sketched below.
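A minimal illustration of how urlparse.urljoin resolves such fragments (the base URL is just a placeholder):

import urlparse

base = 'http://example.com/music/rock/index.html'
print urlparse.urljoin(base, '../jazz/track.mp3')  # http://example.com/music/jazz/track.mp3
print urlparse.urljoin(base, './track.mp3')        # http://example.com/music/rock/track.mp3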
NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't do redirects, so for this reason the version below uses urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`\n')
    urltools = None

def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)

if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')

    for link in links:
        href = link.attrib['href']
        if fnmatch.fnmatch(href, glob_patt):
            # note the comma between the prefixes: without it Python would
            # concatenate 'https://' and 'ftp://' into one broken string
            if not href.startswith(('http://', 'https://', 'ftp://')):
                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)
            if urltools:
                href = urltools.normalize(href)
            print href
The usage is as follows:
getlinks.py http://stackoverflow.com/a/37758066/191246
getlinks.py http://stackoverflow.com/a/37758066/191246 "*users*"
getlinks.py http://fakedomain.mu/somepage.html "*.mp3"
Answer 12:
import urllib2
from bs4 import BeautifulSoup

a = urllib2.urlopen('http://dir.yahoo.com')
code = a.read()
soup = BeautifulSoup(code)
links = soup.findAll("a")

# to get the href part alone
print links[0].attrs['href']
Answer 13:
Here's an example using the accepted answer by @ars and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)
Answer 14:
I found the answer by @Blairg23 working, after the following correction (covering the scenario where it failed to work correctly):
import urlparse  # the urlparse module needs to be imported for urljoin

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            # resolve relative links against the base URL
            full_path = urlparse.urljoin(url, link['href'])
            wget.download(full_path)
For Python 3, urllib.parse.urljoin has to be used in order to obtain the full URL instead.
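A minimal sketch of the Python 3 equivalent (the file name is hypothetical):

from urllib.parse import urljoin

base = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
# relative hrefs are resolved against the base URL
print(urljoin(base, 'some_archive.tar.gz'))
# https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/some_archive.tar.gz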