Python check if website exists

Anonymous (unverified), submitted on 2019-12-03 02:33:02

Question:

I want to check whether a certain website exists. This is what I'm doing:

    import urllib2

    user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    link = "http://www.abc.com"
    req = urllib2.Request(link, headers=headers)
    page = urllib2.urlopen(req).read()  # ERROR 402 generated here!

If the page doesn't exist (error 402, or some other error), what can I do at the page = ... line to make sure that the page I'm reading does exist?

Answer 1:

You can use a HEAD request instead of GET. It downloads only the headers, not the content. You can then check the response status.

    import httplib

    c = httplib.HTTPConnection('www.example.com')
    c.request("HEAD", '/')  # ask for the headers of the root path only
    if c.getresponse().status == 200:
        print('web site exists')

or you can use urllib2

    import urllib2

    try:
        urllib2.urlopen('http://www.example.com/some_page')
    except urllib2.HTTPError as e:
        # the server answered, but with an error status
        print(e.code)
    except urllib2.URLError as e:
        # the server could not be reached at all
        print(e.args)

or you can use requests

    import requests

    response = requests.get('http://www.example.com')
    if response.status_code == 200:
        print('Web site exists')
    else:
        print('Web site does not exist')
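
The httplib snippet at the start of this answer is Python 2 code; in Python 3 the same module is called http.client. Below is a minimal sketch of the HEAD approach for Python 3, not a definitive implementation. The function name head_status and its connection_cls parameter are illustrative names that do not come from the answer; connection_cls exists only so the function can be tried without a network.

```python
import http.client


def head_status(host, path="/", timeout=5.0,
                connection_cls=http.client.HTTPConnection):
    """Send a HEAD request and return the HTTP status code.

    Returns None when the host cannot be reached or the reply is not
    valid HTTP. connection_cls is an illustrative hook for testing.
    """
    conn = connection_cls(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # OSError covers DNS failures, refused connections and timeouts.
        return None
    finally:
        conn.close()
```

A call such as head_status("www.example.com") would return 200 if the server answers the HEAD request successfully. Note that http.client does not follow redirects, so a 301 or 302 is returned as-is.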


Answer 2:

It's better to check that the status code is below 400. Here is what the status codes mean (taken from Wikipedia):

  • 1xx - informational
  • 2xx - success
  • 3xx - redirection
  • 4xx - client error
  • 5xx - server error
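
The classes listed above can be expressed as a small helper. This is only a sketch: the names status_class and is_error are made up for illustration, and treating every 4xx and 5xx code as "the page does not exist" is a simplifying assumption.

```python
# Map the first digit of an HTTP status code to its class name.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the class name for an HTTP status code, or 'unknown'."""
    return STATUS_CLASSES.get(code // 100, "unknown")


def is_error(code):
    """Return True for 4xx (client error) and 5xx (server error) codes."""
    return code >= 400
```

For example, status_class(404) gives "client error", and is_error(301) is False because a redirect is not an error.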

If you want to check whether a page exists without downloading the whole page, use a HEAD request:

    import httplib2

    h = httplib2.Http()
    resp = h.request("http://www.google.com", 'HEAD')
    # resp is a (response, content) pair; the status is stored as a string
    assert int(resp[0]['status']) < 400

taken from this answer.

If you want to download the whole page, just make a normal request and check the status code. Example using requests:

    import requests

    response = requests.get('http://google.com')
    assert response.status_code < 400


Hope that helps.



Answer 3:

    from urllib2 import Request, urlopen, HTTPError, URLError

    user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    link = "http://www.abc.com/"
    req = Request(link, headers=headers)
    try:
        page_open = urlopen(req)
    except HTTPError as e:
        # the server answered with an error status code
        print(e.code)
    except URLError as e:
        # the server could not be reached
        print(e.reason)
    else:
        print('ok')

To answer the comment of unutbu:

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range. Source
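
In Python 3, urllib2 was split into urllib.request and urllib.error. The sketch below shows the same try/except pattern under that assumption. The function name check_url, the opener parameter and the User-Agent string are illustrative and do not come from the answer. HTTPError is caught first because it is a subclass of URLError.

```python
import urllib.request
from urllib.error import HTTPError, URLError

# Hypothetical User-Agent value, used only for this example.
USER_AGENT = "Mozilla/5.0 (compatible; example-checker)"


def check_url(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (exists, detail) for url.

    detail is the HTTP status code when the server answered, or the
    failure reason when it could not be reached. opener is an
    illustrative hook so the function can be tried without a network.
    """
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with opener(req, timeout=timeout) as resp:
            return True, resp.status
    except HTTPError as e:
        # The server answered with an error status (usually 4xx or 5xx).
        return False, e.code
    except URLError as e:
        # No HTTP answer at all: DNS failure, refused connection, ...
        return False, e.reason
```

Calling check_url("http://www.example.com/some_page") would give (False, 404) for a missing page. Other exceptions, for example a timeout while the response is being read, may still propagate and are not handled in this sketch.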



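Answer 4:

Code:

    import urllib

    a = "http://www.example.com"
    try:
        print(urllib.urlopen(a))
    except IOError:
        # IOError covers failures to connect (e.g. the host name does not resolve)
        print(a + "  site does not exist")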


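Answer 5:

    import urllib.request
    from urllib.error import HTTPError, URLError

    def isok(mypath):
        """Return True if mypath can be opened, False otherwise."""
        try:
            thepage = urllib.request.urlopen(mypath)
        except HTTPError as e:
            # the server answered with an error status
            return False
        except URLError as e:
            # the server could not be reached
            return False
        else:
            return True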

