Error 410 (“resource no longer available”) while getting the HTML of a URL in Python

Question

I am trying to get the html of the following link:

http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html

To do so, I proceeded as follows:

import requests
try:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3 package
except ImportError:
    from bs4 import BeautifulSoup  # BeautifulSoup 4

url = 'http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'
html = requests.get(url)

And the html code I get (print(html.text)) is the following:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head> 
<title>410 Gone</title>
</head><body>
<h1>Gone</h1>
<p>The requested resource
<br />/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html<br />
is no longer available on this server and there is no forwarding address.
Please remove all references to this resource.</p>
</body></html>

I don't understand why this happens, since the link exists and so does its content. If I open the link in a browser, the HTML is very different from what I get here. How can I get the actual page content?

Thank you in advance

Answer 1:

The server seems fussy about which user agent is accessing the resource. You can set your own user agent using the headers parameter of requests.get():

>>> import requests
>>> url = 'http://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'
>>> headers = {'User-Agent': 'whatever'}

>>> r = requests.get(url)  # default User-Agent
>>> r
<Response [410]>

>>> r = requests.get(url, headers=headers)  # custom User-Agent
>>> r
<Response [200]>

The server is rejecting requests that contain substrings such as "curl", "python", "wget" etc. in the User-Agent header.
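
To make this concrete, here is a minimal sketch of how the fix might be wrapped in a small helper. The names is_probably_blocked and fetch_html, the example User-Agent string, and the timeout value are hypothetical, not part of requests or of the site. The substring list only mirrors the three examples given above, and the case-insensitive comparison is an assumption; the server's real rules are unknown. The sketch also prints the default User-Agent that requests sends, which contains "python" and so would explain the 410.

```python
import requests

# Substrings reported above as being rejected. The real server-side list is
# unknown, so this is only an illustration.
BLOCKED_SUBSTRINGS = ("curl", "python", "wget")


def is_probably_blocked(user_agent):
    """Return True if the User-Agent contains one of the reported substrings."""
    ua = user_agent.lower()  # case-insensitive matching is an assumption
    return any(s in ua for s in BLOCKED_SUBSTRINGS)


def fetch_html(url, user_agent="Mozilla/5.0 (compatible; my-scraper/0.1)", timeout=10):
    """Fetch a page with an explicit User-Agent and raise on 4xx/5xx responses."""
    response = requests.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
    response.raise_for_status()  # a 410 raises requests.HTTPError here
    return response.text


# The default User-Agent that requests sends has the form "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua, is_probably_blocked(default_ua))
```

With a helper like this, fetch_html(url) would return the page text, which can then be passed to BeautifulSoup(html, 'html.parser'). Calling raise_for_status() makes a rejected request fail loudly instead of silently returning the "410 Gone" error page.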

Source: https://stackoverflow.com/questions/49530440/error-410-resource-no-longer-available-while-getting-html-code-of-an-url-in

Tags

python

url

html-parsing