Question
I have been using Amazon's Product Advertising API to generate URLs that contain the price for a given book. One URL that I have generated is the following:
http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327
When I click the link or paste it into the address bar, the web page loads fine. However, when I execute the following code, I get an error:
url = "http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327"
html_contents = urllib2.urlopen(url)
The error is urllib2.HTTPError: HTTP Error 503: Service Unavailable. First of all, I don't understand why I even get this error, since the web page loads successfully in a browser.
Another odd behavior I have noticed is that the following code sometimes raises the stated error and sometimes does not:
html_contents = urllib2.urlopen("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327")
I am totally lost as to why this happens. Is there any fix or workaround? My goal is to read the HTML contents of the URL.
EDIT
I don't know why Stack Overflow is changing the Amazon link in my code above to rads.stackoverflow. Anyway, ignore the rads.stackoverflow link and use my link above between the quotes.
Answer 1:
It's because Amazon doesn't allow automated access to its data, so it is rejecting your request because it didn't come from a proper browser. If you look at the content of the 503 response, it says:
To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
This is because the User-Agent for Python's urllib is so obviously not a browser. You could always fake the User-Agent, but that's not really good (or moral) practice.
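If you want to confirm what the server is sending back, a minimal sketch (using nothing beyond urllib2's standard HTTPError object, which is file-like and can be read) is to catch the error and print its body:

import urllib2

url = "http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327"
try:
    html_contents = urllib2.urlopen(url).read()
except urllib2.HTTPError as e:
    print e.code    # 503
    print e.read()  # the response body containing Amazon's automated-access message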
As a side note, as mentioned in another answer, the requests library is really good for HTTP access in Python.
Answer 2:
Amazon is rejecting the default User-Agent for urllib2. One workaround is to use the requests module:
import requests
page = requests.get("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327")
html_contents = page.text
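If the default User-Agent sent by requests is ever rejected in the same way, you can also pass a browser-like header explicitly (a sketch of the same workaround; the header value is just an example):

import requests

headers = {"User-Agent": "Mozilla/5.0"}
page = requests.get("http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327", headers=headers)
html_contents = page.text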
If you insist on using urllib2, this is how you can fake the User-Agent header:
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327')
html_contents = response.read()
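For completeness, urllib2 only exists on Python 2; on Python 3 the same approach lives in urllib.request, and the response body comes back as bytes (a sketch assuming a UTF-8 page):

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open('http://www.amazon.com/gp/offer-listing/0415376327%3FSubscriptionId%3DAKIAJZY2VTI5JQ66K7QQ%26tag%3Damaztest04-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D386001%26creativeASIN%3D0415376327')
html_contents = response.read().decode('utf-8')  # bytes in Python 3, so decode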
Don't worry about Stack Overflow editing the URL. They explain that they do this here.
Source: https://stackoverflow.com/questions/25936072/python-urllib2-httperror-http-error-503-service-unavailable-on-valid-website