Why can't I scrape Amazon with BeautifulSoup?

Question

Here is my Python code:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

It works for google.com and many other websites, but not for amazon.com.

I can open amazon.com in my browser, but the resulting "soup" is still empty.

It also fails to scrape appannie.com. There, instead of returning an empty result, the code raises an error:

HTTPError: HTTP Error 503: Service Temporarily Unavailable

So I suspect that Amazon and App Annie block scraping.

Please try it yourself before downvoting the question :(

Thanks

Answer 1:

Add a User-Agent header, and it will work.

from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"

# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print(soup)
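To tell a rejected request apart from a genuinely empty page, it can help to look at the HTTP status code instead of only printing the soup. The following is a minimal diagnostic sketch, not part of the original answer. It assumes Python 3 and uses only the standard library rather than requests. The names `fetch_status`, `BROWSER_UA`, and the `opener` parameter are hypothetical and exist only for illustration.

```python
import urllib.error
import urllib.request

# Hypothetical default; any realistic browser User-Agent string could be used.
BROWSER_UA = "Mozilla/5.0 (Windows NT 6.3; Win64; x64)"


def fetch_status(url, user_agent=BROWSER_UA, opener=urllib.request.urlopen):
    """Return (status_code, body_bytes) without raising on HTTP error codes."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request) as response:
            return response.getcode(), response.read()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code (e.g. 503) and any error body.
        return err.code, err.read()
```

Calling `fetch_status("http://www.amazon.com/")` once with the default and once with a different `user_agent` would show whether the status code changes with the header. A 503 in one case and a 200 in the other would suggest that the server filters on the User-Agent.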

Answer 2:

You can try this:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

In Python, arbitrary text is called a string, and it must be enclosed in quotes (" ").

Answer 3:

Add a header:

import urllib2
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}

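# Wrap the URL in a Request so the headers are actually sent;
# urlopen on a bare URL would use the default Python User-Agent.
request = urllib2.Request("http://www.amazon.com/", headers=headers)
page = urllib2.urlopen(request)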
soup = BeautifulSoup(page)
print soup
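The snippets in this thread use `urllib2`, which exists only in Python 2. As a rough sketch of the same idea in Python 3, the header can be attached through `urllib.request.Request`. The helper names `build_request` and `fetch_html` are hypothetical, and the User-Agent string is simply the one from this answer. Whether a given site accepts the request may depend on more than this header.

```python
import urllib.request

# Illustrative User-Agent taken from the answer; any current browser string could be used.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download the page body as bytes, sending the custom header."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

The bytes returned by `fetch_html("http://www.amazon.com/")` could then be passed to `BeautifulSoup(html, "html.parser")` in the same way as the `page` object in the snippets here.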

Source: https://stackoverflow.com/questions/23555283/why-cant-i-scrape-amazon-by-beautifulsoup

Tags

python

beautifulsoup

amazon