Why can't I scrape Amazon by BeautifulSoup?

女生的网名这么多〃 submitted on 2019-12-04 02:58:13

Question


Here is my Python code:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

It works for google.com and many other websites, but it doesn't work for amazon.com.

I can open amazon.com in my browser, but the resulting "soup" still comes out as None.

I also find that it cannot scrape appannie.com. In that case, rather than giving None, the code raises an error:

HTTPError: HTTP Error 503: Service Temporarily Unavailable 

So I suspect that Amazon and App Annie block scraping.

Please try it yourself instead of just downvoting the question :(

Thanks


Answer 1:


Add a User-Agent header to the request, and it will work.

from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"

# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print(soup)
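To check whether a site is rejecting the request outright (as with the 503 in the question), it can help to look at the status code with and without the header. The following is only a rough sketch using the Python 3 standard library, where `urllib2` became `urllib.request`; the function name `fetch_status` and its `opener` parameter are made up for this illustration and are not part of any library.

```python
import urllib.error
import urllib.request


def fetch_status(url, headers=None, opener=urllib.request.urlopen):
    """Return the HTTP status code the server sends back for `url`.

    `opener` is a hypothetical hook so the function can be tried
    without network access; by default it is urllib's urlopen.
    """
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is on the exception
        return err.code
```

Calling `fetch_status("http://www.amazon.com/")` and then `fetch_status("http://www.amazon.com/", headers=headers)` with the `headers` dict from this answer should show whether the User-Agent makes a difference. The result depends on how the site treats the request at the time.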



Answer 2:


You can try this:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup

In Python, arbitrary text is called a string, and it must be enclosed in quotes (" ").




Answer 3:


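Add a User-Agent header and send it with the request:

import urllib2
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}

# Wrap the URL in a Request so the headers are actually sent
request = urllib2.Request("http://www.amazon.com/", headers=headers)
page = urllib2.urlopen(request)
soup = BeautifulSoup(page)
print soup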
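The code in this thread targets Python 2. On Python 3, `urllib2` no longer exists and its functionality lives in `urllib.request`. Below is a minimal sketch of the same idea under that assumption; the helper names `build_request` and `fetch_html` are hypothetical, and the User-Agent string is simply the one from this answer.

```python
import urllib.request

# Browser-like User-Agent string taken from the answer above
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
)


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download `url` with the header attached and return the decoded text."""
    with urllib.request.urlopen(build_request(url)) as response:
        # Fall back to UTF-8 if the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

The returned text can then be handed to BeautifulSoup, for example `BeautifulSoup(fetch_html("http://www.amazon.com/"), "html.parser")`. Whether the site still serves the page to such a request is up to the site.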


Source: https://stackoverflow.com/questions/23555283/why-cant-i-scrape-amazon-by-beautifulsoup
