Web scraping using Python

半世苍凉 submitted on 2019-12-11 00:52:08

Question


I am trying to scrape the website http://www.nseindia.com using urllib2 and BeautifulSoup. Unfortunately, I keep getting 403 Forbidden when I try to access the page through Python. I thought it was a user-agent issue, but changing that did not help. Then I thought it might have something to do with cookies, but apparently loading the page in the links text browser with cookies turned off works fine. What might be blocking requests made through urllib2?


Answer 1:


http://www.nseindia.com/ seems to require an Accept header, for whatever reason. This should work:

import urllib2

# Build the request and add explicit Accept and User-Agent headers;
# this server returns 403 Forbidden when no Accept header is sent.
r = urllib2.Request('http://www.nseindia.com/')
r.add_header('Accept', '*/*')
r.add_header('User-Agent', 'My scraping program <author@example.com>')

# Open the request and read the response body.
opener = urllib2.build_opener()
content = opener.open(r).read()
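
For reference, on Python 3 (where urllib2 no longer exists) a roughly equivalent sketch uses urllib.request with the same two headers:

import urllib.request

# Python 3 port of the snippet above: urllib2 was split into urllib.request / urllib.error.
req = urllib.request.Request('http://www.nseindia.com/')
req.add_header('Accept', '*/*')
req.add_header('User-Agent', 'My scraping program <author@example.com>')

opener = urllib.request.build_opener()
content = opener.open(req).read()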

Refusing requests that lack an Accept header is incorrect; RFC 2616 clearly states:

If no Accept header field is present, then it is assumed that the client accepts all media types.
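
For comparison, a bare urlopen call sends no Accept header, and (assuming the server still behaves as described in the question) the request is rejected outright:

import urllib2

try:
    # No Accept header is set here, so this particular server refuses the request.
    urllib2.urlopen('http://www.nseindia.com/')
except urllib2.HTTPError as e:
    print(e.code)  # 403 at the time the question was asked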



Source: https://stackoverflow.com/questions/6969567/web-scraping-using-python
