Python's `urllib2`: Why do I get error 403 when I `urlopen` a Wikipedia page?

轻奢々 2020-12-07 11:07

I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:

http://en.wikipedia.org/wiki/OpenCola_(drink)

This

6 answers
  •  死守一世寂寞
    2020-12-07 11:41

    Wikipedia's stance is:

    Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database.

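    That is why requests sent with Python's default `User-Agent` (`Python-urllib/x.y`) are blocked with a 403. For bulk content, you're supposed to download the data dumps instead.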

    Anyway, you can still read a single page by sending a different `User-Agent` header. In Python 2:

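    import urllib2
    url = "http://en.wikipedia.org/wiki/OpenCola_(drink)"  # the page from the question
    # Override the default User-Agent, which Wikipedia rejects
    req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    print con.read()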
    

    Or in Python 3:

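    import urllib.request  # a plain "import urllib" does not load the request submodule

    url = "http://en.wikipedia.org/wiki/OpenCola_(drink)"  # the page from the question
    # Override the default User-Agent, which Wikipedia rejects
    req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
    con = urllib.request.urlopen(req)
    print(con.read())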
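
If it helps, here is a slightly fuller Python 3 sketch of the same idea. It is not from the answer above. It shows the `User-Agent` that `urllib` sends when you don't set one, and it wraps the request so that a 403 is reported instead of raising. The names `USER_AGENT`, `default_user_agent`, `build_request` and `fetch` are made up for illustration, and the user agent string is only a placeholder.

```python
import urllib.error
import urllib.request

# Hypothetical placeholder: replace with a string that identifies your own script.
USER_AGENT = "ExampleFetcher/0.1 (contact: you@example.com)"


def default_user_agent():
    """Return the User-Agent that urllib sends when none is set explicitly."""
    headers = dict(urllib.request.build_opener().addheaders)
    return headers["User-agent"]


def build_request(url, user_agent=USER_AGENT):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=USER_AGENT):
    """Return the response body as bytes, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # A 403 from Wikipedia ends up here when the User-Agent is rejected.
        print("HTTP error", err.code, "for", url)
        return None
```

Calling `default_user_agent()` should give something starting with `Python-urllib/`, which seems to be the value Wikipedia rejects. `fetch("http://en.wikipedia.org/wiki/OpenCola_(drink)")` would then be the equivalent of the snippet above, with the error handled. This is for fetching the odd single page only; for bulk content the dumps are still the way to go.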
    
