urllib2

urllib2 HTTP error 429

那年仲夏 submitted on 2019-11-30 07:10:32
So I have a list of sub-reddits and I'm using urllib2 to open them. As I go through them, urllib2 eventually fails with:

    urllib2.HTTPError: HTTP Error 429: Unknown

Doing some research, I found that reddit limits the amount of requests to their servers by IP:

    "Make no more than one request every two seconds. There's some allowance for bursts of requests, but keep it sane. In general, keep it to no more than 30 requests in a minute."

So I figured I'd use time.sleep() to limit my requests to one page every 10 seconds. This ends up failing just as well. The quote above is taken from the reddit API rules.
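A minimal sketch of the time.sleep() throttling described above, assuming a hypothetical list of subreddit URLs (the URLs are placeholders, not from the original post; the 10-second pause matches the approach in the question):

    # Sketch: throttle urllib2 requests with time.sleep() between each fetch.
    import time
    import urllib2

    subreddits = ['http://www.reddit.com/r/python/.json',
                  'http://www.reddit.com/r/learnpython/.json']  # hypothetical list

    for url in subreddits:
        try:
            page = urllib2.urlopen(url).read()
            print url, len(page)
        except urllib2.HTTPError, e:
            print 'failed with HTTP', e.code, 'on', url
        time.sleep(10)  # pause 10 seconds between requests, as described above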

Downloading a web page and all of its resource files in Python

心已入冬 submitted on 2019-11-30 07:08:42
I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc.) using Python. I am (somewhat) familiar with urllib2 and know how to download individual URLs, but before I go and start hacking at BeautifulSoup + urllib2 I wanted to be sure that there wasn't already a Python equivalent to "wget --page-requisites http://www.google.com". Specifically, I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources. Thanks, Mark

Websucker? See http://effbot.org/zone/websucker.htm
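A rough sketch of the BeautifulSoup + urllib2 approach the question considers, timing the download of a page plus its images, scripts, and linked stylesheets (the URL is just the example from the question; error handling is minimal and this is not a drop-in wget replacement):

    # Sketch: fetch a page, find its resource URLs, download each, and time the whole thing.
    import time
    import urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; bs4 works similarly

    def download_page_with_resources(url):
        start = time.time()
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        resources = []
        resources += [tag.get('src') for tag in soup.findAll('img') if tag.get('src')]
        resources += [tag.get('src') for tag in soup.findAll('script') if tag.get('src')]
        resources += [tag.get('href') for tag in soup.findAll('link') if tag.get('href')]
        for res in resources:
            try:
                urllib2.urlopen(urlparse.urljoin(url, res)).read()
            except urllib2.URLError:
                pass  # ignore broken resources for timing purposes
        return time.time() - start

    print download_page_with_resources('http://www.google.com')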

How to debug urllib2 request that uses a basic authentication handler

血红的双手。 submitted on 2019-11-30 07:05:56
Question: I'm making a request using urllib2 and the HTTPBasicAuthHandler like so:

    import urllib2

    theurl = 'http://someurl.com'
    username = 'username'
    password = 'password'

    passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, theurl, username, password)
    authhandler = urllib2.HTTPBasicAuthHandler(passman)
    opener = urllib2.build_opener(authhandler)
    urllib2.install_opener(opener)

    params = "foo=bar"
    response = urllib2.urlopen('http://someurl.com/somescript.cgi', params)
    print
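One way to see what the opener actually sends on the wire (a standard urllib2 facility, not necessarily the thread's answer) is to add an HTTPHandler with debuglevel set to 1, which dumps the request and response headers to stdout; a sketch using the same placeholder URLs as above:

    # Sketch: same basic-auth setup, plus a debugging HTTPHandler to trace headers.
    import urllib2

    theurl = 'http://someurl.com'
    passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, theurl, 'username', 'password')
    authhandler = urllib2.HTTPBasicAuthHandler(passman)
    debughandler = urllib2.HTTPHandler(debuglevel=1)  # prints sent/received headers

    opener = urllib2.build_opener(authhandler, debughandler)
    urllib2.install_opener(opener)

    response = urllib2.urlopen('http://someurl.com/somescript.cgi', 'foo=bar')
    print response.read()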

How do I draw out specific data from an opened url in Python using urllib2?

我只是一个虾纸丫 submitted on 2019-11-30 05:57:09
I'm new to Python and am playing around with making a very basic web crawler. For instance, I have made a simple function to load a page that shows the high scores for an online game. So I am able to get the source code of the HTML page, but I need to pull specific numbers out of that page. For instance, the webpage looks like this: http://hiscore.runescape.com/hiscorepersonal.ws?user1=bigdrizzle13 where 'bigdrizzle13' is the unique part of the link. The numbers on that page need to be pulled out and returned. Essentially, I want to build a program where all I would have to do is type in
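A sketch of one common way to do this (not necessarily how the original poster solved it): fetch the page with urllib2 and pull the numeric table cells out with BeautifulSoup:

    # Sketch: fetch a hiscores page and collect table-cell contents that look like numbers.
    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

    def high_scores(username):
        url = 'http://hiscore.runescape.com/hiscorepersonal.ws?user1=' + username
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html)
        numbers = []
        for cell in soup.findAll('td'):
            text = ''.join(cell.findAll(text=True)).strip().replace(',', '')
            if text.lstrip('-').isdigit():
                numbers.append(int(text))
        return numbers

    print high_scores('bigdrizzle13')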

How can I force urllib2 to time out?

爷，独闯天下 submitted on 2019-11-30 05:40:53
I want to test my application's handling of timeouts when grabbing data via urllib2, and I want to have some way to force the request to time out. Short of finding a very, very slow internet connection, what method can I use? I seem to remember an interesting application/suite for simulating these sorts of things. Maybe someone knows the link?

I usually use netcat to listen on port 80 of my local machine:

    nc -l 80

Then I use http://localhost/ as the request URL in my application. Netcat will answer at the HTTP port but won't ever give a response, so the request is guaranteed to time out.
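A small sketch of the application side of that setup, assuming Python 2.6+ where urlopen() accepts a timeout argument, with nc -l 80 running locally:

    # Sketch: request against the local netcat listener and confirm the timeout fires.
    import socket
    import urllib2

    try:
        # netcat accepts the connection but never answers, so this should time out.
        urllib2.urlopen('http://localhost/', timeout=5)
    except urllib2.URLError, e:
        print 'timed out as expected:', e.reason
    except socket.timeout:
        print 'timed out as expected'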

Python urllib2 file upload problems

烂漫一生 submitted on 2019-11-30 05:23:57
I'm currently trying to initiate a file upload with urllib2 and the urllib2_file library. Here's my code:

    import sys
    import urllib2_file
    import urllib2

    URL = 'http://aquate.us/upload.php'
    d = [('uploaded', open(sys.argv[1:]))]
    req = urllib2.Request(URL, d)
    u = urllib2.urlopen(req)
    print u.read()

I've placed this .py file in my My Documents directory and placed a shortcut to it in my Send To folder (the shortcut URL is ). When I right-click a file, choose Send To, and select Aquate (my Python script), it opens a command prompt for a split second and then closes it. Nothing gets uploaded. I knew there
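One likely culprit worth noting alongside the question: open(sys.argv[1:]) passes a list (everything after the script name) to open(), which expects a single path string. A hedged sketch of the corrected call, assuming urllib2_file's convention of passing an open file object in the request data is otherwise right:

    # Sketch: same upload, but open a single path (sys.argv[1]) instead of a list slice.
    import sys
    import urllib2_file  # patches urllib2 so file objects in POST data become an upload
    import urllib2

    URL = 'http://aquate.us/upload.php'

    if len(sys.argv) < 2:
        sys.exit('usage: aquate.py <file-to-upload>')

    d = [('uploaded', open(sys.argv[1], 'rb'))]  # one file path, opened in binary mode
    req = urllib2.Request(URL, d)
    u = urllib2.urlopen(req)
    print u.read()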

urllib2 with cookies

走远了吗. submitted on 2019-11-30 04:57:53
Question: I am trying to make a request to an RSS feed that requires a cookie, using Python. I thought using urllib2 and adding the appropriate header would be sufficient, but the request keeps coming back unauthorized. I'm guessing it could be a problem on the remote site's side, but I wasn't sure. How do I use urllib2 along with cookies? Is there a better package for this (like httplib, mechanize, curl)?

Answer 1:

    import urllib2
    opener = urllib2.build_opener()
    opener.addheaders.append(('Cookie', 'cookiename
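For completeness, a sketch of the cookielib route, the standard-library way to have urllib2 manage cookies automatically; the feed URL and cookie value below are placeholders, not from the original thread:

    # Sketch: let a CookieJar collect and resend cookies across urllib2 requests.
    import cookielib
    import urllib2

    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # Either hit a login/session page first so the jar captures the cookie,
    # or add the header by hand as in the answer above:
    opener.addheaders.append(('Cookie', 'sessionid=PLACEHOLDER'))  # placeholder cookie

    feed = opener.open('http://example.com/private.rss')  # placeholder feed URL
    print feed.read()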

urllib2.urlopen() vs urllib.urlopen() - urllib2 throws 404 while urllib works! WHY?

我的梦境 submitted on 2019-11-30 04:53:58
    import urllib
    print urllib.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()

The above script works and returns the expected results, while:

    import urllib2
    print urllib2.urlopen('http://www.reefgeek.com/equipment/Controllers_&_Monitors/Neptune_Systems_AquaController/Apex_Controller_&_Accessories/').read()

throws the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/urllib2.py", line 124, in urlopen
        return _opener.open(url, data)
      File "/usr/lib
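A sketch of how one might dig into the 404 (generic debugging, not the thread's answer): catch the HTTPError, which doubles as a response object, inspect what the server actually said, and then retry with an explicit User-Agent to see whether the server treats the two libraries' default headers differently:

    # Sketch: inspect the HTTPError urllib2 raises, then retry with a custom User-Agent.
    import urllib2

    url = ('http://www.reefgeek.com/equipment/Controllers_&_Monitors/'
           'Neptune_Systems_AquaController/Apex_Controller_&_Accessories/')

    try:
        print urllib2.urlopen(url).read()
    except urllib2.HTTPError, e:
        print 'status:', e.code      # e.g. 404
        print 'headers:', e.headers  # what the server sent back
        print e.read()[:500]         # beginning of the error body

    # Experiment: does an explicit User-Agent change the response?
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    print urllib2.urlopen(req).read()[:200]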

Checking whether a link is dead or not using Python without downloading the webpage

不问归期 submitted on 2019-11-30 04:50:35
Question: For those who know wget, it has an option --spider, which allows one to check whether a link is broken or not, without actually downloading the webpage. I would like to do the same thing in Python. My problem is that I have a list of 100,000 links I want to check, at most once a day, and at least once a week. In any case this will generate a lot of unnecessary traffic. As far as I understand from the urllib2.urlopen() documentation, it does not download the page but only the meta-information.
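One standard way to avoid pulling the body at all (a sketch, not necessarily the thread's accepted answer) is to issue a HEAD request by overriding get_method() on urllib2.Request, so the server returns only headers:

    # Sketch: HEAD-only link check; no response body is downloaded.
    import urllib2

    class HeadRequest(urllib2.Request):
        def get_method(self):
            return 'HEAD'

    def link_is_alive(url):
        try:
            urllib2.urlopen(HeadRequest(url))
            return True
        except urllib2.HTTPError:
            return False  # 4xx/5xx: treat the link as dead
        except urllib2.URLError:
            return False  # DNS failure, refused connection, etc.

    print link_is_alive('http://www.google.com/')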

timeout for urllib2.urlopen() in pre Python 2.6 versions

情到浓时终转凉″ submitted on 2019-11-30 04:27:27
The urllib2 documentation says that the timeout parameter was added in Python 2.6. Unfortunately, my code base has been running on Python 2.5 and 2.4 platforms. Is there any alternate way to simulate the timeout? All I want to do is allow the code to talk to the remote server for a fixed amount of time. Perhaps an alternative built-in library? (I don't want to install anything third-party, like pycurl.)

You can set a global timeout for all socket operations (including HTTP requests) by using socket.setdefaulttimeout(), like this:

    import urllib2
    import socket

    socket.setdefaulttimeout(30)
    f = urllib2.urlopen('http:/
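A complete version of that idea as a sketch (the URL is a placeholder; the truncated original above is left as-is):

    # Sketch: a global socket timeout works on Python 2.4/2.5, where urlopen() has no timeout argument.
    import socket
    import urllib2

    socket.setdefaulttimeout(30)  # seconds; applies to every new socket, not just this request

    try:
        f = urllib2.urlopen('http://example.com/')  # placeholder URL
        print f.read()[:200]
    except urllib2.URLError, e:
        print 'request failed (possibly timed out):', e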