urllib2

gevent / requests hangs while making lots of head requests

I need to make 100k HEAD requests, and I'm using gevent on top of requests. My code runs for a while, but then eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument in both requests and gevent. Please take a look at my code snippet below, and let me know what I should change.

```python
import gevent
from gevent import monkey, pool
monkey.patch_all()
import requests

def get_head(url, timeout=3):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except:
        return None

def expand_short_urls(short_urls, chunk  # [snippet truncated]
```
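Two things in the snippet above are worth suspecting before blaming gevent: the bare `except:` in `get_head` also swallows `gevent.Timeout` and `KeyboardInterrupt`, hiding the real failure, and requests' `timeout` bounds individual socket operations rather than the whole request, so a server that trickles bytes can hold a greenlet far longer than 3 seconds. Below is a minimal sketch of the chunked, bounded-pool shape that helps here; it uses the stdlib thread pool instead of gevent so it stands alone (the same structure applies to `gevent.pool.Pool`), and `fetch` is a stand-in for the `requests.head` call:

```python
# Sketch: bound concurrency and work in chunks when fetching ~100k URLs.
from multiprocessing.dummy import Pool  # thread-based Pool from the stdlib

def expand_in_chunks(urls, fetch, pool_size=50, chunk_size=1000):
    """Call fetch(url) for every url with at most pool_size in flight."""
    results = {}
    pool = Pool(pool_size)
    try:
        # Chunking makes progress observable: one stuck chunk shows up in
        # logs instead of silently stalling the whole run.
        for start in range(0, len(urls), chunk_size):
            chunk = urls[start:start + chunk_size]
            for url, result in zip(chunk, pool.map(fetch, chunk)):
                results[url] = result
    finally:
        pool.close()
        pool.join()
    return results
```

Catching specific exceptions (`requests.RequestException`) instead of a bare `except:` inside the fetch function makes the eventual hang much easier to diagnose.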

Batch downloading text and images from URL with Python / urllib / beautifulsoup?

I have been browsing through several posts here, but I just cannot get my head around batch-downloading images and text from a given URL with Python.

```python
import urllib, urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)
    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            # [snippet truncated]
```
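As a sketch of the extraction half, here is a stdlib-only parser (no BeautifulSoup required) that collects `img` sources inside `div class="thumbnail"` blocks and resolves relative URLs against the page URL; the class name comes from the snippet above, and `ImgCollector` is my own name:

```python
try:
    from HTMLParser import HTMLParser      # Python 2, as in the question
    from urlparse import urljoin
except ImportError:
    from html.parser import HTMLParser     # Python 3 locations
    from urllib.parse import urljoin

class ImgCollector(HTMLParser):
    """Collect absolute img URLs found inside <div class="thumbnail"> blocks."""
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.div_stack = []   # True entries mark divs inside a thumbnail div
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            inside = bool(self.div_stack and self.div_stack[-1])
            classes = (attrs.get('class') or '').split()
            self.div_stack.append(inside or 'thumbnail' in classes)
        elif tag == 'img' and self.div_stack and self.div_stack[-1]:
            src = attrs.get('src')
            if src:
                self.srcs.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_stack:
            self.div_stack.pop()
```

Each collected URL can then be downloaded with `urllib.urlretrieve` (Python 2), writing one file per image.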

Why does Python's urllib2.urlopen() raise an HTTPError for successful status codes?

Question: According to the urllib2 documentation,

"Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range."

And yet the following code

```python
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
```

raises an HTTPError with code 201 (Created):

```
ERROR 2011-08-11 20:40:17,318 __init__.py:463] HTTP Error 201: Created
```

So why is urllib2 throwing HTTPErrors on this successful status code?
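The `__init__.py:463` log line suggests an old runtime (App Engine shipped Python 2.5 at the time), and in 2.5-era urllib2 the default `HTTPErrorProcessor` treated only 200 and 206 as success; 2.6 widened that to the whole 2xx range. If you are stuck on such a version, one workaround is to install a processor that accepts any 2xx. A sketch (the class name is mine):

```python
try:
    import urllib2 as urlreq               # Python 2, as in the question
except ImportError:
    import urllib.request as urlreq        # same classes live here in Python 3

class Accept2xxProcessor(urlreq.HTTPErrorProcessor):
    """Pass every 2xx response through instead of routing codes such as
    201 Created to the error handlers (old urllib2 whitelisted only
    200 and 206)."""
    def http_response(self, request, response):
        if 200 <= response.code < 300:
            return response
        return urlreq.HTTPErrorProcessor.http_response(self, request, response)
    https_response = http_response

# opener = urlreq.build_opener(Accept2xxProcessor)
# response = opener.open(request)          # a 201 now returns normally
```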

HTTPS log in with urllib2

I currently have a little script that downloads a webpage and extracts some data I'm interested in. Nothing fancy. Currently I'm downloading the page like so:

```python
import commands
command = 'wget --output-document=- --quiet --http-user=USER --http-password=PASSWORD https://www.example.ca/page.aspx'
status, text = commands.getstatusoutput(command)
```

Although this works perfectly, I thought it would make sense to remove the dependency on wget. I thought it should be trivial to convert the above to urllib2, but thus far I've had zero success. The Internet is full of urllib2 examples, but I haven't found …
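A likely reason the straightforward urllib2 port fails: `HTTPBasicAuthHandler` only sends credentials in response to a 401 challenge, whereas wget (at least older releases) sends them preemptively, and some servers never issue a proper challenge. A minimal sketch that builds the Basic header by hand (the function name is mine):

```python
import base64
try:
    import urllib2 as urlreq               # Python 2, as in the question
except ImportError:
    import urllib.request as urlreq        # Python 3 name

def basic_auth_request(url, user, password):
    """Build a Request that carries HTTP Basic credentials up front,
    the way wget --http-user/--http-password does."""
    req = urlreq.Request(url)
    creds = ('%s:%s' % (user, password)).encode('utf-8')
    req.add_header('Authorization',
                   'Basic ' + base64.b64encode(creds).decode('ascii'))
    return req

# text = urlreq.urlopen(basic_auth_request(
#     'https://www.example.ca/page.aspx', 'USER', 'PASSWORD')).read()
```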

debugging python web service

I am using the instructions found here to try to inspect the HTTP commands being sent to my web server. However, I am not seeing the HTTP commands printed on the console as the tutorial suggests. Does anyone know how to display/debug the HTTP commands at the CLI? I am running Python 2.6.5 on Ubuntu Linux.

Answer: The tutorial information seems to be deprecated. The correct way to debug with urllib2 nowadays is:

```python
import urllib2

request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
feeddata = opener.open(request).read()
```

Python Process blocked by urllib2

I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the system hangs.

```python
import urllib2, multiprocessing
from threading import Thread
from Queue import Queue
from multiprocessing import Queue as ProcessQueue, Process

def download(url):
    """Download a page from a url.

    url [str]: url to get.
    return [unicode]: page downloaded.
    """
    if settings.DEBUG:
        print u'Downloading %s' % url
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    encoding = response.headers['content-type'].split('charset=')[-1]
    content = unicode(response.read(),  # [snippet truncated]
```
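One frequent cause of this symptom is an unbounded network wait rather than multiprocessing itself: without a timeout, `urlopen` can block forever on an unresponsive host. A sketch of a more defensive `download` (the error handling and the utf-8 charset fallback are my additions):

```python
import socket
try:
    from urllib2 import Request, urlopen, URLError     # Python 2
except ImportError:
    from urllib.request import Request, urlopen        # Python 3 names
    from urllib.error import URLError

# A process-wide default covers any call that forgets timeout=; without
# one, a dead or silent server can hang urlopen() indefinitely and the
# worker process simply looks stuck.
socket.setdefaulttimeout(10)

def download(url, timeout=10):
    """Fetch url and return the decoded body, or None on failure."""
    try:
        response = urlopen(Request(url), timeout=timeout)
    except (URLError, socket.error):
        return None
    ctype = response.headers.get('content-type', '')
    charset = ctype.split('charset=')[-1] if 'charset=' in ctype else 'utf-8'
    return response.read().decode(charset)
```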

How to POST an xml element in python

Question: Basically I have this XML element (xml.etree.ElementTree) and I want to POST it to a URL. Currently I'm doing something like:

```python
xml_string = xml.etree.ElementTree.tostring(my_element)
data = urllib.urlencode({'xml': xml_string})
response = urllib2.urlopen(url, data)
```

I'm pretty sure that works and all, but I was wondering if there is some better practice, or a way to do it without converting it to a string first. Thanks!

Answer 1: If this is your own API, I would consider POSTing as application/xml. The …
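A minimal sketch of that suggestion: send the serialized element as the raw request body with an XML content type, instead of form-encoding it (`xml_post_request` is my own name):

```python
import xml.etree.ElementTree as ET
try:
    from urllib2 import Request, urlopen       # Python 2, as in the question
except ImportError:
    from urllib.request import Request, urlopen

def xml_post_request(url, element):
    """Build a POST whose body is the element itself, not a form field."""
    body = ET.tostring(element)     # bytes in Python 3, str in Python 2
    req = Request(url, data=body)   # supplying data makes it a POST
    req.add_header('Content-Type', 'application/xml')
    return req

# response = urlopen(xml_post_request(url, my_element))
```

The server then reads the document straight from the request body rather than unpacking an `xml` form parameter.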

python urllib2 urlopen response

python urllib2 urlopen response:

```
<addinfourl at 1081306700 whose fp = <socket._fileobject object at 0x4073192c>>
```

expected:

```
{"token":"mYWmzpunvasAT795niiR"}
```

Answer: You need to bind the resulting file-like object to a variable, otherwise the interpreter just dumps it via repr:

```python
>>> import urllib2
>>> urllib2.urlopen('http://www.google.com')
<addinfourl at 18362520 whose fp = <socket._fileobject object at 0x106b250>>

>>> f = urllib2.urlopen('http://www.google.com')
>>> f
<addinfourl at 18635448 whose fp = <socket._fileobject object at 0x106b950>>
```

To get the actual data you need to call read() on that object.

Parse XML from URL into python object

Question: The goodreads website has this API for accessing a user's "shelves":

https://www.goodreads.com/review/list/20990068.xml?key=nGvCqaQ6tn9w4HNpW8kquw&v=2&shelf=toread

It returns XML. I'm trying to create a Django project that shows books on a shelf from this API. I'm looking to find out how (or whether there is a better way than this) to write my view so I can pass an object to my template. Currently, this is what I'm doing:

```python
import urllib2

def homepage(request):
    file = urllib2.urlopen('https://www  # [snippet truncated]
```
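One way to hand the template something friendlier than a raw file object is to parse the response with `xml.etree.ElementTree` and flatten it into plain dicts. A sketch; the `book`/`title` tag names are my assumption about the v2 payload, so verify them against a real response:

```python
import xml.etree.ElementTree as ET

def books_on_shelf(xml_text):
    """Flatten the shelf XML into dicts a Django template can loop over."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.iter('book'):          # tag name assumed, verify
        books.append({'title': book.findtext('title')})
    return books

# In the view (sketch):
# def homepage(request):
#     data = urllib2.urlopen(SHELF_URL).read()
#     return render(request, 'homepage.html', {'books': books_on_shelf(data)})
```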

Python interface to PayPal - urllib.urlencode non-ASCII characters failing

Question: I am trying to implement PayPal IPN functionality. The basic protocol is as follows: The client is redirected from my site to PayPal's site to complete payment, where he logs into his account and authorizes the payment. PayPal then calls a page on my server, passing in details as POST data; the details include a person's name, address, payment info, etc. I need to call a URL on PayPal's site internally from my processing page, passing back all the params that were passed in above, plus an additional one called 'cmd' with …
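In Python 2, `urllib.urlencode` tries to coerce unicode values to ASCII, so a name or address containing non-ASCII characters blows up. A common workaround is to byte-encode every value first, using the charset the PayPal transaction was configured with (the helper name and the utf-8 default are my assumptions):

```python
try:
    from urllib import urlencode           # Python 2, as in the question
except ImportError:
    from urllib.parse import urlencode     # Python 3 location

def encode_params(params, encoding='utf-8'):
    """urlencode() a dict whose values may contain non-ASCII text.

    Each value is converted to bytes in the given encoding first, so
    urlencode never attempts an implicit ASCII conversion.
    """
    safe = {}
    for key, value in params.items():
        if not isinstance(value, bytes):
            value = ('%s' % (value,)).encode(encoding)
        safe[key] = value
    return urlencode(safe)
```

For the IPN verification step, the same params must go back byte-for-byte in the charset PayPal sent them in, so matching the account's configured charset matters more than picking utf-8 by default.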