urllib2

gevent / requests hangs while making lots of head requests

I need to make 100k HEAD requests, and I'm using gevent on top of requests. My code runs for a while, but then eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument in both requests and gevent. Please take a look at my code snippet below, and let me know what I should change.

```python
import gevent
from gevent import monkey, pool
monkey.patch_all()
import requests

def get_head(url, timeout=3):
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except:
        return None

def expand_short_urls(short_urls, chunk  # [snippet truncated]
```
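Two things in the snippet above are worth suspecting before blaming gevent: the bare `except:` in `get_head` also swallows `gevent.Timeout` and `KeyboardInterrupt`, hiding the real failure, and requests' `timeout` bounds individual socket operations rather than the whole request, so a server that trickles bytes can hold a greenlet far longer than 3 seconds. Below is a minimal sketch of the chunked, bounded-pool shape that helps here; it uses the stdlib thread pool instead of gevent so it stands alone (the same structure applies to `gevent.pool.Pool`), and `fetch` is a stand-in for the `requests.head` call:

```python
# Sketch: bound concurrency and work in chunks when fetching ~100k URLs.
from multiprocessing.dummy import Pool  # thread-based Pool from the stdlib

def expand_in_chunks(urls, fetch, pool_size=50, chunk_size=1000):
    """Call fetch(url) for every url with at most pool_size in flight."""
    results = {}
    pool = Pool(pool_size)
    try:
        # Chunking makes progress observable: one stuck chunk shows up in
        # logs instead of silently stalling the whole run.
        for start in range(0, len(urls), chunk_size):
            chunk = urls[start:start + chunk_size]
            for url, result in zip(chunk, pool.map(fetch, chunk)):
                results[url] = result
    finally:
        pool.close()
        pool.join()
    return results
```

Catching specific exceptions (`requests.RequestException`) instead of a bare `except:` inside the fetch function makes the eventual hang much easier to diagnose.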

Batch downloading text and images from URL with Python / urllib / beautifulsoup?

I have been browsing through several posts here, but I just cannot get my head around batch-downloading images and text from a given URL with Python.

```python
import urllib, urllib2
import urlparse
from BeautifulSoup import BeautifulSoup
import os, sys

def getAllImages(url):
    query = urllib2.Request(url)
    user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
    query.add_header("User-Agent", user_agent)
    page = BeautifulSoup(urllib2.urlopen(query))
    for div in page.findAll("div", {"class": "thumbnail"}):
        print "found thumbnail"
        for img in div.findAll("img"):
            # [snippet truncated]
```
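As a sketch of the extraction half, here is a stdlib-only parser (no BeautifulSoup required) that collects `img` sources inside `div class="thumbnail"` blocks and resolves relative URLs against the page URL; the class name comes from the snippet above, and `ImgCollector` is my own name:

```python
try:
    from HTMLParser import HTMLParser      # Python 2, as in the question
    from urlparse import urljoin
except ImportError:
    from html.parser import HTMLParser     # Python 3 locations
    from urllib.parse import urljoin

class ImgCollector(HTMLParser):
    """Collect absolute img URLs found inside <div class="thumbnail"> blocks."""
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.div_stack = []   # True entries mark divs inside a thumbnail div
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            inside = bool(self.div_stack and self.div_stack[-1])
            classes = (attrs.get('class') or '').split()
            self.div_stack.append(inside or 'thumbnail' in classes)
        elif tag == 'img' and self.div_stack and self.div_stack[-1]:
            src = attrs.get('src')
            if src:
                self.srcs.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == 'div' and self.div_stack:
            self.div_stack.pop()
```

Each collected URL can then be downloaded with `urllib.urlretrieve` (Python 2), writing one file per image.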

Why does Python's urllib2.urlopen() raise an HTTPError for successful status codes?

Question: According to the urllib2 documentation,

"Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range."

And yet the following code

```python
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
```

raises an HTTPError with code 201 (Created):

```
ERROR 2011-08-11 20:40:17,318 __init__.py:463] HTTP Error 201: Created
```

So why is urllib2 throwing HTTPErrors on this successful status code?
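The `__init__.py:463` log line suggests an old runtime (App Engine shipped Python 2.5 at the time), and in 2.5-era urllib2 the default `HTTPErrorProcessor` treated only 200 and 206 as success; 2.6 widened that to the whole 2xx range. If you are stuck on such a version, one workaround is to install a processor that accepts any 2xx. A sketch (the class name is mine):

```python
try:
    import urllib2 as urlreq               # Python 2, as in the question
except ImportError:
    import urllib.request as urlreq        # same classes live here in Python 3

class Accept2xxProcessor(urlreq.HTTPErrorProcessor):
    """Pass every 2xx response through instead of routing codes such as
    201 Created to the error handlers (old urllib2 whitelisted only
    200 and 206)."""
    def http_response(self, request, response):
        if 200 <= response.code < 300:
            return response
        return urlreq.HTTPErrorProcessor.http_response(self, request, response)
    https_response = http_response

# opener = urlreq.build_opener(Accept2xxProcessor)
# response = opener.open(request)          # a 201 now returns normally
```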

HTTPS log in with urllib2

I currently have a little script that downloads a webpage and extracts some data I'm interested in. Nothing fancy. Currently I'm downloading the page like so:

```python
import commands
command = 'wget --output-document=- --quiet --http-user=USER --http-password=PASSWORD https://www.example.ca/page.aspx'
status, text = commands.getstatusoutput(command)
```

Although this works perfectly, I thought it would make sense to remove the dependency on wget. I thought it should be trivial to convert the above to urllib2, but thus far I've had zero success. The Internet is full of urllib2 examples, but I haven't found …
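A likely reason the straightforward urllib2 port fails: `HTTPBasicAuthHandler` only sends credentials in response to a 401 challenge, whereas wget (at least older releases) sends them preemptively, and some servers never issue a proper challenge. A minimal sketch that builds the Basic header by hand (the function name is mine):

```python
import base64
try:
    import urllib2 as urlreq               # Python 2, as in the question
except ImportError:
    import urllib.request as urlreq        # Python 3 name

def basic_auth_request(url, user, password):
    """Build a Request that carries HTTP Basic credentials up front,
    the way wget --http-user/--http-password does."""
    req = urlreq.Request(url)
    creds = ('%s:%s' % (user, password)).encode('utf-8')
    req.add_header('Authorization',
                   'Basic ' + base64.b64encode(creds).decode('ascii'))
    return req

# text = urlreq.urlopen(basic_auth_request(
#     'https://www.example.ca/page.aspx', 'USER', 'PASSWORD')).read()
```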

debugging python web service

I am using the instructions found here to try to inspect the HTTP commands being sent to my web server. However, I am not seeing the HTTP commands printed on the console as the tutorial suggests. Does anyone know how to display/debug the HTTP commands at the CLI? I am running Python 2.6.5 on Ubuntu Linux.

Answer: The tutorial information seems to be deprecated. The correct way to debug with urllib2 nowadays is:

```python
import urllib2

request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
opener = urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
feeddata = opener.open(request).read()
```

Python Process blocked by urllib2

I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the system hangs.

```python
import urllib2, multiprocessing
from threading import Thread
from Queue import Queue
from multiprocessing import Queue as ProcessQueue, Process

def download(url):
    """Download a page from a url.

    url [str]: url to get.
    return [unicode]: page downloaded.
    """
    if settings.DEBUG:
        print u'Downloading %s' % url
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    encoding = response.headers['content-type'].split('charset=')[-1]
    content = unicode(response.read(),  # [snippet truncated]
```
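One frequent cause of this symptom is an unbounded network wait rather than multiprocessing itself: without a timeout, `urlopen` can block forever on an unresponsive host. A sketch of a more defensive `download` (the error handling and the utf-8 charset fallback are my additions):

```python
import socket
try:
    from urllib2 import Request, urlopen, URLError     # Python 2
except ImportError:
    from urllib.request import Request, urlopen        # Python 3 names
    from urllib.error import URLError

# A process-wide default covers any call that forgets timeout=; without
# one, a dead or silent server can hang urlopen() indefinitely and the
# worker process simply looks stuck.
socket.setdefaulttimeout(10)

def download(url, timeout=10):
    """Fetch url and return the decoded body, or None on failure."""
    try:
        response = urlopen(Request(url), timeout=timeout)
    except (URLError, socket.error):
        return None
    ctype = response.headers.get('content-type', '')
    charset = ctype.split('charset=')[-1] if 'charset=' in ctype else 'utf-8'
    return response.read().decode(charset)
```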

How to POST an xml element in python

Question: Basically I have this XML element (xml.etree.ElementTree) and I want to POST it to a URL. Currently I'm doing something like:

```python
xml_string = xml.etree.ElementTree.tostring(my_element)
data = urllib.urlencode({'xml': xml_string})
response = urllib2.urlopen(url, data)
```

I'm pretty sure that works and all, but I was wondering if there is some better practice, or a way to do it without converting it to a string first. Thanks!

Answer 1: If this is your own API, I would consider POSTing as application/xml. The …
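A minimal sketch of that suggestion: send the serialized element as the raw request body with an XML content type, instead of form-encoding it (`xml_post_request` is my own name):

```python
import xml.etree.ElementTree as ET
try:
    from urllib2 import Request, urlopen       # Python 2, as in the question
except ImportError:
    from urllib.request import Request, urlopen

def xml_post_request(url, element):
    """Build a POST whose body is the element itself, not a form field."""
    body = ET.tostring(element)     # bytes in Python 3, str in Python 2
    req = Request(url, data=body)   # supplying data makes it a POST
    req.add_header('Content-Type', 'application/xml')
    return req

# response = urlopen(xml_post_request(url, my_element))
```

The server then reads the document straight from the request body rather than unpacking an `xml` form parameter.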

python urllib2 urlopen response

python urllib2 urlopen response:

```
<addinfourl at 1081306700 whose fp = <socket._fileobject object at 0x4073192c>>
```

expected:

```
{"token":"mYWmzpunvasAT795niiR"}
```

Answer: You need to bind the resulting file-like object to a variable, otherwise the interpreter just dumps it via repr:

```python
>>> import urllib2
>>> urllib2.urlopen('http://www.google.com')
<addinfourl at 18362520 whose fp = <socket._fileobject object at 0x106b250>>

>>> f = urllib2.urlopen('http://www.google.com')
>>> f
<addinfourl at 18635448 whose fp = <socket._fileobject object at 0x106b950>>
```

To get the actual data you need to call read() on that object.

Parse XML from URL into python object

Question: The goodreads website has this API for accessing a user's "shelves":

https://www.goodreads.com/review/list/20990068.xml?key=nGvCqaQ6tn9w4HNpW8kquw&v=2&shelf=toread

It returns XML. I'm trying to create a Django project that shows books on a shelf from this API. I'm looking to find out how (or whether there is a better way than this) to write my view so I can pass an object to my template. Currently, this is what I'm doing:

```python
import urllib2

def homepage(request):
    file = urllib2.urlopen('https://www  # [snippet truncated]
```
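One way to hand the template something friendlier than a raw file object is to parse the response with `xml.etree.ElementTree` and flatten it into plain dicts. A sketch; the `book`/`title` tag names are my assumption about the v2 payload, so verify them against a real response:

```python
import xml.etree.ElementTree as ET

def books_on_shelf(xml_text):
    """Flatten the shelf XML into dicts a Django template can loop over."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.iter('book'):          # tag name assumed, verify
        books.append({'title': book.findtext('title')})
    return books

# In the view (sketch):
# def homepage(request):
#     data = urllib2.urlopen(SHELF_URL).read()
#     return render(request, 'homepage.html', {'books': books_on_shelf(data)})
```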

Python interface to PayPal - urllib.urlencode non-ASCII characters failing

Question: I am trying to implement PayPal IPN functionality. The basic protocol is as follows: The client is redirected from my site to PayPal's site to complete payment, where he logs into his account and authorizes the payment. PayPal then calls a page on my server, passing in details as POST data; the details include a person's name, address, payment info, etc. I need to call a URL on PayPal's site internally from my processing page, passing back all the params that were passed in above, plus an additional one called 'cmd' with …
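In Python 2, `urllib.urlencode` tries to coerce unicode values to ASCII, so a name or address containing non-ASCII characters blows up. A common workaround is to byte-encode every value first, using the charset the PayPal transaction was configured with (the helper name and the utf-8 default are my assumptions):

```python
try:
    from urllib import urlencode           # Python 2, as in the question
except ImportError:
    from urllib.parse import urlencode     # Python 3 location

def encode_params(params, encoding='utf-8'):
    """urlencode() a dict whose values may contain non-ASCII text.

    Each value is converted to bytes in the given encoding first, so
    urlencode never attempts an implicit ASCII conversion.
    """
    safe = {}
    for key, value in params.items():
        if not isinstance(value, bytes):
            value = ('%s' % (value,)).encode(encoding)
        safe[key] = value
    return urlencode(safe)
```

For the IPN verification step, the same params must go back byte-for-byte in the charset PayPal sent them in, so matching the account's configured charset matters more than picking utf-8 by default.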