urllib2

Python: urlretrieve PDF downloading

馋奶兔 submitted on 2019-12-04 14:33:17
I am using urllib's urlretrieve() function in Python to try to grab some PDFs from websites. It has (at least for me) stopped working and is downloading damaged data (15 KB instead of 164 KB). I have tested this with several PDFs, all without success (e.g. random.pdf). I can't seem to get it to work, and I need to be able to download PDFs for the project I am working on. Here is an example of the kind of code I am using to download the PDFs (and parse the text using pdftotext.exe): def get_html(url): # gets html of page from Internet import os import urllib2 import urllib from …
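The snippet above is cut off before the download itself, so here is a minimal sketch of the same idea, written with urllib.request (Python 3's successor to urllib2). The URL and the injectable `opener` parameter are assumptions added for illustration; the key points are sending a browser-like User-Agent (some sites serve an error page, i.e. "damaged data", to the default Python agent) and writing the response in binary mode:

```python
# Minimal sketch of a PDF download; urllib2 became urllib.request in Python 3.
import urllib.request

def fetch_pdf(url, dest_path, opener=urllib.request.urlopen):
    """Download `url` and write the raw bytes to `dest_path`.

    Writing in binary mode ('wb') matters: text mode is a classic cause of
    truncated or corrupted PDFs on Windows. The `opener` argument is a
    hypothetical hook so the function can be exercised without a network.
    """
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with opener(req) as resp:
        data = resp.read()
    with open(dest_path, "wb") as f:
        f.write(data)
    return len(data)
```

Because the opener is injectable, the function can be tested against an in-memory stub instead of a live server.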

Returning Google Search To Python

这一生的挚爱 submitted on 2019-12-04 14:32:00
I am trying to get the first 20 results from a Google search. When I use urllib2.urlopen() it gives me an error and says I am forbidden. I heard that it has something to do with faking your user-agent string, but I have next to no urllib2 experience and would be very grateful if anyone could help. Thanks, giodamelio. You should probably just use a library that does all the hard work. xGoogle lets you get the search results as a list. From the examples: from xgoogle.search import GoogleSearch gs = GoogleSearch("quick and dirty") gs.results_per_page = 50 results = gs.get_results() Adam Matan …
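For the user-agent part of the question specifically, here is a hedged sketch (in Python 3's urllib.request, the renamed urllib2): the "forbidden" error is typically the server rejecting the default `Python-urllib/x.y` agent, and the fix is attaching a browser-like User-Agent header to the Request. The agent string below is just an example value:

```python
import urllib.request

def make_request(url, user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    # Override the default "Python-urllib/x.y" agent that servers often block.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = make_request("http://www.google.com/search?q=quick+and+dirty")
```

Note that scraping search results directly may still be blocked or rate-limited regardless of the header, which is why the answer points to a dedicated library.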

Basic authentication using urllib2 with python with JIRA REST api

霸气de小男生 submitted on 2019-12-04 13:34:25
Question: I am trying to find out how I can use basic authentication with urllib2 in Python to get the issue key. The JIRA REST API describes the available URIs. Thanks for the suggestions, I will try them; meanwhile, I just wanted to update this with my own effort. Here is the sample Python code I tried: import urllib2, sys, re, base64 from urlparse import urlparse theurl = 'http://my.rest-server.com:8080/rest/api/latest/AA-120' # if you want to run this example you'll need to supply a protected page with y…
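Since the asker's code already imports base64, a common pattern here is pre-emptive Basic auth: build the Authorization header by hand rather than relying on urllib2's HTTPBasicAuthHandler (which only responds to a 401 challenge). A minimal sketch, using the question's placeholder server URL and made-up credentials, shown with Python 3's urllib.request:

```python
import base64
import urllib.request  # urllib2 in Python 2

def basic_auth_request(url, user, password):
    # Basic auth is just "Basic " + base64("user:password") in one header.
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": "Basic " + token})

req = basic_auth_request(
    "http://my.rest-server.com:8080/rest/api/latest/AA-120",
    "admin", "secret")
```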

urllib2 and cookielib thread safety

蹲街弑〆低调 submitted on 2019-12-04 13:04:37
As far as I've been able to tell, cookielib isn't thread-safe; then again, the post stating so is five years old, so it might be wrong. Nevertheless, I've been wondering: if I spawn a class like this: class Acc: jar = cookielib.CookieJar() cookie = urllib2.HTTPCookieProcessor(jar) opener = urllib2.build_opener(cookie) headers = {} def __init__ (self,login,password): self.user = login self.password = password def login(self): return False # Some magic, irrelevant def fetch(self,url): req = urllib2.Request(url,None,self.headers) res = self.opener.open(req) return res.read() for each worker…
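One detail worth flagging in the class above: `jar`, `cookie`, and `opener` are class attributes, so every Acc instance (and hence every worker thread) shares one CookieJar. A sketch of the safer variant, with the shared state moved into `__init__` so each worker owns its own jar and opener (shown with the Python 3 names http.cookiejar/urllib.request):

```python
import http.cookiejar
import urllib.request

class Acc:
    def __init__(self, login, password):
        self.user = login
        self.password = password
        # Per-instance jar and opener: nothing is shared between workers,
        # which sidesteps the thread-safety question entirely.
        self.jar = http.cookiejar.CookieJar()
        self.opener = urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.jar))
        self.headers = {}

    def fetch(self, url):
        req = urllib.request.Request(url, None, self.headers)
        with self.opener.open(req) as res:
            return res.read()

a, b = Acc("u1", "p1"), Acc("u2", "p2")
```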

get many pages with pycurl?

℡╲_俬逩灬. submitted on 2019-12-04 12:23:12
I want to get many pages from a website, like curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1", but get the pages' data in Python, not disk files. Can someone please post pycurl code to do this, or fast urllib2 (not one-at-a-time) if that's possible, or else say "forget it, curl is faster and more robust"? Thanks. Here is a solution based on urllib2 and threads: import urllib2 from threading import Thread BASE_URL = 'http://farmsubsidy.org/DE/browse?page=' NUM_RANGE = range(0000, 3603) THREADS = 2 def main(): for nums in split_seq(NUM_RANGE, THREADS): t = Spider(BASE_URL,…
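The answer's snippet is cut off before `split_seq` is defined. A common implementation of that helper (an assumption filling the gap, not necessarily the original author's code) chops the page-number range into one roughly equal chunk per worker thread:

```python
def split_seq(seq, size):
    """Split `seq` into `size` contiguous chunks of near-equal length."""
    newseq = []
    splitsize = 1.0 / size * len(seq)
    for i in range(size):
        # Rounding the float boundaries distributes any remainder evenly.
        newseq.append(seq[int(round(i * splitsize)):int(round((i + 1) * splitsize))])
    return newseq
```

Each Spider thread then iterates over its own chunk, so the threads never contend for the same page numbers.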

Accepting File Argument in Python (from Send To context menu)

非 Y 不嫁゛ submitted on 2019-12-04 11:59:09
I'm going to start off by noting that I have next to no Python experience. (screenshot: http://www.aquate.us/u/9986423875612301299.jpg) As you may know, by simply dropping a shortcut in the Send To folder on your Windows PC, you can allow a program to take a file as an argument. How would I write a Python program that takes this file as an argument? And, as a bonus if anyone gets a chance: how would I integrate that with urllib2 to POST the file to a PHP script on my server? Thanks in advance. Edit: also, how do I make something show up in the Send To menu? I was under the impression that you…
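For the first part of the question: when Windows invokes a Send To shortcut, it passes the dropped file's path as a command-line argument, which Python exposes through sys.argv. A minimal sketch (the POST-to-PHP bonus would build on this with urllib.request):

```python
import sys

def file_from_argv(argv):
    """Return the path handed over by Send To, or None if nothing was passed.

    argv[0] is the script itself; the dropped file arrives as argv[1].
    """
    return argv[1] if len(argv) > 1 else None

path = file_from_argv(sys.argv)
```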

How to resume download in PYTHON, using urlretrieve function?

折月煮酒 submitted on 2019-12-04 11:47:15
Can anyone tell me how to resume a download? I'm using the urlretrieve function. If there is an interruption, the download restarts from the beginning. I want the program to read the size of the local file (which I am able to do) and then resume the download from that very byte onwards. Source: https://stackoverflow.com/questions/3581296/how-to-resume-download-in-python-using-urlretrieve-function
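urlretrieve itself has no resume support, but the underlying HTTP mechanism is a Range request: send `Range: bytes=<local_size>-` and append the response body to the local file (this only works when the server answers 206 Partial Content). A sketch of the request-building half, with a placeholder URL, using urllib.request:

```python
import os
import urllib.request

def resume_request(url, local_path):
    """Build a Request that asks the server to continue from the local size."""
    start = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    req = urllib.request.Request(url)
    if start:
        # Open-ended range: everything from byte `start` to the end.
        req.add_header("Range", "bytes=%d-" % start)
    return req
```

The caller would then open the request and append the body to the file in `'ab'` mode, after checking the response status is 206.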

Python 2.6: parallel parsing with urllib2

萝らか妹 submitted on 2019-12-04 11:43:16
I'm currently retrieving and parsing pages from a website using urllib2. However, there are many of them (more than 1000), and processing them sequentially is painfully slow. I was hoping there was a way to retrieve and parse pages in parallel. If that's a good idea, is it possible, and how do I do it? Also, what are "reasonable" values for the number of pages to process in parallel (I wouldn't want to put too much strain on the server or get banned because I'm using too many connections)? Thanks! You can always use threads (i.e. run each download in a separate thread). For large…
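A bounded thread pool addresses both halves of the question: downloads run in parallel, while `max_workers` caps the number of simultaneous connections (the politeness concern). The sketch below uses concurrent.futures, which shipped in Python 3.2 and was available to Python 2.6 as the PyPI `futures` backport; `fetch` is injectable so the pattern can be shown without touching the network, but in real use it would wrap urllib2.urlopen plus the parsing step:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=5):
    # max_workers bounds concurrent requests; a single-digit value is a
    # reasonable starting point for not straining the target server.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, even though downloads finish out of order.
        return list(pool.map(fetch, urls))
```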

Python urllib2 does not respect timeout

孤街醉人 submitted on 2019-12-04 11:33:49
The following two lines of code hang forever: import urllib2 urllib2.urlopen('https://www.5giay.vn/', timeout=5) This is with Python 2.7, and I have no http_proxy or any other env variables set. Any other website works fine. I can also wget the site without any issue. What could be the issue? unutbu: If you run import urllib2 url = 'https://www.5giay.vn/' urllib2.urlopen(url, timeout=1.0) wait a few seconds, and then use C-c to interrupt the program, you'll see File "/usr/lib/python2.7/ssl.py", line 260, in read return self._sslobj.read(len) KeyboardInterrupt This shows that the program is
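As the traceback shows, the hang happens inside ssl.py's read, i.e. in a blocking SSL operation that the urlopen `timeout` argument does not cover on this Python 2.7 build. One blunt workaround (an assumption about the fix, not quoted from the answer) is a process-wide default socket timeout, which applies to every socket created afterwards:

```python
import socket

# Affects all subsequently created sockets, including those urllib2/
# urllib.request opens under the hood, so blocking reads eventually
# raise socket.timeout instead of hanging forever.
socket.setdefaulttimeout(5.0)
```

Being global, this also affects unrelated sockets in the same process, so it is best set once at program start.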

debugging python web service

家住魔仙堡 submitted on 2019-12-04 11:17:41
Question: I am using the instructions found here to try to inspect the HTTP commands being sent to my web server. However, I am not seeing the HTTP commands printed on the console as suggested in the tutorial. Does anyone know how to display/debug the HTTP commands at the CLI? I am running Python 2.6.5 on Ubuntu Linux. Answer 1: The tutorial's information seems to be deprecated. The correct way to debug with urllib2 nowadays is: import urllib2 request = urllib2.Request('http://diveintomark.org/xml/atom.xml')…
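The answer's snippet is truncated; the approach it is introducing is passing `debuglevel=1` to HTTPHandler so the request and response lines are echoed to stdout. A sketch of the complete pattern, written with the Python 3 module name urllib.request (urllib2 in the question's Python 2.6):

```python
import urllib.request

# debuglevel=1 makes the handler print each HTTP request/response line.
handler = urllib.request.HTTPHandler(debuglevel=1)
opener = urllib.request.build_opener(handler)
# urllib.request.install_opener(opener)  # then every urlopen() call is traced
```

With the opener installed, fetching any URL prints lines like `send: b'GET / HTTP/1.1...'` to the console, which is the debugging output the asker was missing.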