urllib2

Does urllib2.urlopen() actually fetch the page?

╄→гoц情女王★ submitted on 2019-12-06 07:11:43
I was wondering: when I use urllib2.urlopen(), does it just read the headers, or does it actually bring back the entire webpage? That is, does the HTML actually get fetched on the urlopen() call or on the read() call?

    handle = urllib2.urlopen(url)
    html = handle.read()

The reason I ask is this workflow:

- I have a list of URLs (some of them behind short-URL services)
- I only want to read a webpage if I haven't seen that URL before
- I need to call urlopen() and use geturl() to get the final page the link goes to (after the 302 redirects), so I know whether I've crawled it yet or not

I don't want to incur the …
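For the redirect-resolution part of that workflow, here is a minimal sketch (assuming only Python 2 with urllib2 and a hypothetical input list): urlopen() sends the request and reads the status line and headers, so geturl() is available without ever pulling the body via read().

    import urllib2

    def resolve_final_url(url):
        # urlopen() follows the 302s and parses the response headers;
        # the body is not consumed until .read() is called.
        handle = urllib2.urlopen(url)
        final_url = handle.geturl()   # URL after all redirects
        handle.close()                # close without reading the body
        return final_url

    seen = set()
    for url in ['http://example.com/short1']:   # hypothetical list of URLs
        final = resolve_final_url(url)
        if final not in seen:
            seen.add(final)
            # only now pay for the full download:
            # html = urllib2.urlopen(final).read()

Closing early avoids reading the body into Python, though some body bytes may already be in transit on the socket.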

Python 2.6: parallel parsing with urllib2

帅比萌擦擦* submitted on 2019-12-06 06:39:42
Question: I'm currently retrieving and parsing pages from a website using urllib2. However, there are many of them (more than 1000), and processing them sequentially is painfully slow. I was hoping there was a way to retrieve and parse pages in parallel. If that's a good idea, is it possible, and how do I do it? Also, what are "reasonable" values for the number of pages to process in parallel? I wouldn't want to put too much strain on the server or get banned because I'm using too many …
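One way this is commonly done on Python 2.6, where concurrent.futures is not available, is a fixed pool of worker threads fed from a Queue. This is a sketch under those assumptions (hypothetical URL list; parsing stubbed out):

    import urllib2
    from Queue import Queue
    from threading import Thread

    NUM_WORKERS = 8   # a starting guess; tune to what the server tolerates

    def worker(tasks, results):
        while True:
            url = tasks.get()
            try:
                html = urllib2.urlopen(url).read()
                results.append((url, len(html)))   # do the real parsing here
            except urllib2.URLError:
                pass   # real code should record the failure
            finally:
                tasks.task_done()

    urls = ['http://example.com/page%d' % i for i in range(100)]  # hypothetical
    tasks = Queue()
    results = []
    for _ in range(NUM_WORKERS):
        t = Thread(target=worker, args=(tasks, results))
        t.daemon = True   # let the process exit even if workers are blocked
        t.start()
    for url in urls:
        tasks.put(url)
    tasks.join()   # block until every URL has been fetched

Because urllib2 releases the GIL while waiting on the network, threads parallelize the I/O even though the parsing itself runs one thread at a time.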

How to resume a download in Python, using the urlretrieve function?

心不动则不痛 submitted on 2019-12-06 06:27:22
Question: Can anyone tell me how to resume a download? I'm using the urlretrieve function. If there is an interruption, the download restarts from the beginning. I want the program to read the size of the local file (which I am able to do) and then resume the download from that very byte onwards.

Source: https://stackoverflow.com/questions/3581296/how-to-resume-download-in-python-using-urlretrieve-function
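urlretrieve itself has no resume support, but the usual workaround is to measure the local file and request only the remaining bytes with an HTTP Range header. A sketch using urllib2 instead of urlretrieve, assuming the server honors Range requests (it replies 206 Partial Content when it does):

    import os
    import urllib2

    def resume_download(url, local_path):
        # Bytes already on disk from the interrupted attempt
        existing = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        request = urllib2.Request(url)
        if existing:
            request.add_header('Range', 'bytes=%d-' % existing)
        response = urllib2.urlopen(request)
        # 206 means the server sent only the tail; append to the file.
        # 200 means it ignored Range and sent everything; start over.
        mode = 'ab' if (existing and response.getcode() == 206) else 'wb'
        with open(local_path, mode) as f:
            while True:
                chunk = response.read(64 * 1024)
                if not chunk:
                    break
                f.write(chunk)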

Python urllib2 does not respect timeout

白昼怎懂夜的黑 submitted on 2019-12-06 06:13:42
Question: The following two lines of code hang forever:

    import urllib2
    urllib2.urlopen('https://www.5giay.vn/', timeout=5)

This is with Python 2.7, and I have no http_proxy or any other env variables set. Any other website works fine. I can also wget the site without any issue. What could be the issue?

Answer 1: If you run

    import urllib2
    url = 'https://www.5giay.vn/'
    urllib2.urlopen(url, timeout=1.0)

wait for a few seconds, and then use C-c to interrupt the program, you'll see File "/usr/lib/python2.7/ssl …
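The timeout argument covers individual socket operations, so a TLS handshake that keeps trickling data can stall the call far past the deadline. One common workaround (an assumption here, not the accepted fix from the thread, and Unix-only since it relies on SIGALRM) is to enforce a hard wall-clock limit:

    import signal
    import urllib2

    class HardTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise HardTimeout()

    signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(5)   # hard 5-second deadline: DNS, connect, TLS handshake, read
    try:
        html = urllib2.urlopen('https://www.5giay.vn/').read()
    except HardTimeout:
        html = None   # give up on the slow host
    finally:
        signal.alarm(0)   # cancel the pending alarm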

Python urllib2 returning an empty string

China☆狼群 submitted on 2019-12-06 06:09:39
I'm trying to retrieve the following URL: http://www.winkworth.co.uk/sale/property/flat-for-sale-in-masefield-court-london-n5/HIH140004

    import urllib2
    response = urllib2.urlopen('http://www.winkworth.co.uk/rent/property/terraced-house-to-rent-in-mill-road--/WOT140129')
    response.read()

However, I'm getting an empty string. When I try it through the browser or with cURL it works fine. Any ideas what's going on? I got a response when using the requests library but not when using urllib2, so I experimented with HTTP request headers. As it turns out, the server expects an Accept header; urllib2 …
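A minimal sketch of the fix the excerpt points at, sending an explicit Accept header with urllib2 (the header value here is an assumption; anything reasonable the server recognizes should do):

    import urllib2

    url = 'http://www.winkworth.co.uk/rent/property/terraced-house-to-rent-in-mill-road--/WOT140129'
    request = urllib2.Request(url, headers={'Accept': 'text/html'})
    html = urllib2.urlopen(request).read()
    print(len(html))   # non-zero once the server is satisfied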

Python requests API is not fetching data inside table bodies

北城以北 submitted on 2019-12-06 05:31:53
I am trying to scrape a webpage to get table values from the text data returned in the requests response.

    </thead>
    <tbody class="stats"></tbody>
    <tbody class="annotation"></tbody>
    </table>
    </div>

There is actually some data present inside the tbody classes, but I am unable to access that data using requests. Here is my code:

    server = "http://www.ebi.ac.uk/QuickGO/GProtein"
    header = {'User-agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5'}
    payloads = {'ac': 'Q9BRY0'}
    response = requests.get(server, params=payloads)
    print(response.text)
    #soup = BeautifulSoup …
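The empty <tbody> elements in the excerpt suggest the rows are injected by JavaScript after the page loads, which requests cannot execute. A quick check under that assumption, parsing the raw response with BeautifulSoup to confirm the rows are genuinely absent from the HTML itself:

    import requests
    from bs4 import BeautifulSoup

    server = "http://www.ebi.ac.uk/QuickGO/GProtein"
    response = requests.get(server, params={'ac': 'Q9BRY0'})
    soup = BeautifulSoup(response.text, 'html.parser')
    for tbody in soup.find_all('tbody'):
        # Zero rows here means the table is filled in client-side,
        # so the data has to come from the site's API/XHR calls instead.
        print(tbody.get('class'), len(tbody.find_all('tr')))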

Python fetching <title>

混江龙づ霸主 submitted on 2019-12-06 03:57:22
Question: I want to fetch the title of a webpage which I open using urllib2. What is the best way to do this: to parse the HTML and find what I need (for now only the <title> tag, but I might need more in the future)? Is there a good parsing lib for this purpose?

Answer 1: Yes, I would recommend BeautifulSoup. If you're getting the title it's simply:

    soup = BeautifulSoup(html)
    myTitle = soup.html.head.title

or

    myTitle = soup('title')

Taken from the documentation. It's very robust and will parse the HTML no matter how …
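Putting the two pieces from the excerpt together, a minimal end-to-end sketch (assuming bs4 is installed and using a placeholder URL):

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen('http://example.com/').read()   # hypothetical page
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.title.string)   # the text inside the <title> tag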

Unknown url type error in urllib2

无人久伴 submitted on 2019-12-06 03:28:45
I have searched a lot of similar questions on SO, but did not find an exact match to my case. I am trying to download a video using Python 2.7. Here is my code for downloading the video:

    import urllib2
    from bs4 import BeautifulSoup as bs

    with open('video.txt', 'r') as f:
        last_downloaded_video = f.read()

    webpage = urllib2.urlopen('http://*.net/watch/**-' + last_downloaded_video)
    soup = bs(webpage)
    a = []
    for link in soup.find_all('a'):
        if link.has_attr('data-video-id'):
            a.append(link)

    # try just with first data-video-id
    id = a[0]['data-video-id']
    webpage2 = urllib2.urlopen('http://*/video/play/' + id) …
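urllib2 raises "unknown url type" when the string it is given has no scheme it recognizes, which typically happens when a relative or scheme-relative link scraped from the page is passed to urlopen() as-is. A sketch of the usual fix (the URLs here are hypothetical, since the real ones are masked above), resolving the scraped value against the page it came from:

    from urlparse import urljoin   # urllib.parse.urljoin on Python 3

    base = 'http://example.net/watch/some-video'   # hypothetical page URL
    scraped = '//example.net/video/play/12345'     # scheme-relative link, a typical culprit
    absolute = urljoin(base, scraped)
    print(absolute)   # -> 'http://example.net/video/play/12345'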

https proxy support in python requests library

时光总嘲笑我的痴心妄想 submitted on 2019-12-06 03:02:51
Question: I am using the Python Requests library to do HTTP-related work. I set up a proxy server using the free ntlmaps on my computer to act as a proxy that answers the NTLM challenges from the corporate ISA server. However, the response always seems to be empty, as shown below:

    >>> import requests
    >>> r = requests.get('https://www.google.com')
    >>> r.text
    u'<HTML></HTML>\r\n'

There is no such problem with plain http requests, though. And when I use the urllib2 library, it gets the correct response. I compared the …
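Requests can also be pointed at the local proxy explicitly rather than relying on environment variables. A sketch, assuming ntlmaps is listening on 127.0.0.1:5865 (its usual default port; adjust to your config file):

    import requests

    proxies = {
        'http':  'http://127.0.0.1:5865',
        'https': 'http://127.0.0.1:5865',   # HTTPS is tunneled through the same HTTP proxy
    }
    r = requests.get('https://www.google.com', proxies=proxies)
    print(r.status_code, len(r.text))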

Trying to get Tor to work with Python, but keep getting connection refused?

二次信任 submitted on 2019-12-06 02:43:53
Question: I've been trying to get Tor to work with Python, but I've been hitting a brick wall. I simply can't get any of the examples to work. Here is one from Stack Overflow:

    import urllib2
    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
    opener = urllib2.build_opener(proxy)
    print opener.open('http://check.torproject.org/').read()

I've installed Tor and it works fine while browsing through Aurora. However, running this Python script I get:

    Traceback (most recent call last):
      File "/home/x/Tor.py", …
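Port 8118 belongs to Privoxy, an HTTP proxy usually run in front of Tor, so "connection refused" generally means nothing is listening there. Tor itself only exposes a SOCKS proxy (port 9050 by default), which urllib2 cannot speak natively. A common workaround, assuming the SocksiPy/PySocks module is installed, is to route all sockets through it:

    import socket
    import socks     # SocksiPy / PySocks
    import urllib2

    # Send every new connection through Tor's SOCKS proxy on 9050.
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
    socket.socket = socks.socksocket

    print(urllib2.urlopen('http://check.torproject.org/').read())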