urllib

How to handle response encoding from urllib.request.urlopen()

ⅰ亾dé卋堺 submitted on 2019-11-26 20:23:53
I'm trying to search a webpage using regular expressions, but I'm getting the following error: TypeError: can't use a string pattern on a bytes-like object. I understand why: urllib.request.urlopen() returns a byte stream, so, at least I'm guessing, re doesn't know what encoding to use. What am I supposed to do in this situation? Is there a way to specify the encoding method in the request, or will I need to decode the string myself? If so, I assume I should read the encoding from the header info, or from the encoding type if specified in the HTML, and then decode the response accordingly.
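
A minimal sketch of the usual fix (the URL and the regex here are placeholders, not from the question): take the charset from the Content-Type header, fall back to UTF-8, and decode before matching:

import re
import urllib.request

# Read the declared charset from the response headers; fall back to
# UTF-8 if the server does not declare one.
with urllib.request.urlopen('http://example.com') as response:
    charset = response.headers.get_content_charset() or 'utf-8'
    html = response.read().decode(charset)

# re now works, because html is a str rather than bytes.
titles = re.findall(r'<title>(.*?)</title>', html)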

How to catch 404 error in urllib.urlretrieve

丶灬走出姿态 submitted on 2019-11-26 19:58:09
Question: Background: I am using urllib.urlretrieve, as opposed to any other function in the urllib* modules, because of the hook function support (see reporthook below), which is used to display a textual progress bar. This is Python >= 2.6.

>>> urllib.urlretrieve(url[, filename[, reporthook[, data]]])

However, urlretrieve is so dumb that it leaves no way to detect the status of the HTTP request (e.g. was it a 404 or a 200?).

>>> fn, h = urllib.urlretrieve('http://google.com/foo/bar')
>>> h.items()
[(
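
One common workaround, sketched here rather than taken from the question: urlretrieve delegates to FancyURLopener, which silently saves error pages, so a subclass that raises on HTTP errors restores 404 detection while keeping the reporthook (names below are illustrative):

import urllib

class RaisingOpener(urllib.FancyURLopener):
    # Override the default handler that silently returns the error page.
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        raise IOError('HTTP %d: %s' % (errcode, errmsg))

def progress(block_count, block_size, total_size):
    pass  # hook for a textual progress bar

try:
    fn, h = RaisingOpener().retrieve('http://google.com/foo/bar',
                                     'bar.html', reporthook=progress)
except IOError as e:
    print 'download failed:', e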

Python: Get HTTP headers from urllib2.urlopen call?

家住魔仙堡 submitted on 2019-11-26 19:33:47
Does urllib2 fetch the whole page when a urlopen call is made? I'd like to just read the HTTP response headers without getting the page. It looks like urllib2 opens the HTTP connection and then subsequently gets the actual HTML page... or does it just start buffering the page with the urlopen call?

import urllib2
myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'
page = urllib2.urlopen(myurl)  # open connection, get headers
html = page.readlines()        # stream page

Answer (tolmeda): Use the response.info() method to get the headers. From the urllib2 docs: urllib2
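
A sketch of the two options this answer points at, reusing the question's URL: info() returns the headers urlopen has already received (the body is not read until you ask for it), and an httplib HEAD request avoids the body entirely:

import urllib2
import httplib

myurl = 'http://www.kidsidebyside.org/2009/05/come-and-draw-the-circle-of-unity-with-us/'

# Option 1: urlopen reads the status line and headers up front; the body
# is not consumed until you call read()/readlines().
page = urllib2.urlopen(myurl)
print page.info()  # the HTTP response headers

# Option 2: a HEAD request never asks for the body at all.
conn = httplib.HTTPConnection('www.kidsidebyside.org')
conn.request('HEAD', '/2009/05/come-and-draw-the-circle-of-unity-with-us/')
print conn.getresponse().getheaders()
conn.close()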

SSL: CERTIFICATE_VERIFY_FAILED with Python3

最后都变了- submitted on 2019-11-26 19:27:30
Question: I apologize if this is a silly question, but I have been trying to teach myself how to use BeautifulSoup so that I can create a few projects. I was following this link as a tutorial: https://www.youtube.com/watch?v=5GzVNi0oTxQ After following the exact same code as him, this is the error that I get:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/urllib/request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req
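
For reference, a hedged sketch of the usual quick workaround (the URL is a placeholder): pass an ssl context that skips verification. On a macOS framework build like the one in this traceback, running the bundled Install Certificates.command is the proper fix, since it lets the default verifying context work:

import ssl
import urllib.request

# Quick workaround: skip certificate verification. Do not use this for
# anything security-sensitive; fixing the certificate store is better.
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE

html = urllib.request.urlopen('https://example.com', context=context).read()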

Get size of a file before downloading in Python

社会主义新天地 submitted on 2019-11-26 18:54:02
Question: I'm downloading an entire directory from a web server. It works OK, but I can't figure out how to get the file size before downloading it, to compare whether it was updated on the server or not. Can this be done as if I were downloading the file from an FTP server?

import urllib
import re

url = "http://www.someurl.com"

# Download the page locally
f = urllib.urlopen(url)
html = f.read()
f.close()
f = open("temp.htm", "w")
f.write(html)
f.close()

# List only the .TXT / .ZIP files
fnames = re.findall('^.*<a
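
A minimal sketch of the usual approach, assuming the server reports a Content-Length header (the file URL is a placeholder): open the URL and read the size from the headers without reading the body:

import urllib

f = urllib.urlopen('http://www.someurl.com/somefile.zip')
# Content-Length is the size in bytes, if the server reports it.
size = f.info().getheader('Content-Length')
f.close()
print 'remote size:', size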

Handling urllib2's timeout? - Python

倖福魔咒の submitted on 2019-11-26 18:28:44
I'm using the timeout parameter within urllib2's urlopen:

urllib2.urlopen('http://www.example.org', timeout=1)

How do I tell Python that if the timeout expires a custom error should be raised? Any ideas?

There are very few cases where you want to use a bare except:. Doing so captures any exception, which can be hard to debug, and it captures exceptions including SystemExit and KeyboardInterrupt, which can make your program annoying to use. At the very simplest, you would catch urllib2.URLError:

try:
    urllib2.urlopen("http://example.com", timeout=1)
except urllib2.URLError, e:
    raise
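
The excerpt cuts off at the bare raise; here is a sketch of finishing the idea with a custom error (the exception name is illustrative). A timeout can surface either as socket.timeout directly or wrapped in urllib2.URLError, so both cases need handling:

import socket
import urllib2

class RequestTimeoutError(Exception):
    """Custom error for an expired timeout (name is illustrative)."""

try:
    urllib2.urlopen('http://www.example.org', timeout=1)
except urllib2.URLError as e:
    if isinstance(e.reason, socket.timeout):
        raise RequestTimeoutError('request timed out')
    raise
except socket.timeout:
    raise RequestTimeoutError('request timed out')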

no module named urllib.parse (How should I install it?)

泄露秘密 submitted on 2019-11-26 18:14:49
Question: I'm trying to run a REST API on CentOS 7. I read that urllib.parse is in Python 3, but I'm using Python 2.7.5, so I don't know how to install this module. I installed all the requirements but still can't run the project. When I'm looking for a URL I get this (I'm using the browsable interface):

Output: ImportError at /stamp/ No module named urllib.parse

Answer 1: If you need to write code which is compatible with both Python 2 and Python 3, you can use the following import:

try:
    from urllib.parse import urlparse
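
The import above is cut off; the standard two/three-compatible pattern it refers to looks like this:

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2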

Submitting to a web form using python

喜欢而已 submitted on 2019-11-26 18:13:13
Question: I have seen questions like this asked many, many times, but none are helpful. I'm trying to submit data to a form on the web. I've tried requests and urllib, and none have worked. For example, here is code that should search for the [python] tag on SO:

import urllib
import urllib2

url = 'http://stackoverflow.com/'

# Prepare the data
values = {'q' : '[python]'}
data = urllib.urlencode(values)

# Send HTTP POST request
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
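
A hedged guess at why that snippet fails: Stack Overflow's search is a GET endpoint, so appending the query string to /search and issuing a GET is more likely to work than POSTing to the front page:

import urllib
import urllib2

values = {'q': '[python]'}
# Append the query string and issue a GET instead of a POST.
url = 'http://stackoverflow.com/search?' + urllib.urlencode(values)
response = urllib2.urlopen(url)
html = response.read()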

Parse the html code for a whole webpage scrolled down

空扰寡人 submitted on 2019-11-26 16:48:23
Question:

from bs4 import BeautifulSoup
import urllib, sys

reload(sys)
sys.setdefaultencoding("utf-8")

r = urllib.urlopen('https://twitter.com/ndtv').read()
soup = BeautifulSoup(r)

This gives me not the whole web page scrolled down to the end, which is what I want, but only some of it.

EDIT:

from selenium import webdriver
from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from
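
A sketch of the usual Selenium answer to this (the driver choice and sleep time are assumptions): scroll until the page height stops growing, then hand the final page_source to BeautifulSoup:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://twitter.com/ndtv')

last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the newly loaded content time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source)
driver.quit()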

How to use urllib to download image from web

て烟熏妆下的殇ゞ submitted on 2019-11-26 16:48:04
Question: I'm trying to download an image using this code:

from urllib import urlretrieve
urlretrieve('http://gdimitriou.eu/wp-content/uploads/2008/04/google-image-search.jpg',
            'google-image-search.jpg')

It worked. The image was downloaded and can be opened by any image viewer software. However, the code below is not working. The downloaded image is only 2 KB and can't be opened by any image viewer.

from urllib import urlretrieve
urlretrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg',
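
A hedged guess at the cause, with a sketch of the common fix: upload.wikimedia.org rejects urllib's default User-Agent, so the 2 KB file is an error page rather than the image; presenting a browser-like User-Agent usually resolves it:

from urllib import FancyURLopener

class BrowserLikeOpener(FancyURLopener):
    # urllib sends self.version as the User-Agent header.
    version = 'Mozilla/5.0'

opener = BrowserLikeOpener()
opener.retrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg',
                'Zindagi1976.jpg')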