urllib

Python: URLError: <urlopen error [Errno 10060]

China☆狼群 submitted on 2019-11-26 13:48:43
OS: Windows 7; Python 2.7.3 using the Python GUI Shell. I'm trying to read a website through Python, and several authors use the urllib and urllib2 libraries. To store the site in a variable, I've seen a similar approach proposed:

import urllib
import urllib2
g = "http://www.google.com/"
read = urllib2.urlopen(g)

The last line raises an error after 120+ seconds:

Traceback (most recent call last):
  File "<pyshell#27>", line 1, in <module>
    r = urllib2.urlopen(o)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib
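Errno 10060 is Windows' "connection attempt timed out" error, which usually points at a firewall or proxy between you and the site rather than at the code itself. A minimal sketch, assuming you sit behind an HTTP proxy (the proxy address below is a placeholder), that also passes an explicit timeout so failures surface quickly:

import urllib2

# Placeholder proxy address; replace with your network's actual proxy.
proxy = urllib2.ProxyHandler({"http": "http://proxy.example.com:8080"})
urllib2.install_opener(urllib2.build_opener(proxy))

try:
    # timeout is in seconds; without it a blocked connection can hang for minutes
    read = urllib2.urlopen("http://www.google.com/", timeout=10).read()
except urllib2.URLError as e:
    print "Could not reach the server:", e.reason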

should I call close() after urllib.urlopen()?

那年仲夏 submitted on 2019-11-26 13:06:07
I'm new to Python and reading someone else's code: should urllib.urlopen() be followed by urllib.close()? Otherwise, one would leak connections, correct?

The close method must be called on the result of urllib.urlopen, not on the urllib module itself as you're thinking (you mention urllib.close, which doesn't exist). The best approach: instead of x = urllib.urlopen(u) etc., use:

import contextlib
with contextlib.closing(urllib.urlopen(u)) as x:
    ...use x at will here...

The with statement, and the closing context manager, will ensure proper closure even in the presence of exceptions.
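A complete, runnable version of that pattern (Python 2; the URL is a placeholder):

import contextlib
import urllib

url = "http://www.example.com/"
with contextlib.closing(urllib.urlopen(url)) as x:
    data = x.read()
# The connection is closed here, even if read() raised an exception.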

Changing User Agent in Python 3 for urllib.request.urlopen

泪湿孤枕 submitted on 2019-11-26 12:40:11
Question: I want to open a URL using urllib.request.urlopen('someurl'):

with urllib.request.urlopen('someurl') as url:
    b = url.read()

I keep getting the following error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I understand the error to be due to the site not letting Python access it, to stop bots wasting their network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for this
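The usual fix is to build a urllib.request.Request with a custom User-Agent header and pass that to urlopen. A sketch, assuming a browser-like user agent is all the site checks (the URL and UA string are placeholders):

import urllib.request

url = "http://www.example.com/"  # placeholder
req = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},  # placeholder UA
)
with urllib.request.urlopen(req) as response:
    b = response.read()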

How do I set headers using python's urllib?

三世轮回 submitted on 2019-11-26 12:37:24
Question: I am pretty new to Python's urllib. What I need to do is set a custom header for the request being sent to the server. Specifically, I need to set the Content-Type and Authorization headers. I have looked into the Python documentation, but I haven't been able to find it.

Answer 1: Adding HTTP headers using urllib2, from the docs:

import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
resp = urllib2.urlopen(req)
content = resp.read()
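Applying the same add_header technique to the headers the question actually asks about, here is a hedged sketch; the URL, payload, and credentials are placeholders, and HTTP Basic auth is assumed purely for illustration:

import base64
import json
import urllib2

req = urllib2.Request("http://www.example.com/api",
                      data=json.dumps({"key": "value"}))
req.add_header("Content-Type", "application/json")
# Assumes Basic auth; substitute whatever scheme your server expects.
req.add_header("Authorization", "Basic " + base64.b64encode("user:password"))
resp = urllib2.urlopen(req)
content = resp.read()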

How to download any(!) webpage with correct charset in python?

廉价感情. submitted on 2019-11-26 11:56:56
Question: When screen-scraping a webpage using Python, one has to know the character encoding of the page. If you get the character encoding wrong, your output will be messed up. People usually use some rudimentary technique to detect the encoding: they either use the charset from the header, or the charset defined in the meta tag, or they use an encoding detector (which does not care about meta tags or headers). By using only one of these techniques, sometimes you will not get the same result as
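One way to combine those signals, sketched in Python 2: trust the charset declared in the HTTP header when present, and fall back to a statistical detector otherwise. The third-party chardet package is an assumption here, not part of urllib:

import urllib2
import chardet  # third-party detector; assumed installed

def fetch_decoded(url):
    resp = urllib2.urlopen(url)
    raw = resp.read()
    # 1) Prefer the charset declared in the Content-Type header.
    charset = resp.headers.getparam("charset")
    if charset is None:
        # 2) Fall back to byte-level detection (ignores headers and meta tags).
        charset = chardet.detect(raw)["encoding"] or "utf-8"
    return raw.decode(charset, "replace")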

Download Returned Zip file from URL

可紊 submitted on 2019-11-26 11:56:36
Question: If I have a URL that, when submitted in a web browser, pops up a dialog box to save a zip file, how would I go about catching and downloading this zip file in Python?

Answer 1: Use urllib2.urlopen. The return value is a file-like object that you can read(), pass to zipfile, and so on.

Answer 2: As far as I can tell, the proper way to do this is:

import requests, zipfile, StringIO
r = requests.get(zip_file_url, stream=True)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
z.extractall()

of course you'd
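Spelling out answer 1 in Python 2 (the URL and output directory are placeholders): urlopen's file-like result is not seekable, so the bytes are buffered in a StringIO before zipfile reads them.

import urllib2
import zipfile
import StringIO

zip_file_url = "http://www.example.com/archive.zip"  # placeholder
remote = urllib2.urlopen(zip_file_url)
archive = zipfile.ZipFile(StringIO.StringIO(remote.read()))
archive.extractall("output_dir")  # extract every member into output_dir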

How to percent-encode URL parameters in Python?

浪子不回头ぞ submitted on 2019-11-26 11:03:44
If I do

url = "http://example.com?p=" + urllib.quote(query)

it doesn't encode / to %2F (breaking OAuth normalization), and it doesn't handle Unicode (it throws an exception). Is there a better library?

Nadia Alramli: From the docs:

urllib.quote(string[, safe])

Replace special characters in string using the %xx escape. Letters, digits, and the characters '_.-' are never quoted. By default, this function is intended for quoting the path section of the URL. The optional safe parameter specifies additional characters that should not be quoted; its default value is '/'. That means passing '' for safe will
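Putting both fixes together in a short sketch: encode the Unicode string to UTF-8 bytes first, and pass safe='' so '/' is escaped as well:

import urllib

query = u"a/b \u00e9"
# safe='' overrides the default safe set ('/'), so '/' becomes %2F as OAuth
# requires; encoding to UTF-8 first avoids the KeyError that quote() raises
# on non-ASCII Unicode input.
url = "http://example.com?p=" + urllib.quote(query.encode("utf-8"), safe="")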

Python web scraping: fetching and storing page data (Part 1)

≡放荡痞女 submitted on 2019-11-26 10:26:03
Python web scraping: fetching and storing page data (Part 1)

Contents: environment setup; basic scraping principles; using the urllib library; using the requests library; regular expressions; a worked example.

Environment setup

1. Install Anaconda (or Python 3.7) and PyCharm beforehand.
*Anaconda downloads faster from the USTC mirror.
2. Problems encountered during installation:
*Anaconda: remember to tick "add to PATH" during installation. If you didn't, add it manually: right-click Computer, then Properties, Advanced system settings, Environment Variables, and append the Anaconda install path (e.g. C:\Users\Aurora\Anaconda3;) to the user/system PATH variable.
If Jupyter Notebook opens to a blank page, open C:\Users\<your username>\.jupyter\jupyter_notebook_config.py and append the following code:

import webbrowser
webbrowser.register("Chrome", None, webbrowser.GenericBrowser(u"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"))
c.NotebookApp.browser = u"Chrome"  # browser name, browser path (right-click the shortcut and open Properties to find the path)

*Anaconda: open cmd/anaconda prompt

What should I do if socket.setdefaulttimeout() is not working?

你说的曾经没有我的故事 submitted on 2019-11-26 09:38:03
Question: I'm writing a (multi-threaded) script to retrieve contents from a website, and the site's not very stable, so every now and then there's a hanging HTTP request that cannot even be timed out by socket.setdefaulttimeout(). Since I have no control over that website, the only thing I can do is improve my code, but I'm running out of ideas right now. Sample code:

socket.setdefaulttimeout(150)
MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5
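Because setdefaulttimeout() only bounds individual socket operations, a server that keeps trickling bytes can stall a request indefinitely. One workaround (a sketch, not taken from the original answers) is to run the blocking fetch in a daemon thread and enforce a hard deadline with join():

import threading
import mechanize

def fetch(url, result):
    browser = mechanize.Browser()
    result.append(browser.open(url).read())

result = []
worker = threading.Thread(target=fetch, args=("http://www.example.com/", result))
worker.daemon = True   # a hung request will not keep the process alive
worker.start()
worker.join(150)       # hard deadline in seconds
page = result[0] if result else None  # None means the deadline expired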

Making a POST call instead of GET using urllib2

↘锁芯ラ submitted on 2019-11-26 09:29:47
Question: There's a lot of stuff out there on urllib2 and POST calls, but I'm stuck on a problem. I'm trying to do a simple POST call to a service:

url = 'http://myserver/post_service'
data = urllib.urlencode({'name': 'joe', 'age': '10'})
content = urllib2.urlopen(url=url, data=data).read()
print content

I can see the server logs, and they say that I'm doing GET calls even though I'm sending the data argument to urlopen. The library is raising a 404 error (not found), which is correct for a
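For reference, a non-None data argument is precisely what makes urllib2 send a POST, so the code above should already POST. One common explanation (sketched here, not confirmed by the excerpt): if the service answers with a redirect, urllib2's redirect handler re-issues the request as a GET, which is what then appears in the server logs.

import urllib
import urllib2

url = "http://myserver/post_service"
data = urllib.urlencode({"name": "joe", "age": "10"})
req = urllib2.Request(url, data)  # data is not None, so the method is POST
print req.get_method()            # prints "POST"
content = urllib2.urlopen(req).read()  # a redirect response is refetched as GET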