urllib

How to download any(!) webpage with correct charset in python?

ε祈祈猫儿з submitted on 2019-11-27 06:19:29
Problem: When screen-scraping a webpage using Python, one has to know the character encoding of the page. If you get the character encoding wrong, then your output will be messed up. People usually use some rudimentary technique to detect the encoding: they either use the charset from the header, or the charset defined in the meta tag, or they use an encoding detector (which does not care about meta tags or headers). By using only one of these techniques, you will sometimes not get the same result as you would in a browser. Browsers do it this way: meta tags always take precedence (or the XML definition)
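The excerpt cuts off here. As a minimal sketch of the precedence it describes (meta tag first, then the Content-Type header, then a fallback), assuming Python 3 and nothing beyond the standard library; real browsers do considerably more sniffing:

    import re
    import urllib.request

    def fetch_decoded(url):
        resp = urllib.request.urlopen(url)
        raw = resp.read()
        # 1. look for a charset in a meta tag near the top of the document
        m = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw[:4096], re.I)
        if m:
            encoding = m.group(1).decode('ascii')
        else:
            # 2. fall back to the Content-Type header, 3. then to UTF-8
            encoding = resp.headers.get_content_charset() or 'utf-8'
        return raw.decode(encoding, errors='replace')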

urllib cannot read https

人盡茶涼 submitted on 2019-11-27 06:09:22
Problem: (Python 3.4.2) Would anyone be able to help me fetch https pages with urllib? I've spent hours trying to figure this out. Here's what I'm trying to do (pretty basic): import urllib.request url = "".join((baseurl, other_string, midurl, query)) response = urllib.request.urlopen(url) html = response.read() Here's the error output when I run it: File "./script.py", line 124, in <module> response = urllib.request.urlopen(url) File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen return
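The traceback is truncated before the actual exception, so the root cause here is unknown; on Python 3.4 the usual suspects are an interpreter built without SSL support or a certificate verification failure. A quick diagnostic sketch (python.org as a stand-in URL):

    import ssl               # an ImportError here means Python was built without SSL
    import urllib.request

    url = "https://www.python.org"
    response = urllib.request.urlopen(url)
    html = response.read()
    print(len(html))

On Python 3.4.3 and later, urlopen also accepts a context argument, e.g. urllib.request.urlopen(url, context=ssl.create_default_context()), for explicit control over certificate handling.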

multiprocessing.pool.MaybeEncodingError: 'TypeError("cannot serialize '_io.BufferedReader' object",)'

时光总嘲笑我的痴心妄想 submitted on 2019-11-27 05:39:01
Why does the code below work only with multiprocessing.dummy, but not with plain multiprocessing? import urllib.request #from multiprocessing.dummy import Pool #this works from multiprocessing import Pool urls = ['http://www.python.org', 'http://www.yahoo.com','http://www.scala.org', 'http://www.google.com'] if __name__ == '__main__': with Pool(5) as p: results = p.map(urllib.request.urlopen, urls) Error : Traceback (most recent call last): File "urlthreads.py", line 31, in <module> results = p.map(urllib.request.urlopen, urls) File "C:\Users\patri\Anaconda3\lib\multiprocessing\pool.py",
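The traceback is cut short, but the error name in the title tells the story: multiprocessing.dummy uses threads, so results stay in-process, while a real Pool must pickle whatever the worker returns, and the HTTPResponse returned by urlopen wraps a _io.BufferedReader that cannot be pickled. One common fix is to read the body inside the worker and return plain bytes:

    import urllib.request
    from multiprocessing import Pool

    def fetch(url):
        # read inside the worker: bytes pickle fine, the response object does not
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    urls = ['http://www.python.org', 'http://www.yahoo.com',
            'http://www.scala.org', 'http://www.google.com']

    if __name__ == '__main__':
        with Pool(5) as p:
            results = p.map(fetch, urls)
        print([len(r) for r in results])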

python httplib/urllib get filename

隐身守侯 submitted on 2019-11-27 04:38:01
Problem: Is there a possibility to get the filename, e.g. xyz.com/blafoo/showall.html, if you work with urllib or httplib? So that I can save the file under the filename on the server? If you go to sites like xyz.com/blafoo/ you can't see the filename. Thank you. Answer 1: To get the filename from the response HTTP headers: import cgi response = urllib2.urlopen(URL) _, params = cgi.parse_header(response.headers.get('Content-Disposition', '')) filename = params['filename'] To get the filename from the URL: import
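The answer is truncated at the second import. One plausible completion for the URL-based case (a sketch, not necessarily the original answer's code; Python 2 to match the urllib2 call above):

    import posixpath
    import urlparse

    url = "http://xyz.com/blafoo/showall.html"
    # the path component of the URL, minus everything up to the last slash
    filename = posixpath.basename(urlparse.urlsplit(url).path)
    print filename    # showall.html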

Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?

对着背影说爱祢 submitted on 2019-11-27 04:27:04
Problem: Python's urllib.quote and urllib.unquote do not handle Unicode correctly in Python 2.6.5. This is what happens: In [5]: print urllib.unquote(urllib.quote(u'Cataño')) --------------------------------------------------------------------------- KeyError Traceback (most recent call last) /home/kkinder/<ipython console> in <module>() /usr/lib/python2.6/urllib.pyc in quote(s, safe) 1222 safe_map[c] = (c in safe) and c or ('%%%02X' % i) 1223 _safemaps[cachekey] = safe_map -> 1224 res = map(safe_map.
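The traceback is truncated, but the failure mode is well known: urllib.quote in Python 2 expects a byte string, so a unicode object with non-ASCII characters raises KeyError. The usual workaround is to encode to UTF-8 before quoting and decode after unquoting; a short sketch:

    # -*- coding: utf-8 -*-
    import urllib

    quoted = urllib.quote(u'Cataño'.encode('utf-8'))
    print quoted                                    # Cata%C3%B1o
    print urllib.unquote(quoted).decode('utf-8')    # Cataño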

python3: the urllib module

与世无争的帅哥 submitted on 2019-11-27 04:26:42
urllib is Python's built-in HTTP request library; it needs no installation and contains 4 modules: request, the most basic HTTP request module, used to simulate sending requests; error, the exception-handling module, which lets you catch errors when they occur; parse, a utility module providing many URL-handling methods such as splitting, parsing, and joining; and robotparser, mainly used to read a site's robots.txt file and decide which sites may be crawled. 1. urllib.request.urlopen() urllib.request.urlopen(url, data=None, [timeout,], cafile=None, capath=None, cadefault=False, context=None) issues the request and returns an object of type HTTPResponse, which offers the methods read(), readinto(), getheader(name), getheaders(), and fileno(), and the attributes msg, version, status, reason, debuglevel, and closed. import urllib.request response=urllib.request.urlopen('https://www.python.org') # request the site and get an HTTPResponse object #print(response.read().decode('utf-8'))
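A short sketch exercising a few of the methods and attributes listed above (output values depend on the server, of course):

    import urllib.request

    response = urllib.request.urlopen('https://www.python.org', timeout=10)
    print(response.status)               # e.g. 200
    print(response.reason)               # e.g. 'OK'
    print(response.getheader('Server'))  # a single header looked up by name
    print(response.read(100))            # the first 100 bytes of the body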

How to extract tables from websites in Python

南笙酒味 submitted on 2019-11-27 04:19:38
Problem: Here, http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500 there is a table. My goal is to extract the table and save it to a csv file. I wrote this code: import urllib import os web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500") s = web.read() web.close() ff = open(r"D:\ex\python_ex\urllib\output.txt", "w") ff.write(s) ff.close() I'm lost from here. Can anyone help with this? Thanks! Answer 1: So
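The answer is cut off above. One common approach, not necessarily the one the truncated answer takes, is pandas.read_html, which parses every <table> on the page into a DataFrame (assuming pandas and an HTML parser such as lxml are installed):

    import pandas as pd

    url = ("http://www.ffiec.gov/census/report.aspx"
           "?year=2011&state=01&report=demographic&msa=11500")
    tables = pd.read_html(url)               # one DataFrame per <table>
    tables[0].to_csv("output.csv", index=False)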

Python 'requests' library - define specific DNS?

夙愿已清 submitted on 2019-11-27 04:17:31
In my project I'm handling all HTTP requests with the Python requests library. Now, I need to query the HTTP server using a specific DNS - there are two environments, each using its own DNS, and changes are made independently. So, when the code is running, it should use the DNS specific to the environment, and not the DNS specified in my internet connection. Has anyone tried this using python-requests? I've only found a solution for urllib2: https://stackoverflow.com/questions/4623090/python-set-custom-dns-server-for-urllib-requests requests uses urllib3, which ultimately uses httplib.HTTPConnection as
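The excerpt stops mid-sentence, but the chain it names (requests -> urllib3 -> httplib, and from there the socket module) suggests one crude workaround: monkey-patch socket.getaddrinfo so that selected hostnames resolve to pinned addresses. A sketch with a hypothetical hostname and IP; actually querying an alternate DNS server would need a resolver library such as dnspython:

    import socket

    # hypothetical per-environment overrides; bypasses system DNS for these hosts
    DNS_OVERRIDES = {"service.example.com": "10.0.0.5"}

    _orig_getaddrinfo = socket.getaddrinfo

    def _patched_getaddrinfo(host, *args, **kwargs):
        # substitute the pinned address before the OS resolver sees the name
        return _orig_getaddrinfo(DNS_OVERRIDES.get(host, host), *args, **kwargs)

    socket.getaddrinfo = _patched_getaddrinfo

    import requests
    print(requests.get("http://service.example.com/").status_code)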

Python, opposite function urllib.urlencode

别说谁变了你拦得住时间么 submitted on 2019-11-27 04:11:50
Problem: How can I convert data back to a dict after processing it with urllib.urlencode? urllib.urldecode does not exist. Answer 1: As the docs for urlencode say, "The urlparse module provides the functions parse_qs() and parse_qsl() which are used to parse query strings into Python data structures." (In older Python releases, they were in the cgi module.) So, for example: >>> import urllib >>> import urlparse >>> d = {'a':'b', 'c':'d'} >>> s = urllib.urlencode(d) >>> s 'a=b&c=d' >>> d1 = urlparse.parse_qs(s) >>> d1 {'a':
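The output is cut off above; note that parse_qs wraps each value in a list ({'a': ['b'], 'c': ['d']}). When keys are known not to repeat, parse_qsl plus dict() recovers a flat mapping; a small sketch:

    import urllib
    import urlparse

    s = urllib.urlencode({'a': 'b', 'c': 'd'})
    print urlparse.parse_qs(s)           # {'a': ['b'], 'c': ['d']}
    print dict(urlparse.parse_qsl(s))    # {'a': 'b', 'c': 'd'}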

Web Crawling Basics (2)

房东的猫 submitted on 2019-11-27 03:48:33
The ProxyHandler handler (proxies) Usage: urllib.request.ProxyHandler takes a proxy, which is a dict whose keys depend on the protocol the proxy accepts (http/https) and whose values are socket addresses. Use the created handler together with request.build_opener() to create an opener. Call the opener's open function to issue the request. Example code:

    from urllib import request

    url = "http://www.baidu.com"
    # create a handler (the proxy dict maps a scheme to a host:port address)
    handler = request.ProxyHandler({"http": "233.241.25.25:3375"})
    # create an opener from the handler
    opener = request.build_opener(handler)
    # issue the request with the opener's open method
    resp = opener.open(url)
    print(resp.read())

CookieJar from the http.cookiejar library CookieJar and HTTPCookieProcessor When crawling, we often use cookies to simulate logins and access. When using the urllib library for crawling, we rely on CookieJar from the http.cookiejar library to implement this.
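A minimal sketch of the CookieJar pattern the paragraph describes, since the excerpt ends before any code (the URL is just a placeholder):

    from urllib import request
    from http.cookiejar import CookieJar

    jar = CookieJar()
    # HTTPCookieProcessor stores cookies from responses in the jar and
    # replays them on later requests made through the same opener
    opener = request.build_opener(request.HTTPCookieProcessor(jar))
    resp = opener.open("http://www.baidu.com")
    for cookie in jar:
        print(cookie.name, cookie.value)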