urllib

Opening a URL with urllib in Python 3

北慕城南 submitted on 2019-11-30 03:19:11
I'm trying to open the URL of this API from the Sunlight Foundation and return the data from the page in JSON. This is the code I've produced, minus the parentheses around myapikey:

    import urllib.request.urlopen
    import json
    urllib.request.urlopen("https://sunlightlabs.github.io/congress/legislators?api_key='(myapikey)")

and I'm getting this error:

    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    ImportError: No module named request.urlopen

What am I doing wrong? I've read through https://docs.python.org/3/library/urllib.request.html and still no progress. You need to use
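The traceback comes from importing the function as if it were a module. A minimal sketch of the Python 3 fix (the `fetch_json` helper name is illustrative, not from the question):

```python
# In Python 3, urlopen lives in the urllib.request module; you import the
# module (or the function from it), not the dotted path to the function.
import json
from urllib.request import urlopen

def fetch_json(url):
    # Hypothetical helper: fetch a URL and decode its body as JSON.
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```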

“The owner of this website has banned your access based on your browser's signature” … on a url request in a python program

孤人 submitted on 2019-11-30 01:37:29
Question: When doing a simple request in Python (Enthought Canopy, to be precise) with urllib2, the server denies me access:

    data = urllib.urlopen(an url i cannot post because of reputation, params)
    print data.read()

Error:

    Access denied | play.pokemonshowdown.com used CloudFlare to restrict access
    The owner of this website (play.pokemonshowdown.com) has banned your access
    based on your browser's signature (14e894f5bf8d0920-ua48).

This is apparently a generic issue, so I found several clues on the
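CloudFlare typically rejects clients whose headers look like a script. A sketch of the usual workaround, sending a browser-like User-Agent; the exact string below is an assumption, and any mainstream browser signature behaves the same way:

```python
import urllib.request

# A browser-like User-Agent header (the string is an assumption).
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
req = urllib.request.Request("https://play.pokemonshowdown.com",
                             headers=headers)
# response = urllib.request.urlopen(req)  # performs the actual request
```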

urllib.quote() throws KeyError

和自甴很熟 submitted on 2019-11-30 01:17:22
To encode the URI, I used urllib.quote("schönefeld"), but when non-ASCII characters exist in the string it throws:

    KeyError: u'\xe9'

Code:

    return ''.join(map(quoter, s))

My input strings are köln, brønshøj, schönefeld, etc. It worked when I just printed the statements on Windows (using Python 2.7, PyScripter IDE), but on Linux it raises the exception (I guess the platform doesn't matter). This is what I am trying:

    from commands import getstatusoutput
    queryParams = "schönefeld"
    cmdString = "http://baseurl" + quote(queryParams)
    print getstatusoutput(cmdString)

Exploring the reason for the issue: in urllib.quote(),
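The KeyError comes from Python 2's quote() indexing a byte-sized escape table with a raw unicode code point. Encoding to UTF-8 bytes first fixes it there (quote(u"schönefeld".encode("utf-8"))); in Python 3, urllib.parse.quote() does the encoding for you:

```python
from urllib.parse import quote

# Python 3's quote() UTF-8-encodes non-ASCII characters before escaping,
# so the Python 2 KeyError cannot occur.
encoded = quote("schönefeld")  # 'sch%C3%B6nefeld'
```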

'module' has no attribute 'urlencode'

旧城冷巷雨未停 submitted on 2019-11-30 01:06:10
When I try to follow the Python Wiki's example related to URL encoding:

    >>> import urllib
    >>> params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
    >>> f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
    >>> print f.read()

an error is raised on the second line:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AttributeError: 'module' object has no attribute 'urlencode'

What am I missing? urllib has been split up in Python 3. The urllib.urlencode() function is now urllib.parse.urlencode(), and the urllib.urlopen() function is now urllib.request.urlopen().
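The same wiki example translated to the Python 3 module layout:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# urllib.urlencode() became urllib.parse.urlencode() in Python 3.
params = urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
# f = urlopen("http://www.musi-cal.com/cgi-bin/query?" + params)
# print(f.read())
```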

Urllib and validation of server certificate

蓝咒 submitted on 2019-11-29 21:54:50
I use Python 2.6 and request the Facebook API (HTTPS). I guess my service could be the target of man-in-the-middle attacks. I discovered this morning, reading the urllib module documentation again, this citation:

    Warning: When opening HTTPS URLs, it is not attempted to validate the
    server certificate. Use at your own risk!

Do you have hints / URLs / examples to perform full certificate validation? Thanks for your help. You could create a urllib2 opener which can do the validation for you using a custom handler. The following code is an example that works with Python 2.7.3. It assumes you have
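On modern interpreters (Python 2.7.9+ / 3.4.3+) urlopen validates server certificates by default; a sketch with an explicit SSLContext, which makes the intent visible and lets you pin a CA bundle via cafile= if needed:

```python
import ssl
import urllib.request

# create_default_context() enables certificate and hostname verification.
context = ssl.create_default_context()
# urllib.request.urlopen("https://graph.facebook.com", context=context)
```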

Python Crawler Basics

北城余情 submitted on 2019-11-29 20:36:53
Preface

Python is well suited to building web crawlers, for the following reasons:

1. Interfaces for fetching pages. Compared with static languages such as Java, C#, or C++, Python's interface for fetching web documents is more concise; compared with other dynamic scripting languages such as Perl or shell, Python's urllib package provides a fairly complete API for accessing web documents. (Ruby is also a good choice.) In addition, crawling sometimes requires imitating a browser's behavior, since many sites block crude crawlers. Here we need to fake the user agent and construct suitable requests, for example simulating user login and simulating session/cookie storage and setting. Python has excellent third-party packages for this, such as Requests and mechanize.

2. Processing the fetched pages. Fetched pages usually need processing, such as filtering HTML tags and extracting text. Python's BeautifulSoup provides concise document-processing features and can handle most documents in very little code.

Many languages and tools can do all of the above, but Python does it fastest and cleanest. Life is short, you need Python.

PS: Python 2.x and 3.x differ significantly; this article only discusses crawler implementations for Python 3.x.

Crawler architecture

Components: the URL manager maintains the set of URLs waiting to be crawled and the set already crawled, and hands URLs to be crawled to the page downloader. The page downloader (urllib) fetches the page for each URL
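The URL manager described above can be sketched in a few lines; the class and method names here are illustrative, not from the article:

```python
class UrlManager:
    # Tracks URLs waiting to be crawled and URLs already crawled,
    # and hands new URLs to the downloader one at a time.
    def __init__(self):
        self.new_urls = set()   # waiting to be crawled
        self.old_urls = set()   # already crawled

    def add(self, url):
        # Ignore URLs we have already seen, crawled or queued.
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new(self):
        return bool(self.new_urls)

    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```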

Making HTTP POST request

穿精又带淫゛_ submitted on 2019-11-29 20:28:35
I'm trying to make a POST request to retrieve information about a book. Here is the code; it returns HTTP code 302, Moved:

    import httplib, urllib
    params = urllib.urlencode({
        'isbn': '9780131185838',
        'catalogId': '10001',
        'schoolStoreId': '15828',
        'search': 'Search'
    })
    headers = {"Content-type": "application/x-www-form-urlencoded",
               "Accept": "text/plain"}
    conn = httplib.HTTPConnection("bkstr.com:80")
    conn.request("POST", "/webapp/wcs/stores/servlet/BuybackSearch",
                 params, headers)
    response = conn.getresponse()
    print response.status, response.reason
    data = response.read()
    conn.close()

When
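httplib (http.client in Python 3) is deliberately low-level and does not follow the 302; urllib.request's default opener chases redirects for you. A sketch of the same POST at that level (whether the redirect target accepts the POST is an assumption about the site):

```python
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    'isbn': '9780131185838',
    'catalogId': '10001',
    'schoolStoreId': '15828',
    'search': 'Search',
}).encode('ascii')

req = urllib.request.Request(
    "http://bkstr.com/webapp/wcs/stores/servlet/BuybackSearch",
    data=params,  # a data argument turns the request into a POST
)
# response = urllib.request.urlopen(req)  # follows the redirect chain
```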

The urllib Library

萝らか妹 submitted on 2019-11-29 19:54:28
The urllib library is Python's most basic networking library; it can imitate a browser, send requests to a given server, and save the data the server returns.

urlopen():

The urllib.request module provides the most basic way to construct an HTTP request. With it you can simulate a browser's request process, and it also handles authentication, redirection, browser cookies, and more. Taking the official Python site as an example, let's fetch that page:

    from urllib import request
    response = request.urlopen('https://www.python.org')
    print(response.read())

With just two lines of code we have fetched the Python homepage and printed its source. Once we have the source, can't we then extract the links, image addresses, and text we want?

Next, let's see what it actually returns. Use type() to print the type of the response:

    from urllib import request
    response = request.urlopen('https://www.python.org')
    print(type(response))

The output is:

    <class 'http.client.HTTPResponse'>

Python authenticate and launch private page using webbrowser, urllib and CookieJar

坚强是说给别人听的谎言 submitted on 2019-11-29 17:10:45
I want to log in with a cookiejar and launch not the login page but a page that can only be seen after authentication. I know mechanize does that, but besides not working for me right now, I'd rather do this without it. So far I have:

    import urllib, urllib2, cookielib, webbrowser
    from cookielib import CookieJar

    username = 'my_username'
    password = 'my_password'
    url = 'my_login_page'
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    login_data = urllib.urlencode({'my_username': username,
                                   'my_password': password})
    opener.open(url, login_data)
    page_to_launch = 'my
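The same idea in Python 3 spelling (cookielib became http.cookiejar, urllib2 became urllib.request); the URL and field names are placeholders carried over from the question:

```python
import http.cookiejar
import urllib.parse
import urllib.request

# The HTTPCookieProcessor stores cookies from the login response in cj
# and resends them on every later request through the same opener.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cj))

login_data = urllib.parse.urlencode({
    'my_username': 'my_username',
    'my_password': 'my_password',
}).encode('ascii')
# opener.open('my_login_page', login_data)  # cookies land in cj
# opener.open('my_private_page')            # cookies are resent automatically
```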

A urllib Crawling Example

大憨熊 submitted on 2019-11-29 15:07:08
    from urllib import request
    import random

    def spider(url):
        user_agent_list = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
        ]
        user_agent = random.choice(user_agent_list)
        print
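The excerpt is cut off after choosing the user agent. A plausible continuation (an assumption, not the original code) attaches the chosen string to a Request before fetching:

```python
import random
import urllib.request

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
]

def spider(url):
    # Pick a random browser signature so repeated requests vary.
    user_agent = random.choice(USER_AGENTS)
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    return req  # urllib.request.urlopen(req) would perform the fetch
```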