urllib2

How to debug urllib2 request that uses a basic authentication handler

早过忘川 submitted on 2019-11-29 01:18:32

I'm making a request using urllib2 and the HTTPBasicAuthHandler like so:

    import urllib2

    theurl = 'http://someurl.com'
    username = 'username'
    password = 'password'
    passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passman.add_password(None, theurl, username, password)
    authhandler = urllib2.HTTPBasicAuthHandler(passman)
    opener = urllib2.build_opener(authhandler)
    urllib2.install_opener(opener)

    params = "foo=bar"
    response = urllib2.urlopen('http://someurl.com/somescript.cgi', params)
    print response.info()

I'm currently getting an httplib.BadStatusLine exception when running this code. How could I
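One common way to debug a request like this is to turn on the handler's wire-level logging. A minimal sketch using Python 3's urllib.request, the successor to urllib2 (in Python 2, urllib2.HTTPHandler(debuglevel=1) works the same way); the URL and credentials are placeholders:

```python
import urllib.request  # Python 3; the Python 2 equivalent lives in urllib2

# debuglevel=1 makes the handler print the raw request/response exchange to
# stdout, which usually reveals why a status line is malformed (the cause of
# httplib.BadStatusLine).
http_logger = urllib.request.HTTPHandler(debuglevel=1)

passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, 'http://someurl.com', 'username', 'password')
auth = urllib.request.HTTPBasicAuthHandler(passman)

opener = urllib.request.build_opener(http_logger, auth)
# opener.open('http://someurl.com/somescript.cgi', b'foo=bar')  # prints wire traffic
```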

Making HTTP GET, POST, PUT, and DELETE requests with JSON data in Python

安稳与你 submitted on 2019-11-29 00:20:09

1. A brief introduction to JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language-independent, but it uses conventions familiar from the C family of languages (including C, C++, C#, Java, JavaScript, Perl, Python, and others). These properties make JSON an ideal data-interchange language.

2. HTTP request methods

The HTTP/1.1 protocol defines eight methods (sometimes called "verbs") that indicate the different operations to perform on the resource identified by the Request-URI:

- OPTIONS - Returns the HTTP request methods the server supports for the specified resource. It can also be used to test a server's capabilities by sending a request for '*'.
- HEAD - Asks the server for a response identical to the one a GET request would receive, except that the response body is not returned. This lets a client obtain the meta-information in the response headers without transferring the entire response content.
- GET - Requests the specified resource. Note: GET should not be used for operations that cause side effects, for example in web applications; one reason is that GET requests may be followed arbitrarily by web spiders and the like.
- POST -
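The methods above can be exercised from Python. A minimal sketch using Python 3's urllib.request, whose Request accepts a method argument (Python 2's urllib2 needed a Request subclass for verbs other than GET/POST); the URL and JSON payload here are made up:

```python
import json
import urllib.request  # Python 3; urllib2 in Python 2 lacked the method= parameter

# JSON payload and headers shared by every request (the URL is a placeholder).
payload = json.dumps({'name': 'widget', 'qty': 3}).encode('utf-8')
headers = {'Content-Type': 'application/json'}

# One Request object per verb, each carrying the same JSON body.
requests = {
    method: urllib.request.Request(
        'http://example.com/api/items',
        data=payload, headers=headers, method=method)
    for method in ('POST', 'PUT', 'DELETE')
}

# urllib.request.urlopen(requests['PUT'])  # would actually send it
print(requests['PUT'].get_method())   # PUT
```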

Python Web Scraping: Moving Beyond the Basics

做~自己de王妃 submitted on 2019-11-28 23:29:30

Python's urllib and urllib2 modules both perform operations related to requesting URLs, but they provide different functionality. Their two most notable differences are these: urllib2 can accept a Request object, through which you can set the headers of a URL request, whereas urllib only accepts a URL string. This means you cannot spoof your User-Agent string and so on. urllib provides the urlencode method, used to generate GET query strings, which urllib2 lacks. This is why urllib and urllib2 are often used together.

    # Scrape jokes from Qiushibaike
    import urllib, urllib2
    import re
    import sys

    page = 2

    def getPage(page_num=1):
        url = "https://www.qiushibaike.com/8hr/page/" + str(page_num)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
        try:
            request = urllib2.Request(url, headers=headers)
            response = urllib2
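The division of labor described above (urllib for urlencode, urllib2 for Request objects with headers) can be sketched as follows, shown with Python 3's merged urllib.parse and urllib.request; the query values are made up:

```python
from urllib.parse import urlencode   # urllib.urlencode in Python 2
from urllib.request import Request   # urllib2.Request in Python 2

# urlencode (historically urllib's job) builds the GET query string;
# Request (historically urllib2's job) carries custom headers that a
# plain urlopen(url) call cannot set.
query = urlencode({'page': 2, 'q': 'jokes'})
url = 'https://www.qiushibaike.com/8hr/page/?' + query
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

print(query)                          # page=2&q=jokes
print(req.get_header('User-agent'))   # Mozilla/5.0
```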

Permanent 'Temporary failure in name resolution' after running for a number of hours

送分小仙女□ submitted on 2019-11-28 23:15:56

After running for a number of hours on Linux, my Python 2.6 program that uses urllib2, httplib and threads starts raising this error for every request:

    <class 'urllib2.URLError'> URLError(gaierror(-3, 'Temporary failure in name resolution'),)

If I restart the program it starts working again. My guess is some kind of resource exhaustion, but I don't know how to check for it. How do I diagnose and fix the problem? This was caused by a library's failure to close connections, leading to a large number of connections stuck in the CLOSE_WAIT state. Eventually this causes the 'Temporary failure in name
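The cure for that kind of leak is to guarantee that every response object gets closed. A minimal sketch of the pattern with contextlib.closing; FakeResponse is a hypothetical stand-in for the file-like object urllib2.urlopen returns, so the sketch runs without a network:

```python
import contextlib

# An unclosed response keeps its socket in CLOSE_WAIT; wrapping the
# response in contextlib.closing guarantees close() runs, even if read()
# raises. FakeResponse merely mimics the real response's interface.
class FakeResponse:
    def __init__(self):
        self.closed = False
    def read(self):
        return b'body'
    def close(self):
        self.closed = True

resp = FakeResponse()
with contextlib.closing(resp) as r:
    data = r.read()
print(resp.closed)   # True
```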

How can I get the final redirect URL when using urllib2.urlopen?

不羁的心 submitted on 2019-11-28 22:37:17

Question: I'm using the urllib2.urlopen method to open a URL and fetch the markup of a webpage. Some of these sites redirect me using 301/302 redirects. I would like to know the final URL that I've been redirected to. How can I get this?

Answer 1: Call the .geturl() method of the file object returned. Per the urllib2 docs: geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed. Example:

    import urllib2

    response = urllib2.urlopen('http://tinyurl.com
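A runnable variant of that example, using Python 3's urllib.request (whose response object has the same geturl() as urllib2's) and a data: URL so that no network access or real redirect is involved:

```python
import urllib.request  # Python 3; urllib2's response object offers the same geturl()

# geturl() reports the URL that was actually retrieved; after a 301/302
# chain it is the final destination. A data: URL involves no redirect, so
# here it simply echoes the URL we opened, but the call is the same one
# you would make after a real redirect.
response = urllib.request.urlopen('data:text/plain,hello')
body = response.read()
print(response.geturl())   # data:text/plain,hello
print(body)                # b'hello'
```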

urllib.quote() throws KeyError

纵然是瞬间 submitted on 2019-11-28 22:04:21

Question: To encode the URI, I used urllib.quote("schönefeld"), but when some non-ASCII characters exist in the string, it throws:

    KeyError: u'\xe9'

Code:

    return ''.join(map(quoter, s))

My input strings are köln, brønshøj, schönefeld, etc. When I just tried printing the statements on Windows (using Python 2.7 and the PyScripter IDE) it worked, but on Linux it raises the exception (I guess the platform doesn't matter). This is what I am trying:

    from commands import getstatusoutput

    queryParams = "schönefeld";
    cmdString = "http://baseurl" +
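The root cause is that Python 2's urllib.quote cannot handle a unicode string containing non-ASCII code points; encoding to UTF-8 bytes first fixes it. A sketch with Python 3's urllib.parse.quote, which performs that UTF-8 encoding itself:

```python
from urllib.parse import quote, unquote  # Python 3; urllib.quote/unquote in Python 2

# Python 2's urllib.quote raises KeyError on u'schönefeld'; the fix there is
# to encode first:
#     urllib.quote(u'schönefeld'.encode('utf-8'))
# Python 3's quote() does the UTF-8 step internally:
encoded = quote('schönefeld')
print(encoded)            # sch%C3%B6nefeld
print(unquote(encoded))   # schönefeld
```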

Python: Clicking a button with urllib or urllib2

我是研究僧i submitted on 2019-11-28 21:40:48

I want to click a button with Python; the info for the form is automatically filled in by the webpage. The HTML code for the button that sends the request is:

    <INPUT type="submit" value="Place a Bid">

How would I go about doing this? Is it possible to click the button with just urllib or urllib2? Or will I need to use something like mechanize or twill?

Use the form target and send any input as post data, like this:

    <form target="http://mysite.com/blah.php" method="GET">
    ......
    ......
    ......
    <input type="text" name="in1" value="abc">
    <INPUT type="submit" value="Place a Bid">
    </form>

Python:

    # parse
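The answer above boils down to reproducing the form submission by hand: "clicking" the button is just sending the form's fields to its target URL with its method. A minimal sketch with Python 3's urllib; the field name in1 and the URL come from the example form, and the value is made up:

```python
from urllib.parse import urlencode
from urllib.request import Request

# No browser automation needed: send the form's fields the way the
# browser would on submit.
fields = {'in1': 'abc'}                  # the value the page pre-fills
action = 'http://mysite.com/blah.php'    # the form's target

# method="GET": the fields travel in the query string
get_url = action + '?' + urlencode(fields)

# method="POST": the same fields go in the request body instead
post_req = Request(action, data=urlencode(fields).encode('ascii'))

print(get_url)                 # http://mysite.com/blah.php?in1=abc
print(post_req.get_method())   # POST
```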

Fetching variables inside a script tag in Python (content added from JS)

混江龙づ霸主 submitted on 2019-11-28 21:36:01

I want to fetch data from another URL, for which I am using urllib and Beautiful Soup. My data is inside a table tag (which I figured out using the Firefox console). But when I tried to fetch the table by its id, the result was None, so I guess this table must be added dynamically by some JS code. I have tried both parsers, 'lxml' and 'html5lib', but I still can't get the table data. I have also tried one more thing:

    web = urllib.urlopen("my url")
    html = web.read()
    soup = BeautifulSoup(html, 'lxml')
    js = soup.find("script")
    ss = js.prettify()
    print ss

Result:

    <script type="text/javascript">
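When a table is filled in by inline JavaScript, one workaround is to pull the variable's literal value out of the script text with a regular expression instead of running the JS. A minimal sketch; the html string and the tableData variable name are hypothetical:

```python
import json
import re

# A regex over the raw page source extracts the JS array literal without
# executing any JavaScript; when the literal is plain JSON, json.loads
# turns it into Python data. (html stands in for what urllib would return.)
html = '<script type="text/javascript">var tableData = [1, 2, 3];</script>'
match = re.search(r'var\s+tableData\s*=\s*(\[[^\]]*\]);', html)
data = json.loads(match.group(1))
print(data)   # [1, 2, 3]
```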

A Worked Example of Scraping Web Data

放肆的年华 submitted on 2019-11-28 21:14:04

Contents:
Preface
The browser-like component used for scraping
Finding the data we need
Extracting the data with the DOM
Parsing the data with regular expressions
Will Baotu Spring stop flowing in 2018?
URL analysis
Downloading the page
Parsing the data
Scraping the complete data set
Considerations for storing and retrieving the data
Plotting the water-level curve
Data analysis

Preface

Generally speaking, web data scraping means downloading data over the http/https/ftp protocols; in plain terms, it means fetching the data we need from a particular web page. Think about browsing a web page: the process can be roughly divided into two steps:

1. Type the address into the browser's address bar and open the page
2. Find the information you need with your eyes

In fact, scraping data from the web is the same process as browsing it, with the same two steps; only the tools differ slightly:

1. Use a component that acts as the "browser" to download the page (source) at the URL
2. Use programmatic techniques to find the data we need in the downloaded page (source)

The browser-like component used for scraping

Python has two built-in modules, urllib and urllib2, that can serve as the "browser" for scraping; pycurl is also a good choice and can handle more complex requirements. As we know, the HTTP protocol defines eight methods, and a real browser supports at least two ways of requesting a page: GET and POST. Compared with urllib2, urllib only accepts a string argument, cannot specify the request method, and cannot set request headers. For these reasons, urllib2 is considered the preferred "browser" for scraping. This is the simplest use of the urllib2 module:

    import urllib2
    response =
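The snippet cut off above is the canonical two-line usage. A complete, runnable version with Python 3's urllib.request (the module that absorbed urllib2); a data: URL stands in for a real page so the sketch needs no network access:

```python
import urllib.request  # Python 3; 'import urllib2' in Python 2

# urlopen returns a file-like response object; read() gives the page source.
response = urllib.request.urlopen('data:text/plain,Baotu%20Spring')
source = response.read()
print(source)   # b'Baotu Spring'
```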

Scrape a web page that requires they give you a session cookie first

社会主义新天地 submitted on 2019-11-28 19:57:43

I'm trying to scrape an Excel file from a government "muster roll" database. However, the URL I have to use to access this Excel file:

    http://nrega.ap.gov.in/Nregs/FrontServlet?requestType=HouseholdInf_engRH&hhid=192420317026010002&actionVal=musterrolls&type=Normal

requires that I have a session cookie from the government site attached to the request. How could I grab the session cookie with an initial request to the landing page (when they give you the session cookie) and then use it to hit the URL above to grab our Excel file? I'm on Google App Engine using Python. I tried this:

    import urllib2
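The usual recipe for this flow is a cookie-aware opener. A minimal sketch with Python 3's http.cookiejar and urllib.request (cookielib and urllib2 in Python 2); the commented-out opens show the intended two-step flow without actually hitting the site:

```python
import http.cookiejar   # 'cookielib' in Python 2
import urllib.request   # 'urllib2' in Python 2

# An opener built around a CookieJar stores any Set-Cookie header from the
# first response and replays it automatically on later requests: exactly the
# "hit the landing page, then the Excel URL" flow the question describes.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open('http://nrega.ap.gov.in/Nregs/')  # landing page sets the session cookie
# opener.open('http://nrega.ap.gov.in/Nregs/FrontServlet?...')  # cookie sent automatically
print(len(jar))   # 0 until a request is actually made
```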