urllib2

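Python + url2 web-scraping techniques

China☆狼群 submitted on 2019-12-19 12:38:47
Scraping breaks down into roughly three steps. First, use Python's url library to handle the network side, so the script can automatically open many pages and download their HTML; this part is simple boilerplate and takes little thought. Second, analyze the target site's HTML and see how the content you want to scrape is wrapped in its tags. Third, build patterns with Python regular expressions, match them against the whole HTML, and print whatever matches. That is the overall process; the key points are how to open the URLs and how to match correctly.
Before matching at scale, it is best to try the pattern on a single page first as an experiment. Otherwise a wrong pattern prints a pile of wrong results and wastes time.
This time I used the scrape crawler framework.
One thing to note: in Python 3, urllib2 has been merged into urllib. If you want to use urllib2, it is now the request module inside urllib, so import it separately:
    import urllib.request as urllib2
Then you can happily play with it! You can write a few lines to try it out:
    import urllib
    import urllib.request as urllib2
    import urllib3
    response = urllib2.urlopen("http://www.smpeizi.com")
    print(response.read())
It is only two statements, but they return a large amount of data. In fact, the urlopen argument above can take a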
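The three steps described above can be sketched as follows. This is a minimal illustration, not the author's code: the `<h2 class="title">` pattern, the sample HTML, and the names `fetch_html` and `extract_titles` are all assumptions made for the example. It also follows the advice to try the pattern on one small page before running it widely.

```python
import re
import urllib.request


def fetch_html(url, encoding="utf-8"):
    """Step 1: download a page and decode it to text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode(encoding, errors="replace")


# Step 2 result (hypothetical): the wanted text sits inside <h2 class="title">.
TITLE_PATTERN = re.compile(r'<h2 class="title">(.*?)</h2>', re.S)


def extract_titles(html):
    """Step 3: return every captured title found in the HTML text."""
    return [match.strip() for match in TITLE_PATTERN.findall(html)]


# Try the pattern on one small sample before pointing it at many pages.
SAMPLE_HTML = """
<div><h2 class="title">First post</h2><p>body</p></div>
<div><h2 class="title"> Second post </h2></div>
"""
print(extract_titles(SAMPLE_HTML))
```

Once the pattern behaves on the sample, `extract_titles(fetch_html(url))` applies the same match to a live page.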

Fetching a URL from a basic-auth protected Jenkins server with urllib2

依然范特西╮ submitted on 2019-12-19 11:22:34
Question: I'm trying to fetch a URL from a Jenkins server. Until somewhat recently I was able to use the pattern described on this page (HOWTO Fetch Internet Resources Using urllib2) to create a password manager that correctly responded to BasicAuth challenges with the user name and password. All was fine until the Jenkins team changed their security model, and that code no longer worked.
    # DOES NOT WORK!
    import urllib2
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    top_level_url = "http:/
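A password manager only sends credentials after the server issues an authentication challenge. One possible workaround, assuming the server no longer issues that challenge (the excerpt does not confirm this), is to attach the Authorization header to the first request. The sketch below uses Python 3's `urllib.request`; the URL, user name, password, and the helper name `build_preemptive_request` are placeholders.

```python
import base64
import urllib.request


def build_preemptive_request(url, username, password):
    """Return a Request that carries Basic credentials on the first attempt."""
    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


# Placeholder values; nothing is sent until urlopen() is called on the request.
req = build_preemptive_request(
    "http://jenkins.example.com/job/demo/api/json", "user", "secret"
)
print(req.get_header("Authorization"))
```

Passing `req` to `urllib.request.urlopen` would then send the credentials without waiting for a challenge. Basic credentials are only base64-encoded, so this is best used over HTTPS.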

Fetching a URL from a basic-auth protected Jenkins server with urllib2

笑着哭i submitted on 2019-12-19 11:22:02
Question: I'm trying to fetch a URL from a Jenkins server. Until somewhat recently I was able to use the pattern described on this page (HOWTO Fetch Internet Resources Using urllib2) to create a password manager that correctly responded to BasicAuth challenges with the user name and password. All was fine until the Jenkins team changed their security model, and that code no longer worked.
    # DOES NOT WORK!
    import urllib2
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    top_level_url = "http:/

Installing python modules through proxy

落爺英雄遲暮 submitted on 2019-12-19 10:02:10
Question: I want to install a couple of Python packages that use easy_install. They use the urllib2 module in their setup script. I tried using the company proxy to let easy_install download the required packages. To test the proxy connection, I tried the following code. I don't need to supply any credentials for the proxy in IE.
    proxy = urllib2.ProxyHandler({"http":"http://mycompanyproxy-as-in-IE:8080"})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)
    site = urllib2.urlopen("http://google
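For reference, the same proxy test can be written with Python 3's `urllib.request`, where `ProxyHandler` and `build_opener` keep the names they had in urllib2. This is a sketch only: the proxy address is a placeholder and `build_proxy_opener` is a name chosen for the example.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests through proxy_url."""
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)


# Placeholder proxy address; replace it with the real host and port.
opener = build_proxy_opener("http://proxy.example.com:8080")
# opener.open("http://example.com", timeout=10) would send the request via the proxy.
```

Using the returned opener directly, rather than calling `install_opener`, keeps the proxy setting local to this code. urllib also reads the `http_proxy` and `https_proxy` environment variables by default, which may be enough for tools that call it internally.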

urllib2 python (Transfer-Encoding: chunked)

為{幸葍}努か submitted on 2019-12-19 08:17:13
Question: I used the following Python code to download the HTML page:
    response = urllib2.urlopen(current_URL)
    msg = response.read()
    print msg
For a page such as this one, it opens the URL without error but then prints only part of the HTML page! The HTTP headers of the page are listed in the following lines. I think the problem is due to "Transfer-Encoding: chunked". It seems urllib2 returns only the first chunk! I have difficulty reading the remaining chunks. How can I read the remaining
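One way to investigate a partial body is to read the response in a loop until `read()` returns an empty result, instead of relying on a single call. The sketch below shows that loop against an in-memory stand-in, so it runs offline. Whether it resolves the symptom in the question depends on the server, and the excerpt does not say. The name `read_all` and the chunk size are assumptions.

```python
import io


def read_all(response, chunk_size=8192):
    """Read a file-like response in pieces until it is exhausted."""
    parts = []
    while True:
        chunk = response.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        parts.append(chunk)
    return b"".join(parts)


# Stand-in for a network response so the loop can be tried offline.
fake_response = io.BytesIO(b"<html>" + b"x" * 20000 + b"</html>")
body = read_all(fake_response, chunk_size=4096)
print(len(body))
```

The object returned by `urlopen` is file-like, so it can be passed to `read_all` in place of `fake_response`.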

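[Python development] Installing Python packages with anaconda3

空扰寡人 submitted on 2019-12-19 05:03:27
Environment
Computer: Windows 7, 64-bit. Installed version: anaconda3, Python 3.6.
Reference links
http://python.jobbole.com/86236/ (one short point in this link explains how to speed up package downloads)
https://stackoverflow.com/questions/38739694/install-python-package-package-missing-in-current-win-64-channels
1. Installing with the conda command
I originally wanted to install the urllib2 package, but searching for urllib2 on the Anaconda website turned up no build for 64-bit Windows 7, so I downloaded urllib3 instead. Open Anaconda Prompt and enter the command conda install urllib2; it fails with an error saying the package is missing from the current channels.
Solution: search for urllib in Anaconda. Only some of the urllib3 packages support win-64, so I downloaded conda-forge/urllib3; conda-forge is one of the channels the error above refers to. The command conda install -c conda-forge urllib3 downloaded it successfully.
2. Installing with pip
After installing anaconda3, change into the folder that contains pip.exe (pip.exe is in the Scripts folder under the anaconda3 installation directory)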
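Before installing anything, it can help to check which of these modules the current interpreter can already find. The sketch below uses the standard `importlib.util.find_spec`; the helper name `is_importable` is an assumption. On Python 3, `urllib.request` ships with the standard library, whereas urllib2 is a Python 2 standard-library module rather than a separately packaged one, and urllib3 is a third-party package.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current interpreter can locate module_name."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing.
        return False


for name in ("urllib.request", "urllib2", "urllib3"):
    print(name, is_importable(name))
```

The result for urllib3 depends on whether it has been installed in the active environment.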

urllib downloading contents of an online directory

社会主义新天地 submitted on 2019-12-19 02:28:20
Question: I'm trying to make a program that will open a directory, use regular expressions to get the names of the PowerPoint files, and then create files locally and copy their content. When I run it, it appears to work, but when I try to open the files they keep saying the version is wrong.
    from urllib.request import urlopen
    import re
    urlpath = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/')
    string = urlpath.read().decode('utf-8')
    pattern = re.compile('ch
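A common cause of "wrong version" errors with downloaded Office files is writing the bytes through a text-mode file or decoding them first. The excerpt is truncated, so this is only a guess at the cause. The sketch below shows the binary-safe path: match file names in the listing text, but write each file's raw bytes with mode "wb". The link pattern, the sample listing, and the helper names are hypothetical.

```python
import os
import re
import tempfile

# Hypothetical pattern: href values ending in .ppt or .pptx.
LINK_PATTERN = re.compile(r'href="([^"]+\.pptx?)"', re.I)


def find_presentations(listing_html):
    """Return the presentation file names linked from a directory listing."""
    return LINK_PATTERN.findall(listing_html)


def save_bytes(data, path):
    """Write raw bytes unchanged; "wb" avoids text encoding and newline translation."""
    with open(path, "wb") as handle:
        handle.write(data)


sample_listing = (
    '<a href="chapter1.ppt">chapter1.ppt</a> '
    '<a href="notes.txt">notes.txt</a> '
    '<a href="chapter2.pptx">chapter2.pptx</a>'
)
names = find_presentations(sample_listing)
print(names)

# Round-trip every possible byte value to show nothing is altered on disk.
payload = bytes(range(256))
target = os.path.join(tempfile.mkdtemp(), names[0])
save_bytes(payload, target)
```

With a live listing, the bytes for each file would come from `urlopen(file_url).read()` with no `.decode()` call; only the listing page itself needs decoding for the regular expression.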

urllib downloading contents of an online directory

不羁的心 submitted on 2019-12-19 02:28:05
Question: I'm trying to make a program that will open a directory, use regular expressions to get the names of the PowerPoint files, and then create files locally and copy their content. When I run it, it appears to work, but when I try to open the files they keep saying the version is wrong.
    from urllib.request import urlopen
    import re
    urlpath = urlopen('http://www.divms.uiowa.edu/~jni/courses/ProgrammignInCobol/presentation/')
    string = urlpath.read().decode('utf-8')
    pattern = re.compile('ch

Introduction to Python web scraping

落爺英雄遲暮 submitted on 2019-12-18 20:42:36
Using Python's built-in urllib library
1. In Python 2.x the library exists as urllib and urllib2; in Python 3.x these are consolidated into the urllib package, chiefly urllib.request. Out of habit it is often imported under the name urllib2: import urllib.request as urllib2. For example:
    >>> import urllib.request as urllib2
    >>> import urllib
    >>> dir(urllib)
    ['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'error', 'parse', 'request', 'response']
    >>>
Source: CSDN Author: 博乐Bar Link: https://blog.csdn.net/huanzx/article/details/103602267
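If the same script has to run on both Python 2 and Python 3, the aliasing described above is often wrapped in a try/except import. A minimal sketch:

```python
try:
    import urllib2  # Python 2: urllib2 is a top-level standard-library module
except ImportError:
    import urllib.request as urllib2  # Python 3: same role, new location

# Either way, the familiar entry points are available under one name.
print(hasattr(urllib2, "urlopen"), hasattr(urllib2, "Request"))
```

The alias only covers what lives in `urllib.request`. Helpers such as `urlencode` and `quote` are in `urllib.parse` on Python 3, so code that uses them needs a similar fallback.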

ValueError: unknown url type in urllib2, though the url is fine if opened in a browser

我与影子孤独终老i submitted on 2019-12-18 14:12:59
Question: Basically, I am trying to download a URL using urllib2 in Python. The code is the following:
    import urllib2
    req = urllib2.Request('www.tattoo-cover.co.uk')
    req.add_header('User-agent','Mozilla/5.0')
    result = urllib2.urlopen(req)
It raises ValueError and the program crashes for the URL in the example. When I access the URL in a browser, it works fine. Any ideas how to handle the problem? UPDATE: thanks to Ben James and sth, the problem has been identified: add 'http://'. Now the question is refined:
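The fix named in the update, adding 'http://', can be applied automatically when a URL arrives without a scheme. The sketch below uses Python 3's `urllib.request`. The helper name `ensure_scheme` and the choice of http as the default scheme are assumptions made for the example.

```python
import urllib.request


def ensure_scheme(url, default_scheme="http"):
    """Prefix default_scheme when url has no "scheme://" part."""
    if "://" in url:
        return url
    return "{}://{}".format(default_scheme, url)


# Build the request from the question with the scheme added; nothing is sent here.
req = urllib.request.Request(ensure_scheme("www.tattoo-cover.co.uk"))
req.add_header("User-agent", "Mozilla/5.0")
print(req.full_url)
```

A browser guesses the scheme for a bare host name, while urllib needs it spelled out to choose a protocol handler. That is why the same address works in a browser and fails in the script.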