urllib

Commonly used modules for Python web scraping

折月煮酒 submitted on 2019-11-29 14:41:39
For simple crawlers, Python (Python 3 here) has good libraries that are easy to pick up. Python standard library – logging module: logging can stand in for print, saving what would go to standard output into a log file, and can partly replace print-based debugging. The re module: regular expressions. The sys module: system-related utilities; sys.argv (returns a list of all command-line arguments), sys.exit (exits the program). Python standard library – urllib module: urllib.request.urlopen can open HTTP (mainly), HTTPS and FTP URLs. Its parameters include url, the address to request (full form: scheme in front, port at the end, e.g. http://192.168.1.1:80); data, used when submitting to the URL via POST; cafile, for CA certificate verification; and timeout, the timeout setting. The returned response object has three extra methods: geturl() returns the response's URL, commonly used with redirects; info() returns the response's basic header information; getcode() returns the response's status code. 1. request: the most common use of urllib.request is to call urllib.request.urlopen() directly to send a request, but that is usually not good practice; a complete request should also carry information such as headers. To keep a crawler from being detected, we typically set the headers ourselves and forge browser header information, as sketched below.
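A minimal sketch of what the excerpt describes: building a urllib.request.Request with forged browser headers, passing a timeout, and reading geturl()/info()/getcode() from the response. The URL and User-Agent string are placeholders, not values from the original post.

```python
import urllib.request

# Placeholder target and User-Agent; substitute your own values.
url = "http://example.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

req = urllib.request.Request(url, headers=headers)      # attach forged browser headers
with urllib.request.urlopen(req, timeout=10) as resp:   # timeout in seconds
    print(resp.geturl())    # final URL, useful after redirects
    print(resp.getcode())   # HTTP status code, e.g. 200
    print(resp.info())      # response headers
    html = resp.read().decode("utf-8")
```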

hangs on open url with urllib (python3)

荒凉一梦 submitted on 2019-11-29 14:17:44
I'm trying to open a URL with Python 3: import urllib.request fp = urllib.request.urlopen("http://lebed.com/") mybytes = fp.read() mystr = mybytes.decode("utf8") fp.close() print(mystr) But it hangs on the second line. What is the reason for this problem and how can I fix it? I suppose the reason is that the site does not allow robot visits. You need to fake a browser visit by sending browser headers along with your request: import urllib.request url = "http://lebed.com/" req = urllib.request.Request( url, data=None, headers={ 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3)
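A sketch completing the idea from the truncated answer: wrap the URL in a Request that carries a browser-like User-Agent (the exact string is only an example), and add a timeout so a genuinely unresponsive server cannot hang the script forever.

```python
import urllib.request

url = "http://lebed.com/"
# Example User-Agent only; any realistic browser string should work.
req = urllib.request.Request(
    url,
    data=None,
    headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0 Safari/537.36"},
)
with urllib.request.urlopen(req, timeout=30) as fp:
    print(fp.read().decode("utf8"))
```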

[Crawler] Python web scraping

99封情书 submitted on 2019-11-29 13:07:14
Crawler chapter. 1. How Python accesses the internet: URL (web address) + lib => urllib. 2. When in doubt, check the documentation: the Python docs. 3. response = urllib.request.urlopen("http://www.baidu.com"); html = response.read().decode("utf-8") decodes the binary response into text. 4. Reading images from a web page: save the file in 'wb' (binary) mode. urlopen can also take a Request object instead of a plain URL. In the browser, Inspect Element -> Network shows the traffic exchanged with the server (useful when submitting POST forms from Python). GET: request data from the server. POST: submit data to be processed to a given server. At 8:45 in the video, click the POST request (translate?smartresult); open it and click Preview to see the translated content. (1) Then analyze the headers: ① Status Code: 200 means a normal response, 404 an abnormal one. ② Request Headers: the server usually uses the User-Agent below to tell whether the visit comes from a browser or from code. ③ Form Data: the main payload submitted by the POST. (2) POST needs its data in a specific format, which can be produced with urllib.parse. (3) Implementing the translation POST in code (see the sketch below): import urllib
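The excerpt stops at the import; a minimal sketch of the POST idea it describes follows. The endpoint and form field names are placeholders, not the real translation service's (which also expects its own tokens and signatures); the point is only how urllib.parse.urlencode turns Form Data into the bytes that urlopen expects.

```python
import urllib.request
import urllib.parse

# Placeholder endpoint and form fields; inspect the Form Data panel
# in the browser's Network tab to find the real ones.
url = "http://example.com/translate"
form = {"i": "hello", "doctype": "json"}

data = urllib.parse.urlencode(form).encode("utf-8")   # POST data must be bytes
req = urllib.request.Request(url, data=data,
                             headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```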

Python 3- How to retrieve an image from the web and display in a GUI using TKINTER?

心不动则不痛 submitted on 2019-11-29 11:13:49
I want a function that, when a button is clicked, will take an image from the web using urllib and display it in a GUI using tkinter. I'm new to both urllib and tkinter, so I'm having an incredibly difficult time doing this. I tried the following, but it obviously doesn't work because it uses a textbox and will only display text. def __init__(self, root): self.root = root self.root.title('Image Retrieval Program') self.init_widgets() def init_widgets(self): self.btn = ttk.Button(self.root, command=self.get_url, text='Get Url', width=8) self.btn.grid(column=0, row=0, sticky='w') self.entry = ttk.Entry
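A possible sketch of what the asker wants, assuming the Pillow package is installed (tkinter's built-in PhotoImage handles only a few formats, so Pillow is used to cover JPEGs as well). The widget layout loosely mirrors the truncated snippet; the URL typed into the entry is expected to be a direct link to an image.

```python
import io
import urllib.request
import tkinter as tk
from tkinter import ttk
from PIL import Image, ImageTk  # Pillow, assumed installed: pip install pillow

class ImageApp:
    def __init__(self, root):
        self.root = root
        self.root.title('Image Retrieval Program')
        self.entry = ttk.Entry(root, width=60)
        self.entry.grid(column=0, row=0, sticky='w')
        self.btn = ttk.Button(root, command=self.get_url, text='Get Url', width=8)
        self.btn.grid(column=1, row=0, sticky='w')
        self.label = tk.Label(root)              # the downloaded image goes here
        self.label.grid(column=0, row=1, columnspan=2)

    def get_url(self):
        url = self.entry.get()                   # direct link to a .png / .jpg / .gif
        data = urllib.request.urlopen(url).read()
        image = Image.open(io.BytesIO(data))
        self.photo = ImageTk.PhotoImage(image)   # keep a reference, or it is garbage-collected
        self.label.config(image=self.photo)

if __name__ == '__main__':
    root = tk.Tk()
    ImageApp(root)
    root.mainloop()
```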

Python unable to retrieve form with urllib or mechanize

让人想犯罪 __ submitted on 2019-11-29 11:00:02
I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems. The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php . The page is in Dutch, but this is irrelevant to my problem. It may be noteworthy that the form action redirects to http://zrs.leidenuniv.nl/ul/query.php . First of all, this is the urllib/urllib2 method I've tried: import urllib, urllib2 import socket, cookielib url = 'http://zrs.leidenuniv.nl/ul/start.php' params
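A Python 2 sketch along the lines of what the excerpt sets up (urllib/urllib2 plus cookielib): fetch start.php first so session cookies are established, then post to the query.php action. The form field names here are hypothetical; the real ones have to be read from the page's HTML.

```python
import urllib, urllib2
import cookielib

url = 'http://zrs.leidenuniv.nl/ul/start.php'
action = 'http://zrs.leidenuniv.nl/ul/query.php'   # the form's action target

# The result page usually depends on session state, so keep cookies.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open(url)                                   # obtain the session cookie first

# Hypothetical field names; inspect the real form to find the actual ones.
params = urllib.urlencode({'day': '1', 'month': '1', 'year': '2014'})
response = opener.open(action, params)
print response.read()
```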

Opening Local File Works with urllib but not with urllib2

梦想的初衷 submitted on 2019-11-29 10:56:01
Question: I'm trying to open a local file using urllib2. How can I go about doing this? When I try the following line with urllib: resp = urllib.urlopen(url) it works correctly, but when I switch it to: resp = urllib2.urlopen(url) I get: ValueError: unknown url type: /path/to/file where that file definitely does exist. Thanks! Answer 1: Just put "file://" in front of the path >>> import urllib2 >>> urllib2.urlopen("file:///etc/debian_version").read() 'wheezy/sid\n' Answer 2: In the urllib.urlopen method: If the URL

How to check if urllib2 follows a redirect?

心不动则不痛 submitted on 2019-11-29 10:45:11
I've written this function: def download_mp3(url,name): opener1 = urllib2.build_opener() page1 = opener1.open(url) mp3 = page1.read() filename = name+'.mp3' fout = open(filename, 'wb') fout.write(mp3) fout.close() This function takes a url and a name, both as strings, then downloads an mp3 from the url and saves it under that name. The url is in the form http://site/download.php?id=xxxx, where xxxx is the id of an mp3; if this id does not exist the site redirects me to another page. So, the question is: how can I check whether this id exists? I've tried to check if the url exist with a
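Since urllib2 follows redirects automatically, one common check is to compare the URL you asked for with response.geturl() after the request; if they differ, a redirect happened. A Python 2 sketch of the function rewritten along those lines (the redirect-means-missing-id assumption comes from the question itself):

```python
import urllib2

def download_mp3(url, name):
    opener = urllib2.build_opener()
    page = opener.open(url)
    # urllib2 follows redirects transparently; a changed final URL
    # means the id was redirected, i.e. it does not exist.
    if page.geturl() != url:
        print 'id not found, redirected to', page.geturl()
        return False
    with open(name + '.mp3', 'wb') as fout:
        fout.write(page.read())
    return True
```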

Python and urllib

廉价感情. submitted on 2019-11-29 10:35:36
I'm trying to download a zip file ("tl_2008_01001_edges.zip") from an ftp census site using urllib. What form is the zip file in when I get it and how do I save it? I'm fairly new to Python and don't understand how urllib works. This is my attempt: import urllib, sys zip_file = urllib.urlretrieve("ftp://ftp2.census.gov/geo/tiger/TIGER2008/01_ALABAMA/Autauga_County/", "tl_2008_01001_edges.zip") If I know the list of ftp folders (or counties in this case), can I run through the ftp site list using the glob function? Thanks. gimel: Use urllib2.urlopen() for the zip file data and directory listing.
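On the urlretrieve side, the usual stumbling block is that the first argument has to be the full URL of the file itself, while the second is just the local filename to save it under. A Python 2 sketch (assuming that FTP path is still valid):

```python
import urllib

base = "ftp://ftp2.census.gov/geo/tiger/TIGER2008/01_ALABAMA/Autauga_County/"
filename = "tl_2008_01001_edges.zip"

# urlretrieve downloads straight to disk and returns (local_path, headers).
local_path, headers = urllib.urlretrieve(base + filename, filename)
print local_path   # the zip is on disk; open it with the zipfile module if needed
```

glob only works on the local filesystem, so to walk the county folders you would list the FTP directory (for example with ftplib) rather than globbing the URL.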

Response time for urllib in python

此生再无相见时 submitted on 2019-11-29 10:32:11
Question: I want to measure the response time when I use urllib. I wrote the code below, but the value I get is larger than the actual response time. Can I get the time using urllib, or is there another method? import urllib import datetime def main(): urllist = [ "http://google.com", ] for url in urllist: opener = urllib.FancyURLopener({}) try: start = datetime.datetime.now() f = opener.open(url) end = datetime.datetime.now() diff = end - start print int(round(diff.microseconds / 1000)) except IOError, e: print 'error', url else: print f
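Two things inflate the number in the excerpt: the measurement includes DNS lookup and connection setup, and diff.microseconds only holds the sub-second component of the timedelta rather than the whole duration. A Python 2 sketch that times the phases separately with time.time():

```python
import urllib
import time

def measure(url):
    opener = urllib.FancyURLopener({})
    start = time.time()
    f = opener.open(url)        # returns once the response headers arrive
    first_response = time.time()
    f.read()                    # downloading the body takes extra time
    done = time.time()
    print 'response: %d ms, total: %d ms' % (
        (first_response - start) * 1000, (done - start) * 1000)

measure("http://google.com")
```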

SSL: CERTIFICATE_VERIFY_FAILED with urllib

≯℡__Kan透↙ submitted on 2019-11-29 09:27:31
Question: I'm running into trouble with the urllib module (Python 3.6). Every time I use the module, I get a page's worth of errors. What's wrong with urllib and how can I fix it? import urllib.request url='https://www.goodreads.com/quotes/tag/artificial-intelligence' u1 = urllib.request.urlopen(url) print(u1) That block of code likes to spit out this mouthful of stuff: Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 1318,
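This error usually means the Python installation cannot find a CA bundle to verify HTTPS certificates; on a python.org macOS install, running the bundled "Install Certificates.command" is the standard fix. A sketch of a code-level workaround, assuming the third-party certifi package is installed:

```python
import ssl
import urllib.request
import certifi  # CA bundle package, assumed installed: pip install certifi

url = 'https://www.goodreads.com/quotes/tag/artificial-intelligence'
context = ssl.create_default_context(cafile=certifi.where())
with urllib.request.urlopen(url, context=context) as u1:
    print(u1.status, u1.reason)   # e.g. 200 OK once verification succeeds
```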