urllib2

Python urllib2.urlopen() is slow, need a better way to read several URLs

自闭症网瘾萝莉.ら submitted on 2019-12-17 05:46:26
Question: As the title suggests, I'm working on a site written in Python that makes several calls to the urllib2 module to read websites, which I then parse with BeautifulSoup. Since I have to read 5-10 sites, the page takes a while to load. Is there a way to read all the sites at once, or any tricks to make it faster? For example, should I close the urllib2.urlopen after each read, or keep it open? Added: also, if I were to just switch over to PHP, would that be faster for fetching and…
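For reference, the usual fix is to fetch the pages in parallel rather than one after another. A minimal sketch, assuming Python 2 with the stdlib threading module and placeholder URLs:

import threading
import urllib2

def fetch(url, results, index):
    # Each thread stores its page body in its own slot of the shared list.
    results[index] = urllib2.urlopen(url).read()

urls = ['http://example.com/a', 'http://example.com/b']  # placeholders
results = [None] * len(urls)
threads = [threading.Thread(target=fetch, args=(u, results, i))
           for i, u in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results now holds the raw HTML of every page, ready for BeautifulSoup.

The total wait then approaches the slowest single fetch instead of the sum of all 5-10 fetches; switching languages would not change this network-bound cost.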

How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

天涯浪子 submitted on 2019-12-17 05:02:02
Question: I'm running a Python program that fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup. However, when I write this text to a file (or print it to the console), it gets written in an unexpected encoding. Sample program:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')

# Response has UTF-8 charset header,
# and HTML body which is…
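For reference, a minimal sketch of the usual pattern, assuming Python 2 and the BeautifulSoup 3 import style used in the question: let BeautifulSoup decode the page to Unicode, then encode explicitly when writing out.

import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://www.voxnow.de/').read()
soup = BeautifulSoup(html)          # BeautifulSoup decodes the bytes to Unicode
title = soup.find('title').string   # a unicode object, not a UTF-8 byte string
out = open('out.txt', 'w')
out.write(title.encode('utf-8'))    # encode explicitly on the way out
out.close()

Note that Accept-Encoding negotiates compression (gzip, deflate), not character sets, so the header in the question has no effect on the charset.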

Python: urllib2, how to send a cookie with a urlopen request

与世无争的帅哥 submitted on 2019-12-17 03:23:01
Question: I am trying to use urllib2 to open a URL and send specific cookie text to the server. E.g. I want to open the site "Solve chess problems" with a specific cookie, e.g. search=1. How do I do it? I am trying to do the following:

import urllib2
# (need to add the cookie to the request somehow)
urllib2.urlopen("http://chess-problems.prg")

Thanks in advance

Answer 1: A cookie is just another HTTP header.

import urllib2

opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'cookiename=cookievalue'))
f…
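Completing the answer's approach into a runnable sketch, assuming Python 2 and reusing the cookie and URL given in the question:

import urllib2

opener = urllib2.build_opener()
opener.addheaders.append(('Cookie', 'search=1'))       # cookie from the question
response = opener.open('http://chess-problems.prg')   # URL as given in the question
print response.read()

For stateful sessions (cookies set by the server and echoed back), urllib2.HTTPCookieProcessor with a cookielib.CookieJar is the more robust route.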

Proxy with urllib2

拜拜、爱过 submitted on 2019-12-16 20:17:52
Question: I open URLs with:

site = urllib2.urlopen('http://google.com')

What I want to do is connect the same way via a proxy. Something I found somewhere suggested:

site = urllib2.urlopen('http://google.com', proxies={'http':'127.0.0.1'})

but that didn't work either. I know urllib2 has something like a proxy handler, but I can't recall that function.

Answer 1:

proxy = urllib2.ProxyHandler({'http': '127.0.0.1'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
urllib2.urlopen('http://www.google…
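The answer's pattern in full, as a minimal sketch assuming Python 2; a proxy address normally needs a port, so '127.0.0.1:8080' below is an assumed host:port, not part of the original question:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8080'})  # assumed port
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)   # all later urlopen() calls go through the proxy
print urllib2.urlopen('http://www.google.com').read()

Alternatively, skip install_opener() and call opener.open(url) directly to proxy only selected requests.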

An Overview of Python Web Scraping, with Practical Examples

隐身守侯 submitted on 2019-12-14 13:40:24
Chapter 1 covered background knowledge related to scraping, such as HTTP, web pages, and the legality of crawling, to give you a fairly complete picture of web scraping along with some side knowledge. Today's article is the first of Chapter 2; from here on we formally enter the hands-on stage, and many more practical cases will follow. In the first article of this scraping series, 猪哥 explained how HTTP works, and many readers wondered what a scraping tutorial has to do with HTTP. In fact, what we usually call a scraper (also known as a web crawler) is just a program that issues network requests over some network protocol, and the most widely used protocol today is the HTTP/S family.

1. Which network libraries does Python have?

When browsing normally, we click a page with the mouse and the browser issues the network request for us. How, then, do we issue a network request in Python? With a library, of course. Which libraries exactly? Here is the list:

Python 2: httplib, httplib2, urllib, urllib2, urllib3, requests
Python 3: httplib2, urllib, urllib3, requests

Python has quite a few HTTP request libraries, and all of them show up in code found online, so how are they related, and which one should you choose?

httplib/2: httplib is a built-in Python HTTP library, but it is fairly low-level and generally not used directly. httplib2 is a third-party library built on top of httplib with a more complete implementation, supporting features such as caching and compression. Normally you will need neither; they only come into play if you want to wrap network requests yourself.
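A minimal sketch contrasting the built-in urllib2 with the third-party requests library from the list above (assuming Python 2 and a placeholder URL; requests is installed with pip install requests):

import urllib2
import requests

url = 'http://example.com'  # placeholder

body_stdlib = urllib2.urlopen(url).read()  # urllib2: returns raw bytes

resp = requests.get(url)                   # requests: higher-level API
body_requests = resp.text                  # decoded text, charset handled for us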

Python Scraping Primer, Part 4: Advanced Usage of the Urllib Library

人走茶凉 submitted on 2019-12-14 08:03:06
Setting Headers

Some sites will not allow a program to access them directly in the way shown above; if the request looks suspicious, the site simply won't respond, so to fully imitate the way a browser works we need to set some header attributes.

First, open the browser's debugger with F12 (I use Chrome) and turn on network monitoring. Take Zhihu as an example: after clicking log-in, the interface changes and a new page appears. That page actually contains a great deal of content that is not loaded in one go; in reality it runs many requests. Usually the HTML file is requested first, then the JS, CSS, and so on are loaded. Only after many requests are the skeleton and muscles of the page complete, and the whole page is rendered.

If we split these requests apart and look only at the first one, you can see a Request URL, then the headers, and below that the response (the screenshot is incomplete; try it yourself). The headers contain a great deal of information: the file encoding, the compression method, the request agent, and so on.

The agent is the identity of the request. If no identity is written into the request, the server may not respond, so you can set the agent in the headers, as in the example below. This example only shows how headers are set; just take note of the format.

import urllib
import urllib2

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0…
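A runnable completion of the tutorial's truncated example, as a minimal sketch assuming Python 2; the full user-agent string and the form fields are assumed placeholders, not from the original:

import urllib
import urllib2

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  # assumed full string
values = {'username': 'user', 'password': 'pass'}              # hypothetical form fields
headers = {'User-Agent': user_agent}                           # identify as a browser

data = urllib.urlencode(values)                # form-encode the POST body
request = urllib2.Request(url, data, headers)  # headers ride along with the request
response = urllib2.urlopen(request)
page = response.read()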

Set the header in urllib.request (Python 3)

怎甘沉沦 submitted on 2019-12-14 04:07:57
Question: What does body refer to in the code below?

headers = ['Content-length']=str(len(bytes(body, 'utf-8')))
return urllib.request.Request(theurl, bytes(body, 'utf-8'), headers)

Source: BadStatusLine exception raised when returning reply from server in Python 3

From: https://stackoverflow.com/questions/38418477/set-the-header-in-urllib-request-python-3
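In the quoted snippet, body is evidently the request payload as a str (the first line as quoted is also a syntax error; headers should be a dict). A minimal working sketch in Python 3, with placeholder URL and payload:

import urllib.request

theurl = 'http://example.com/post'  # placeholder
body = 'key=value'                  # the payload string the snippet calls 'body'
data = bytes(body, 'utf-8')
headers = {'Content-Length': str(len(data))}
request = urllib.request.Request(theurl, data, headers)
# urllib.request.urlopen(request) would send it; urlopen also fills in
# Content-Length automatically when data is given, so the header is optional.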

urllib freezes if the URL is too big!

拥有回忆 submitted on 2019-12-14 03:58:30
Question: OK, I'm trying to open a URL using urllib, but the problem is that the file is too big, so when I open the URL Python freezes. I'm also using wxPython, which also freezes when I open the URL, and my CPU goes to almost 100% while the URL is open. Any solutions? Is there a way I can open the URL in chunks, maybe with a time.sleep(0.5) in there so it does not freeze? This is my code:

f = open("hello.txt", 'wb')
datatowrite = urllib.urlopen(link).read()
f.write(datatowrite)
f.close()

Thanks

Answer 1: You…
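The chunked version the question asks about, as a minimal sketch assuming Python 2 urllib; link is a placeholder for the question's variable:

import urllib

link = 'http://example.com/bigfile'  # placeholder
response = urllib.urlopen(link)
f = open('hello.txt', 'wb')
while True:
    chunk = response.read(8192)      # 8 KiB at a time instead of one huge read()
    if not chunk:
        break
    f.write(chunk)
f.close()

In a wxPython app the download should also run on a worker thread (e.g. threading.Thread), since any long call on the GUI thread freezes the event loop regardless of chunk size.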