urllib2

urllib and urllib2 in the Python Standard Library

落爺英雄遲暮 submitted on 2019-12-09 11:15:50
The urllib module provides a set of high-level interfaces for writing clients that need to interact with HTTP servers. Typical applications include scraping data from web pages, automation, proxies, web crawlers, and so on. In Python 2 this functionality is spread across several modules, including urllib, urllib2, and urlparse; in Python 3 it is all consolidated into the urllib package.

1. urlopen(url[, data[, timeout]])

Fetching an HTML page is simple:

    import urllib2
    response = urllib2.urlopen('http://www.google.com')

urlopen creates a file-like object representing the remote URL, which you then operate on like a local file to retrieve the remote data. The url parameter is the path to the remote data, usually a URL. For more complex operations, such as modifying HTTP headers, create a Request instance and pass it as the url argument. The data parameter is data to submit to the url via POST, and must be URL-encoded; timeout is an optional timeout. urlopen returns a file-like object that provides the following methods:

read(), readline(), readlines(), fileno(), close(): used exactly like their file-object counterparts
info(): returns a mimetools.Message object representing the headers returned by the remote server
getcode
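
A minimal sketch pulling these pieces together under Python 2; the URL and the User-Agent value are placeholders, not from the original text:

    import urllib2

    # Build a Request so we can customize headers, as described above.
    request = urllib2.Request('http://www.example.com',
                              headers={'User-Agent': 'my-client/1.0'})
    response = urllib2.urlopen(request, timeout=10)

    print response.getcode()   # HTTP status code, e.g. 200
    print response.info()      # response headers (mimetools.Message)
    body = response.read()     # read the body like a local file
    response.close()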

Issue scraping with Beautiful Soup

青春壹個敷衍的年華 submitted on 2019-12-09 06:22:29
Question: I've scraped websites before using this same technique, but with this website it doesn't seem to work.

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.weatheronline.co.uk/weather/maps/current?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C"
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    print soup

The output should be the content of the webpage, but instead I am just getting this: GIF89a (it
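
The GIF89a bytes mean the server answered with an image rather than the page. One common cause, offered here as a guess rather than a confirmed diagnosis, is the site serving a placeholder to clients without a browser-like User-Agent; a sketch of sending one:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = ("http://www.weatheronline.co.uk/weather/maps/current"
           "?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK"
           "&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C")
    # The User-Agent value is illustrative; any browser-like string works.
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urllib2.urlopen(request).read()
    soup = BeautifulSoup(page)
    print soup.title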

Python Process blocked by urllib2

狂风中的少年 submitted on 2019-12-09 05:47:43
Question: I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the system hangs.

    import urllib2, multiprocessing
    from threading import Thread
    from Queue import Queue
    from multiprocessing import Queue as ProcessQueue, Process

    def download(url):
        """Download a page from an url.
        url [str]: url to get.
        return [unicode]: page downloaded.
        """
        if settings.DEBUG:
            print u'Downloading %s' % url
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
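
One defensive measure worth trying, though the excerpt doesn't reveal whether it addresses the real cause of the hang, is giving urlopen a timeout inside the worker so a stuck connection can't block the process forever:

    import socket
    import urllib2

    def download(url, timeout=10):
        """Download a page, giving up after `timeout` seconds."""
        try:
            response = urllib2.urlopen(urllib2.Request(url), timeout=timeout)
            return response.read().decode('utf-8', 'replace')
        except (urllib2.URLError, socket.timeout) as e:
            print u'Failed to download %s: %s' % (url, e)
            return None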

List all files in an online directory with Python?

烈酒焚心 submitted on 2019-12-09 01:40:11
Question: I'm trying to create a Python application that downloads files from the internet, but at the moment it only downloads one file whose name I already know. Is there any way to get a list of the files in an online directory and download them? I'll show you my code for downloading one file at a time, just so you know a bit about what I want to do.

    import urllib2
    url = "http://cdn.primarygames.com/taxi.swf"
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f =
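
A sketch of one approach, which only works if the server actually exposes an HTML directory listing (many CDNs do not); the index URL below is hypothetical:

    import urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup

    index_url = 'http://example.com/files/'   # hypothetical listing page

    html = urllib2.urlopen(index_url).read()
    soup = BeautifulSoup(html)
    for link in soup.findAll('a'):
        href = link.get('href')
        if not href or href.startswith('?') or href.startswith('/'):
            continue                           # skip sort links and parent dirs
        file_url = urlparse.urljoin(index_url, href)
        file_name = file_url.split('/')[-1]
        data = urllib2.urlopen(file_url).read()
        with open(file_name, 'wb') as out:
            out.write(data)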

1. How to scrape housing listings from Lianjia pages

北战南征 submitted on 2019-12-08 22:38:14
Since my installed Python is version 2.7, all of the code that follows targets that version.

Fetching all of a page's content

We use the urllib2 package to fetch page content; first, a quick introduction to its urlopen function. urlopen stores all of the page's content in an object, and we read that object to obtain the page data. For example, fetching the Baidu homepage:

    import urllib2
    f = urllib2.urlopen('http://www.baidu.com')
    f.read(100)

The code above reads the first 100 characters of the Baidu homepage:

    '<!DOCTYPE html> <!--STATUS OK--> <html> <head> <meta http-equiv="content-type" content="text/html;charse'

Sometimes encoding issues make the output come back garbled; just decode with the right encoding:

    f.read(100).decode('utf-8')

With the same approach we can fetch a Lianjia second-hand housing listings page:

    import urllib2
    url = 'http://sz.lianjia.com/ershoufang/pg'
    res = urllib2.urlopen(url)
    content = res.read().decode('utf-8')
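
The trailing 'pg' in the URL suggests Lianjia paginates as pg1, pg2, and so on; a sketch of walking a few pages under that assumption (the page count here is arbitrary):

    import urllib2

    base_url = 'http://sz.lianjia.com/ershoufang/pg'
    pages = []
    for n in range(1, 4):              # first three pages, as an example
        res = urllib2.urlopen(base_url + str(n))
        pages.append(res.read().decode('utf-8'))
    print len(pages)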

Get a header with Python and convert it to JSON (requests - urllib2 - json)

浪尽此生 submitted on 2019-12-08 19:27:13
Question: I'm trying to get the headers from a website and encode them as JSON so I can write them to a file. I've tried two different ways without success. First, with urllib2 and json:

    import urllib2
    import json

    host = "https://www.python.org/"
    header = urllib2.urlopen(host).info()
    json_header = json.dumps(header)
    print json_header

This way I get the error: TypeError: is not JSON serializable. So I tried to bypass the issue by converting the object to a string -> json_header = str(header). This way I can json
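
The TypeError arises because info() returns a header message object rather than plain data. A sketch of one likely fix, converting its (name, value) pairs into a dict before serializing; this assumes Python 2, where the object exposes an items() method:

    import urllib2
    import json

    host = "https://www.python.org/"
    info = urllib2.urlopen(host).info()
    # dict(info.items()) turns the header object into plain, serializable data
    json_header = json.dumps(dict(info.items()))
    print json_header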

Not possible to set content-type to application/json using urllib2

[亡魂溺海] submitted on 2019-12-08 16:03:03
Question: This little baby:

    import urllib2
    import simplejson as json

    opener = urllib2.build_opener()
    opener.addheaders.append(('Content-Type', 'application/json'))
    response = opener.open('http://localhost:8000', json.dumps({'a': 'b'}))

produces the following request (as seen with ngrep):

    sudo ngrep -q -d lo '^POST .* localhost:8000'

    T 127.0.0.1:51668 -> 127.0.0.1:8000 [AP]
    POST / HTTP/1.1..Accept-Encoding: identity..Content-Length: 10..Host: localhost:8000..Content-Type: application/x-www-form
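
What appears to happen is that urllib2's HTTP handler attaches its default Content-Type for POST data to the request before the opener's addheaders are considered, and addheaders only fill in headers that are still missing. A sketch of the usual workaround, setting the header on the Request object itself (stdlib json stands in for simplejson):

    import urllib2
    import json

    data = json.dumps({'a': 'b'})
    # A header set on the Request is seen by the handler, so it won't
    # substitute application/x-www-form-urlencoded.
    request = urllib2.Request('http://localhost:8000', data,
                              {'Content-Type': 'application/json'})
    response = urllib2.urlopen(request)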

Select radiobutton with python urllib2

戏子无情 submitted on 2019-12-08 13:31:33
Question: I'm trying to select a radio button from this form with Python urllib2 and submit the form through a button:

    <div id="show-form2" class="show-form2">
    <input type="radio" name="2" value="21"/> OPTION <br/>
    <input type="radio" name="2" value="22"/> OPTION <br/>
    <input type="radio" name="2" value="23"/> OPTION <br/>
    <input type="radio" name="2" value="24"/> OPTION <br/>
    <input type="radio" name="2" value="25"/> OPTION <br/>
    <input type="radio" name="2" value="26"/> OPTION <br/>
    <input type="radio"
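
urllib2 has no notion of forms or buttons; "selecting" a radio button amounts to POSTing the field's name and the chosen value to the form's action URL. A sketch under that reading; the action URL is hypothetical, since the form's action attribute isn't shown, and other required fields may need to be included:

    import urllib
    import urllib2

    form_action = 'http://example.com/submit'   # hypothetical action URL
    form_data = urllib.urlencode({'2': '21'})   # name="2", chosen value "21"
    response = urllib2.urlopen(form_action, form_data)
    print response.read()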

urllib2 times out but doesn't close socket connection

↘锁芯ラ submitted on 2019-12-08 12:44:09
Question: I'm making a Python URL grabber program. For my purposes, I want it to time out really, really fast, so I'm doing urllib2.urlopen("http://.../", timeout=2). Of course it times out correctly, as it should. However, it doesn't bother to close the connection to the server, so the server thinks the client is still connected. How can I ask urllib2 to just close the connection after it times out? Running gc.collect() doesn't work and I'd like to not use httplib if I can't help it. The closest I can
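
Since urlopen raises on timeout without handing back anything that references the socket, one workaround, if httplib becomes unavoidable after all, is to hold the connection object yourself so close() can be called explicitly. A sketch with a placeholder host:

    import httplib
    import socket

    conn = httplib.HTTPConnection('example.com', timeout=2)  # placeholder host
    try:
        conn.request('GET', '/')
        body = conn.getresponse().read()
    except socket.timeout:
        body = None          # timed out; connection is torn down below
    finally:
        conn.close()         # close the socket either way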