urllib2

urllib and urllib2 in the Python Standard Library

落爺英雄遲暮 submitted on 2019-12-09 11:15:50
The urllib module provides a set of high-level interfaces for writing clients that need to interact with HTTP servers. Typical applications include scraping data from web pages, automation, proxies, web crawlers, and so on. In Python 2 this functionality is spread across several modules, including urllib, urllib2, and urlparse; in Python 3 it is all consolidated into the urllib package.

1. urlopen(url[, data[, timeout]])

Fetching an HTML page is simple:

    import urllib2
    response = urllib2.urlopen('http://www.google.com')

urlopen creates a file-like object representing the remote URL, which you then operate on like a local file to retrieve the remote data. The url parameter is the path to the remote data, usually a URL. For more complex operations, such as modifying HTTP headers, create a Request instance and pass it as the url argument. The data parameter is data to submit to the url via POST, and must be URL-encoded; timeout is an optional timeout. urlopen returns a file-like object that provides the following methods:

read(), readline(), readlines(), fileno(), close(): used exactly like their file-object counterparts
info(): returns a mimetools.Message object representing the headers returned by the remote server
getcode
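
A minimal sketch pulling these pieces together under Python 2; the URL and the User-Agent value are placeholders, not from the original text:

    import urllib2

    # Build a Request so we can customize headers, as described above.
    request = urllib2.Request('http://www.example.com',
                              headers={'User-Agent': 'my-client/1.0'})
    response = urllib2.urlopen(request, timeout=10)

    print response.getcode()   # HTTP status code, e.g. 200
    print response.info()      # response headers (mimetools.Message)
    body = response.read()     # read the body like a local file
    response.close()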

Issue scraping with Beautiful Soup

青春壹個敷衍的年華 submitted on 2019-12-09 06:22:29
Question: I've scraped websites before using this same technique, but with this website it doesn't seem to work.

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = "http://www.weatheronline.co.uk/weather/maps/current?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C"
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    print soup

The output should be the content of the webpage, but instead I am just getting this: GIF89a (it
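
The GIF89a bytes mean the server answered with an image rather than the page. One common cause, offered here as a guess rather than a confirmed diagnosis, is the site serving a placeholder to clients without a browser-like User-Agent; a sketch of sending one:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    url = ("http://www.weatheronline.co.uk/weather/maps/current"
           "?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK"
           "&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C")
    # The User-Agent value is illustrative; any browser-like string works.
    request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page = urllib2.urlopen(request).read()
    soup = BeautifulSoup(page)
    print soup.title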

Python Process blocked by urllib2

狂风中的少年 submitted on 2019-12-09 05:47:43
Question: I set up a process that reads a queue of incoming URLs to download, but when urllib2 opens a connection the system hangs.

    import urllib2, multiprocessing
    from threading import Thread
    from Queue import Queue
    from multiprocessing import Queue as ProcessQueue, Process

    def download(url):
        """Download a page from an url.
        url [str]: url to get.
        return [unicode]: page downloaded.
        """
        if settings.DEBUG:
            print u'Downloading %s' % url
        request = urllib2.Request(url)
        response = urllib2.urlopen(request)
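
One defensive measure worth trying, though the excerpt doesn't reveal whether it addresses the real cause of the hang, is giving urlopen a timeout inside the worker so a stuck connection can't block the process forever:

    import socket
    import urllib2

    def download(url, timeout=10):
        """Download a page, giving up after `timeout` seconds."""
        try:
            response = urllib2.urlopen(urllib2.Request(url), timeout=timeout)
            return response.read().decode('utf-8', 'replace')
        except (urllib2.URLError, socket.timeout) as e:
            print u'Failed to download %s: %s' % (url, e)
            return None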

List all files in an online directory with Python?

烈酒焚心 submitted on 2019-12-09 01:40:11
Question: I'm trying to create a Python application that downloads files from the internet, but at the moment it only downloads one file whose name I already know. Is there any way to get a list of the files in an online directory and download them? I'll show you my code for downloading one file at a time, just so you know a bit about what I want to do.

    import urllib2
    url = "http://cdn.primarygames.com/taxi.swf"
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f =
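
A sketch of one approach, which only works if the server actually exposes an HTML directory listing (many CDNs do not); the index URL below is hypothetical:

    import urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup

    index_url = 'http://example.com/files/'   # hypothetical listing page

    html = urllib2.urlopen(index_url).read()
    soup = BeautifulSoup(html)
    for link in soup.findAll('a'):
        href = link.get('href')
        if not href or href.startswith('?') or href.startswith('/'):
            continue                           # skip sort links and parent dirs
        file_url = urlparse.urljoin(index_url, href)
        file_name = file_url.split('/')[-1]
        data = urllib2.urlopen(file_url).read()
        with open(file_name, 'wb') as out:
            out.write(data)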

1. How to scrape housing listings from Lianjia pages

北战南征 submitted on 2019-12-08 22:38:14
Since my installed Python is version 2.7, all of the code that follows targets that version.

Fetching all of a page's content

We use the urllib2 package to fetch page content; first, a quick introduction to its urlopen function. urlopen stores all of the page's content in an object, and we read that object to obtain the page data. For example, fetching the Baidu homepage:

    import urllib2
    f = urllib2.urlopen('http://www.baidu.com')
    f.read(100)

The code above reads the first 100 characters of the Baidu homepage:

    '<!DOCTYPE html> <!--STATUS OK--> <html> <head> <meta http-equiv="content-type" content="text/html;charse'

Sometimes encoding issues make the output come back garbled; just decode with the right encoding:

    f.read(100).decode('utf-8')

With the same approach we can fetch a Lianjia second-hand housing listings page:

    import urllib2
    url = 'http://sz.lianjia.com/ershoufang/pg'
    res = urllib2.urlopen(url)
    content = res.read().decode('utf-8')
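
The trailing 'pg' in the URL suggests Lianjia paginates as pg1, pg2, and so on; a sketch of walking a few pages under that assumption (the page count here is arbitrary):

    import urllib2

    base_url = 'http://sz.lianjia.com/ershoufang/pg'
    pages = []
    for n in range(1, 4):              # first three pages, as an example
        res = urllib2.urlopen(base_url + str(n))
        pages.append(res.read().decode('utf-8'))
    print len(pages)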

Get a header with Python and convert it to JSON (requests - urllib2 - json)

浪尽此生 submitted on 2019-12-08 19:27:13
Question: I'm trying to get the headers from a website and encode them as JSON so I can write them to a file. I've tried two different ways without success. First, with urllib2 and json:

    import urllib2
    import json

    host = "https://www.python.org/"
    header = urllib2.urlopen(host).info()
    json_header = json.dumps(header)
    print json_header

This way I get the error: TypeError: is not JSON serializable. So I tried to bypass the issue by converting the object to a string -> json_header = str(header). This way I can json
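
The TypeError arises because info() returns a header message object rather than plain data. A sketch of one likely fix, converting its (name, value) pairs into a dict before serializing; this assumes Python 2, where the object exposes an items() method:

    import urllib2
    import json

    host = "https://www.python.org/"
    info = urllib2.urlopen(host).info()
    # dict(info.items()) turns the header object into plain, serializable data
    json_header = json.dumps(dict(info.items()))
    print json_header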

Not possible to set content-type to application/json using urllib2

[亡魂溺海] submitted on 2019-12-08 16:03:03
Question: This little baby:

    import urllib2
    import simplejson as json

    opener = urllib2.build_opener()
    opener.addheaders.append(('Content-Type', 'application/json'))
    response = opener.open('http://localhost:8000', json.dumps({'a': 'b'}))

produces the following request (as seen with ngrep):

    sudo ngrep -q -d lo '^POST .* localhost:8000'

    T 127.0.0.1:51668 -> 127.0.0.1:8000 [AP]
    POST / HTTP/1.1..Accept-Encoding: identity..Content-Length: 10..Host: localhost:8000..Content-Type: application/x-www-form
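
What appears to happen is that urllib2's HTTP handler attaches its default Content-Type for POST data to the request before the opener's addheaders are considered, and addheaders only fill in headers that are still missing. A sketch of the usual workaround, setting the header on the Request object itself (stdlib json stands in for simplejson):

    import urllib2
    import json

    data = json.dumps({'a': 'b'})
    # A header set on the Request is seen by the handler, so it won't
    # substitute application/x-www-form-urlencoded.
    request = urllib2.Request('http://localhost:8000', data,
                              {'Content-Type': 'application/json'})
    response = urllib2.urlopen(request)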

Select radiobutton with python urllib2

戏子无情 submitted on 2019-12-08 13:31:33
Question: I'm trying to select a radio button from this form with Python urllib2 and submit the form through a button:

    <div id="show-form2" class="show-form2">
    <input type="radio" name="2" value="21"/> OPTION <br/>
    <input type="radio" name="2" value="22"/> OPTION <br/>
    <input type="radio" name="2" value="23"/> OPTION <br/>
    <input type="radio" name="2" value="24"/> OPTION <br/>
    <input type="radio" name="2" value="25"/> OPTION <br/>
    <input type="radio" name="2" value="26"/> OPTION <br/>
    <input type="radio"
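
urllib2 has no notion of forms or buttons; "selecting" a radio button amounts to POSTing the field's name and the chosen value to the form's action URL. A sketch under that reading; the action URL is hypothetical, since the form's action attribute isn't shown, and other required fields may need to be included:

    import urllib
    import urllib2

    form_action = 'http://example.com/submit'   # hypothetical action URL
    form_data = urllib.urlencode({'2': '21'})   # name="2", chosen value "21"
    response = urllib2.urlopen(form_action, form_data)
    print response.read()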

urllib2 times out but doesn't close socket connection

↘锁芯ラ submitted on 2019-12-08 12:44:09
Question: I'm making a Python URL grabber program. For my purposes, I want it to time out really, really fast, so I'm doing urllib2.urlopen("http://.../", timeout=2). Of course it times out correctly, as it should. However, it doesn't bother to close the connection to the server, so the server thinks the client is still connected. How can I ask urllib2 to just close the connection after it times out? Running gc.collect() doesn't work and I'd like to not use httplib if I can't help it. The closest I can
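
Since urlopen raises on timeout without handing back anything that references the socket, one workaround, if httplib becomes unavoidable after all, is to hold the connection object yourself so close() can be called explicitly. A sketch with a placeholder host:

    import httplib
    import socket

    conn = httplib.HTTPConnection('example.com', timeout=2)  # placeholder host
    try:
        conn.request('GET', '/')
        body = conn.getresponse().read()
    except socket.timeout:
        body = None          # timed out; connection is torn down below
    finally:
        conn.close()         # close the socket either way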