urllib

Python 3.0 urllib.parse error “Type str doesn't support the buffer API”

Submitted by 浪尽此生 on 2019-11-27 03:46:31
Question:

File "/usr/local/lib/python3.0/cgi.py", line 477, in __init__
    self.read_urlencoded()
File "/usr/local/lib/python3.0/cgi.py", line 577, in read_urlencoded
    self.strict_parsing):
File "/usr/local/lib/python3.0/urllib/parse.py", line 377, in parse_qsl
    pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
TypeError: Type str doesn't support the buffer API

Can anybody direct me on how to avoid this? I'm getting it by feeding data into cgi.FieldStorage, and I can't seem to do it any…
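This error typically means bytes and str got mixed: in Python 3, a bytes object's methods such as split() reject str arguments. A minimal illustration of the mismatch and the usual fix, decoding the query string before parsing (not the asker's exact code):

import urllib.parse

raw = b"a=1&b=2"  # bytes, e.g. a request body read from a CGI stream
# In early Python 3, passing bytes into parse_qsl() raised
# "TypeError: Type str doesn't support the buffer API",
# because qs.split('&') handed a str separator to a bytes object.
pairs = urllib.parse.parse_qsl(raw.decode("utf-8"))  # decode to str first
print(pairs)  # [('a', '1'), ('b', '2')]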

Web crawler tutorial: urllib library crawlers, basic usage, timeout settings, and automatically simulating HTTP requests

Submitted by 天涯浪子 on 2019-11-27 03:44:56
Write a simple crawler with Python's built-in urllib library: urlopen() fetches the HTML source of a URL, read() reads out that source, and decode("utf-8") converts the bytes into a string.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
print(html)

Output (truncated):

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="csrf-param" content="_csrf">
<meta name="csrf-token" content="X1pZZnpKWnQAIGkLFisPFT4jLlJNIWMHHWM6HBBnbiwPbz4/LH1pWQ==">

Extracting specified page content with a regular expression (the excerpt cuts off here; a sketch of the likely continuation follows):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
…
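The regex step is truncated above; a minimal sketch of what such an extraction usually looks like, assuming the goal is to pull the page title out of the fetched HTML (the pattern is illustrative, not the post's original):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
# Illustrative pattern: grab whatever sits inside the <title> tag
titles = re.findall(r'<title>(.*?)</title>', html, re.S)
print(titles)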

Web crawler tutorial: urllib library crawlers, IP proxies, and combining user agents with IP proxies

Submitted by 有些话、适合烂在心里 on 2019-11-27 03:44:53
Using an IP proxy: ProxyHandler() formats the IP (the first parameter's key must match the request scheme, http or https), build_opener() initializes an opener with that proxy, and install_opener() makes the proxy global so that urlopen() uses it automatically.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib.request
import random  # import the random module

ip = "180.115.8.212:39109"
proxy = urllib.request.ProxyHandler({"https": ip})  # format the IP; note: the key must match the scheme, http or https
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)  # initialize an opener with the proxy
urllib.request.install_opener(opener)  # install the proxy globally so urlopen() requests use it automatically

# make the request
url = "https://www.baidu.com/"
data = urllib.request.urlopen(url).read().decode("utf-8")
print(data)
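The title promises combining a user agent with an IP proxy, but the excerpt stops before that part; a minimal sketch of one way to combine the two, with illustrative pools (the proxy address and UA strings are placeholders, not guaranteed to work):

import random
import urllib.request

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder UA strings
    "Mozilla/5.0 (X11; Linux x86_64)",
]
proxies = ["180.115.8.212:39109"]  # placeholder proxy pool

# Pick a random proxy and install it globally
proxy = urllib.request.ProxyHandler({"https": random.choice(proxies)})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
# Attach a random User-Agent header to every request this opener makes
opener.addheaders = [("User-Agent", random.choice(user_agents))]
urllib.request.install_opener(opener)

html = urllib.request.urlopen("https://www.baidu.com/").read().decode("utf-8")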

Changing User Agent in Python 3 for urllib.request.urlopen

Submitted by 左心房为你撑大大i on 2019-11-27 03:42:49
I want to open a URL using urllib.request.urlopen('someurl'):

with urllib.request.urlopen('someurl') as url:
    b = url.read()

I keep getting the following error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I understand the error to be due to the site not letting Python access it, to stop bots wasting their network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for how to change the user agent have been for urllib2, and I am using Python 3, so all the…
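For reference, the Python 3 counterpart of the old urllib2 recipes is to pass a headers dict to urllib.request.Request; a minimal sketch (the URL and User-Agent value are placeholders):

import urllib.request

req = urllib.request.Request(
    'http://example.com/',
    headers={'User-Agent': 'Mozilla/5.0'},  # placeholder UA string
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()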

How to know if urllib.urlretrieve succeeds?

Submitted by 徘徊边缘 on 2019-11-27 03:32:46
urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

returns silently even if abc.jpg doesn't exist on the google.com server; the generated abc.jpg is not a valid JPEG file, it's actually an HTML page. I guess the returned headers (an httplib.HTTPMessage instance) can be used to tell whether the retrieval succeeded or not, but I can't find any docs for httplib.HTTPMessage. Can anybody provide some information about this problem?
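One sanity check along the lines the asker suggests is to inspect the Content-Type in the returned headers (Python 2, matching the question; exact header handling may vary by Python version):

import urllib

filename, headers = urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
# headers is an httplib.HTTPMessage; getheader() comes from its rfc822.Message base
content_type = headers.getheader('Content-Type', '')
if not content_type.startswith('image/'):
    print('Probably not an image; the server sent:', content_type)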

How do I set headers using python's urllib?

Submitted by 我的梦境 on 2019-11-27 03:06:20
I am pretty new to Python's urllib. What I need to do is set a custom header for the request being sent to the server. Specifically, I need to set the Content-Type and Authorization headers. I have looked into the Python documentation, but I haven't been able to find it.

Corey Goldberg: adding HTTP headers using urllib2, from the docs:

import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
resp = urllib2.urlopen(req)
content = resp.read()

Cees Timmerman: For both Python 3 and Python 2, this works:

try:
    from urllib.request import …
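Cees Timmerman's answer is truncated mid-import above; the standard cross-version fallback it presumably continues with looks roughly like this (a sketch, not the verbatim answer; the header values are placeholders):

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

req = Request('http://www.example.com/')
req.add_header('Content-Type', 'application/json')
req.add_header('Authorization', 'Bearer <token>')  # placeholder credential
content = urlopen(req).read()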

Python3: JSON POST Request WITHOUT requests library

Submitted by 烈酒焚心 on 2019-11-27 02:32:01
Question: I want to send JSON-encoded data to a server using only native Python libraries. I love requests, but I simply can't use it, because I can't use it on the machine which runs the script. I need to do it without.

newConditions = {"con1":40, "con2":20, "con3":99, "con4":40, "password":"1234"}
params = urllib.parse.urlencode(newConditions)
params = params.encode('utf-8')
req = urllib.request.Request(conditionsSetURL, data=params)
urllib.request.urlopen(req)

My server is a local WAMP server. I…
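Note that the snippet above form-encodes the data rather than sending JSON. A minimal stdlib-only JSON POST looks roughly like this (the URL is a placeholder for the asker's conditionsSetURL):

import json
import urllib.request

newConditions = {"con1": 40, "con2": 20, "con3": 99, "con4": 40, "password": "1234"}
conditionsSetURL = "http://localhost/api/conditions"  # placeholder endpoint

body = json.dumps(newConditions).encode("utf-8")  # JSON-encode, then encode to bytes
req = urllib.request.Request(
    conditionsSetURL,
    data=body,
    headers={"Content-Type": "application/json"},  # tell the server it's JSON
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))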

Python error when using urllib.open

Submitted by 夙愿已清 on 2019-11-27 02:06:35
Question: When I run this:

import urllib
feed = urllib.urlopen("http://www.yahoo.com")
print feed

I get this output in the interactive window (PythonWin):

<addinfourl at 48213968 whose fp = <socket._fileobject object at 0x02E14070>>

I'm expecting to get the source of the above URL. I know this has worked on other computers (like the ones at school), but this is on my laptop and I'm not sure what the problem is here. Also, I don't understand this error at all. What does it mean? Addinfourl? fp? Please…
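What is printed there is the repr of the file-like response object, not its contents; reading the object yields the page source (Python 2, matching the question):

import urllib

feed = urllib.urlopen("http://www.yahoo.com")
# read() returns the body as a string; printing `feed` itself shows the
# addinfourl wrapper, whose fp attribute is the underlying socket file object
print feed.read()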

How can I un-shorten a URL using python?

Submitted by 我与影子孤独终老i on 2019-11-27 01:42:31
Question: I have seen this thread already: How can I unshorten a URL? My issue with the resolved answer there (which uses the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is used so heavily, almost 90% of the results come back as captchas, which I am unable to resolve. So far I am stuck with using:

def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)
    print resolvedURL.url

#t = Test()
#c = pycurl.Curl()
#c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' %…
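One API-free approach is to let urllib2 follow the redirects itself and read back the final URL (Python 2, matching the excerpt; the short link is a placeholder):

import urllib2

def unshorten_url(url):
    # urlopen follows HTTP redirects automatically;
    # geturl() reports the final, resolved URL
    resp = urllib2.urlopen(url)
    return resp.geturl()

print unshorten_url('http://bit.ly/example')  # placeholder short link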

What should I do if socket.setdefaulttimeout() is not working?

Submitted by 大憨熊 on 2019-11-27 01:21:54
I'm writing a (multi-threaded) script to retrieve contents from a website. The site's not very stable, so every now and then there's a hanging HTTP request that cannot even be timed out by socket.setdefaulttimeout(). Since I have no control over that website, the only thing I can do is improve my code, but I'm running out of ideas right now. Sample code:

socket.setdefaulttimeout(150)
MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)'}
Url = "http://example.com"
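When socket-level timeouts don't fire, one fallback is to enforce a wall-clock deadline outside the request, e.g. by running it in a worker thread and abandoning it if the join times out (a sketch under that assumption, not the asker's mechanize code):

import threading

def fetch_with_deadline(open_fn, url, deadline=150):
    # Run open_fn(url) in a worker thread; give up after `deadline` seconds.
    result = {}

    def worker():
        try:
            result["data"] = open_fn(url)
        except Exception as exc:  # the request itself may still fail normally
            result["error"] = exc

    t = threading.Thread(target=worker)
    t.daemon = True  # a hung request won't keep the process alive at exit
    t.start()
    t.join(deadline)
    if t.is_alive():
        raise RuntimeError("request still hanging after %s seconds" % deadline)
    if "error" in result:
        raise result["error"]
    return result["data"]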