urllib

Python 3.0 urllib.parse error “Type str doesn't support the buffer API”

Submitted by 浪尽此生 on 2019-11-27 03:46:31
Question:

File "/usr/local/lib/python3.0/cgi.py", line 477, in __init__
    self.read_urlencoded()
File "/usr/local/lib/python3.0/cgi.py", line 577, in read_urlencoded
    self.strict_parsing):
File "/usr/local/lib/python3.0/urllib/parse.py", line 377, in parse_qsl
    pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
TypeError: Type str doesn't support the buffer API

Can anybody direct me on how to avoid this? I'm getting it by feeding data into cgi.FieldStorage, and I can't seem to do it any…
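This error typically means bytes and str got mixed: in Python 3, a bytes object's methods such as split() reject str arguments. A minimal illustration of the mismatch and the usual fix, decoding the query string before parsing (not the asker's exact code):

import urllib.parse

raw = b"a=1&b=2"  # bytes, e.g. a request body read from a CGI stream
# In early Python 3, passing bytes into parse_qsl() raised
# "TypeError: Type str doesn't support the buffer API",
# because qs.split('&') handed a str separator to a bytes object.
pairs = urllib.parse.parse_qsl(raw.decode("utf-8"))  # decode to str first
print(pairs)  # [('a', '1'), ('b', '2')]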

Web crawler tutorial: urllib library crawlers, basic usage, timeout settings, and automatically simulating HTTP requests

Submitted by 天涯浪子 on 2019-11-27 03:44:56
Write a simple crawler with Python's built-in urllib library: urlopen() fetches the HTML source of a URL, read() reads out that source, and decode("utf-8") converts the bytes into a string.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
print(html)

Output (truncated):

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="csrf-param" content="_csrf">
<meta name="csrf-token" content="X1pZZnpKWnQAIGkLFisPFT4jLlJNIWMHHWM6HBBnbiwPbz4/LH1pWQ==">

Extracting specified page content with a regular expression (the excerpt cuts off here; a sketch of the likely continuation follows):

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib.request
…
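The regex step is truncated above; a minimal sketch of what such an extraction usually looks like, assuming the goal is to pull the page title out of the fetched HTML (the pattern is illustrative, not the post's original):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib.request

html = urllib.request.urlopen('http://edu.51cto.com/course/8360.html').read().decode("utf-8")
# Illustrative pattern: grab whatever sits inside the <title> tag
titles = re.findall(r'<title>(.*?)</title>', html, re.S)
print(titles)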

Web crawler tutorial: urllib library crawlers, IP proxies, and combining user agents with IP proxies

Submitted by 有些话、适合烂在心里 on 2019-11-27 03:44:53
Using an IP proxy: ProxyHandler() formats the IP (the first parameter's key must match the request scheme, http or https), build_opener() initializes an opener with that proxy, and install_opener() makes the proxy global so that urlopen() uses it automatically.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import urllib.request
import random  # import the random module

ip = "180.115.8.212:39109"
proxy = urllib.request.ProxyHandler({"https": ip})  # format the IP; note: the key must match the scheme, http or https
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)  # initialize an opener with the proxy
urllib.request.install_opener(opener)  # install the proxy globally so urlopen() requests use it automatically

# make the request
url = "https://www.baidu.com/"
data = urllib.request.urlopen(url).read().decode("utf-8")
print(data)
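The title promises combining a user agent with an IP proxy, but the excerpt stops before that part; a minimal sketch of one way to combine the two, with illustrative pools (the proxy address and UA strings are placeholders, not guaranteed to work):

import random
import urllib.request

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # placeholder UA strings
    "Mozilla/5.0 (X11; Linux x86_64)",
]
proxies = ["180.115.8.212:39109"]  # placeholder proxy pool

# Pick a random proxy and install it globally
proxy = urllib.request.ProxyHandler({"https": random.choice(proxies)})
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
# Attach a random User-Agent header to every request this opener makes
opener.addheaders = [("User-Agent", random.choice(user_agents))]
urllib.request.install_opener(opener)

html = urllib.request.urlopen("https://www.baidu.com/").read().decode("utf-8")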

Changing User Agent in Python 3 for urllib.request.urlopen

Submitted by 左心房为你撑大大i on 2019-11-27 03:42:49
I want to open a URL using urllib.request.urlopen('someurl'):

with urllib.request.urlopen('someurl') as url:
    b = url.read()

I keep getting the following error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I understand the error to be due to the site not letting Python access it, to stop bots wasting their network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for how to change the user agent have been for urllib2, and I am using Python 3, so all the…
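For reference, the Python 3 counterpart of the old urllib2 recipes is to pass a headers dict to urllib.request.Request; a minimal sketch (the URL and User-Agent value are placeholders):

import urllib.request

req = urllib.request.Request(
    'http://example.com/',
    headers={'User-Agent': 'Mozilla/5.0'},  # placeholder UA string
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()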

How to know if urllib.urlretrieve succeeds?

Submitted by 徘徊边缘 on 2019-11-27 03:32:46
urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

returns silently even if abc.jpg doesn't exist on the google.com server; the generated abc.jpg is not a valid JPEG file, it's actually an HTML page. I guess the returned headers (an httplib.HTTPMessage instance) can be used to tell whether the retrieval succeeded or not, but I can't find any docs for httplib.HTTPMessage. Can anybody provide some information about this problem?
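One sanity check along the lines the asker suggests is to inspect the Content-Type in the returned headers (Python 2, matching the question; exact header handling may vary by Python version):

import urllib

filename, headers = urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
# headers is an httplib.HTTPMessage; getheader() comes from its rfc822.Message base
content_type = headers.getheader('Content-Type', '')
if not content_type.startswith('image/'):
    print('Probably not an image; the server sent:', content_type)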

How do I set headers using python's urllib?

Submitted by 我的梦境 on 2019-11-27 03:06:20
I am pretty new to Python's urllib. What I need to do is set a custom header for the request being sent to the server. Specifically, I need to set the Content-Type and Authorization headers. I have looked into the Python documentation, but I haven't been able to find it.

Corey Goldberg: adding HTTP headers using urllib2, from the docs:

import urllib2
req = urllib2.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
resp = urllib2.urlopen(req)
content = resp.read()

Cees Timmerman: For both Python 3 and Python 2, this works:

try:
    from urllib.request import …
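Cees Timmerman's answer is truncated mid-import above; the standard cross-version fallback it presumably continues with looks roughly like this (a sketch, not the verbatim answer; the header values are placeholders):

try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

req = Request('http://www.example.com/')
req.add_header('Content-Type', 'application/json')
req.add_header('Authorization', 'Bearer <token>')  # placeholder credential
content = urlopen(req).read()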

Python3: JSON POST Request WITHOUT requests library

Submitted by 烈酒焚心 on 2019-11-27 02:32:01
Question: I want to send JSON-encoded data to a server using only native Python libraries. I love requests, but I simply can't use it, because I can't use it on the machine which runs the script. I need to do it without.

newConditions = {"con1":40, "con2":20, "con3":99, "con4":40, "password":"1234"}
params = urllib.parse.urlencode(newConditions)
params = params.encode('utf-8')
req = urllib.request.Request(conditionsSetURL, data=params)
urllib.request.urlopen(req)

My server is a local WAMP server. I…
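Note that the snippet above form-encodes the data rather than sending JSON. A minimal stdlib-only JSON POST looks roughly like this (the URL is a placeholder for the asker's conditionsSetURL):

import json
import urllib.request

newConditions = {"con1": 40, "con2": 20, "con3": 99, "con4": 40, "password": "1234"}
conditionsSetURL = "http://localhost/api/conditions"  # placeholder endpoint

body = json.dumps(newConditions).encode("utf-8")  # JSON-encode, then encode to bytes
req = urllib.request.Request(
    conditionsSetURL,
    data=body,
    headers={"Content-Type": "application/json"},  # tell the server it's JSON
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))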

Python error when using urllib.open

Submitted by 夙愿已清 on 2019-11-27 02:06:35
Question: When I run this:

import urllib
feed = urllib.urlopen("http://www.yahoo.com")
print feed

I get this output in the interactive window (PythonWin):

<addinfourl at 48213968 whose fp = <socket._fileobject object at 0x02E14070>>

I'm expecting to get the source of the above URL. I know this has worked on other computers (like the ones at school), but this is on my laptop and I'm not sure what the problem is here. Also, I don't understand this error at all. What does it mean? Addinfourl? fp? Please…
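What is printed there is the repr of the file-like response object, not its contents; reading the object yields the page source (Python 2, matching the question):

import urllib

feed = urllib.urlopen("http://www.yahoo.com")
# read() returns the body as a string; printing `feed` itself shows the
# addinfourl wrapper, whose fp attribute is the underlying socket file object
print feed.read()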

How can I un-shorten a URL using python?

Submitted by 我与影子孤独终老i on 2019-11-27 01:42:31
Question: I have seen this thread already: How can I unshorten a URL? My issue with the resolved answer there (which uses the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is used so heavily, almost 90% of the results come back as captchas, which I am unable to resolve. So far I am stuck with using:

def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)
    print resolvedURL.url

#t = Test()
#c = pycurl.Curl()
#c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' %…
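One API-free approach is to let urllib2 follow the redirects itself and read back the final URL (Python 2, matching the excerpt; the short link is a placeholder):

import urllib2

def unshorten_url(url):
    # urlopen follows HTTP redirects automatically;
    # geturl() reports the final, resolved URL
    resp = urllib2.urlopen(url)
    return resp.geturl()

print unshorten_url('http://bit.ly/example')  # placeholder short link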

What should I do if socket.setdefaulttimeout() is not working?

Submitted by 大憨熊 on 2019-11-27 01:21:54
I'm writing a (multi-threaded) script to retrieve contents from a website. The site's not very stable, so every now and then there's a hanging HTTP request that cannot even be timed out by socket.setdefaulttimeout(). Since I have no control over that website, the only thing I can do is improve my code, but I'm running out of ideas right now. Sample code:

socket.setdefaulttimeout(150)
MechBrowser = mechanize.Browser()
Header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)'}
Url = "http://example.com"
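When socket-level timeouts don't fire, one fallback is to enforce a wall-clock deadline outside the request, e.g. by running it in a worker thread and abandoning it if the join times out (a sketch under that assumption, not the asker's mechanize code):

import threading

def fetch_with_deadline(open_fn, url, deadline=150):
    # Run open_fn(url) in a worker thread; give up after `deadline` seconds.
    result = {}

    def worker():
        try:
            result["data"] = open_fn(url)
        except Exception as exc:  # the request itself may still fail normally
            result["error"] = exc

    t = threading.Thread(target=worker)
    t.daemon = True  # a hung request won't keep the process alive at exit
    t.start()
    t.join(deadline)
    if t.is_alive():
        raise RuntimeError("request still hanging after %s seconds" % deadline)
    if "error" in result:
        raise result["error"]
    return result["data"]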