urllib

python save image from url

Deadly submitted on 2019-11-28 15:40:01
Question: I have a problem when using Python to save an image from a URL, with either a urllib2 request or urllib.urlretrieve. The URL of the image is valid: I can download it manually in the browser. However, when I use Python to download the image, the resulting file cannot be opened. I use Mac OS Preview to view the image. Thank you! UPDATE: The code is as follows:

    def downloadImage(self):
        request = urllib2.Request(self.url)
        pic = urllib2.urlopen(request)
        print "downloading: " + self.url
        print self
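A frequent cause of an unopenable downloaded image is writing the response in text mode or a server rejecting Python's default User-Agent. A minimal Python 3 sketch (the URL and output path are placeholders, not from the question):

```python
import urllib.request

def save_image(url, path):
    """Fetch url and write the raw bytes to path in binary mode."""
    # A browser-like User-Agent helps with servers that block the default one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

Opening the output file with "wb" matters: a text-mode write can corrupt binary image data.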

Implementing an IP proxy pool for a Python crawler

旧城冷巷雨未停 submitted on 2019-11-28 14:16:58
Often, when crawling pages with multiple threads, or simply to evade anti-crawling measures, we need to go through proxy IPs. Below is a basic implementation. Many websites offer proxy-IP extraction as a service; reliability is roughly proportional to what you pay. Free IPs offered inside China are mostly unusable, and reliable domestic proxies are paid-only. Things are a bit better abroad, where some free IPs are fairly dependable. A quick web search turned up a page where, instead of crawling the IPs by hand as I had planned, I could directly download a ready-made txt file: http://www.thebigproxylist.com/ After downloading it, let's try crawling the Baidu front page through the different proxies:

    #!/usr/bin/env python
    # -*- coding:utf-8 -*-
    # Author: Yuan Li
    import re, urllib.request

    fp = open("c:\\temp\\thebigproxylist-17-12-20.txt", 'r')
    lines = fp.readlines()
    for ip in lines:
        try:
            print("Current proxy IP " + ip)
            proxy = urllib.request.ProxyHandler({"http": ip})
            opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
            urllib.request.install
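The snippet above is cut off while installing the opener; a self-contained sketch of the same rotation pattern, building one opener per proxy instead of installing a global one (the proxy addresses are placeholders):

```python
import urllib.request

def opener_for(proxy_ip):
    """Build an opener whose http traffic is routed through proxy_ip."""
    handler = urllib.request.ProxyHandler({"http": proxy_ip})
    return urllib.request.build_opener(handler)

proxies = ["203.0.113.1:8080", "203.0.113.2:3128"]  # placeholder list
for ip in proxies:
    opener = opener_for(ip)
    # opener.open("http://www.baidu.com", timeout=10) would fetch through ip;
    # wrap that call in try/except to skip dead proxies, as the original does.
```

Using a per-proxy opener avoids the global state of install_opener, which matters once multiple threads are involved.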

BeautifulSoup, where are you putting my HTML?

拟墨画扇 submitted on 2019-11-28 14:14:39
I'm using BS4 with Python 2.7. Here's the start of my code (thanks root):

    from bs4 import BeautifulSoup
    import urllib2
    f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
    html = f.read()
    soup = BeautifulSoup(html)

When I print html, its contents are the same as the source of the page viewed in Chrome. When I print soup, however, it cuts out the entire body and leaves me with this (the contents of the head tag):

    <!DOCTYPE html>
    <html>
    <head>
    <title>Browse Movie - YIFY Torrents</title>
    <meta charset="utf-8">
    <meta content="IE=9" http-equiv="X-UA-Compatible"/>
    <meta content="YIFY-Torrents
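When BeautifulSoup drops part of a page like this, the usual culprit is the underlying parser choking on malformed HTML; passing a parser explicitly, e.g. BeautifulSoup(html, 'html.parser'), or installing lxml, is the common fix. As an illustration of driving a parser directly with no third-party dependency, the stdlib parser can pull the same title out (the sample HTML below is a stand-in for the page source):

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collect the text inside the <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleGrabber()
p.feed("<html><head><title>Browse Movie - YIFY Torrents</title></head></html>")
```

After feed(), p.title holds the title text even if later markup is broken.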

urllib cannot read https

走远了吗. submitted on 2019-11-28 11:28:29
(Python 3.4.2) Would anyone be able to help me fetch https pages with urllib? I've spent hours trying to figure this out. Here's what I'm trying to do (pretty basic):

    import urllib.request
    url = "".join((baseurl, other_string, midurl, query))
    response = urllib.request.urlopen(url)
    html = response.read()

Here's my error output when I run it:

    File "./script.py", line 124, in <module>
        response = urllib.request.urlopen(url)
    File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
        return opener.open(url, data, timeout)
    File "/usr/lib/python3.4/urllib/request.py", line 455, in open
        response
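The traceback is cut off before the actual exception, but with urllib and HTTPS the usual suspects are an interpreter built without the ssl module or a certificate verification failure. A sketch that makes the SSL context explicit, which at least surfaces the real error clearly:

```python
import ssl
import urllib.request

# Build an HTTPS-capable opener with an explicit, verifying SSL context.
ctx = ssl.create_default_context()
opener = urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))
# opener.open("https://example.com") would now verify the certificate
# against the system CA store and raise a descriptive SSLError on failure.
```

If ssl itself fails to import, the interpreter was built without OpenSSL support, and no urllib-level change will help.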

Python authenticate and launch private page using webbrowser, urllib and CookieJar

萝らか妹 submitted on 2019-11-28 11:07:50
Question: I want to log in with a cookiejar and launch not the login page but a page that can only be seen after authenticating. I know mechanize does that, but besides not working for me right now, I'd rather do this without it. Now I have:

    import urllib, urllib2, cookielib, webbrowser
    from cookielib import CookieJar
    username = 'my_username'
    password = 'my_password'
    url = 'my_login_page'
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    login_data = urllib.urlencode({'my
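In Python 3 the same pattern uses http.cookiejar and urllib.request; a minimal sketch (the form field names and login URL are placeholders for whatever the site expects):

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# POSTing the form stores the session cookie in cj; subsequent
# opener.open() calls on protected pages send it automatically.
login_data = urllib.parse.urlencode(
    {"username": "my_username", "password": "my_password"}
).encode("utf-8")
# opener.open("https://example.com/login", data=login_data)
# opener.open("https://example.com/members-only")
```

Note that webbrowser launches a separate browser process with its own cookie store, so the authenticated session cannot be handed to it; the protected page has to be fetched through the same opener.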

Python 3.0 urllib.parse error “Type str doesn't support the buffer API”

你。 submitted on 2019-11-28 10:44:35
    File "/usr/local/lib/python3.0/cgi.py", line 477, in __init__
        self.read_urlencoded()
    File "/usr/local/lib/python3.0/cgi.py", line 577, in read_urlencoded
        self.strict_parsing):
    File "/usr/local/lib/python3.0/urllib/parse.py", line 377, in parse_qsl
        pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
    TypeError: Type str doesn't support the buffer API

Can anybody direct me on how to avoid this? I'm getting it by feeding data into cgi.FieldStorage, and I can't seem to do it any other way. urllib is trying to do:

    b'a,b'.split(',')

which doesn't work. Byte strings and unicode
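The mismatch is between bytes and str: in Python 3, a bytes object can only be split on a bytes separator. A small demonstration of the failure and both fixes:

```python
data = b"a,b"

# Splitting bytes with a str separator raises the error from the traceback.
try:
    data.split(",")
except TypeError:
    pass  # "Type str doesn't support the buffer API" on early Python 3

# Either keep everything as bytes...
assert data.split(b",") == [b"a", b"b"]
# ...or decode to str first and split normally.
assert data.decode("utf-8").split(",") == ["a", "b"]
```

For the cgi.FieldStorage case, this means decoding the incoming query string to str before handing it over, so parse_qsl splits str with str.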

python httplib/urllib get filename

这一生的挚爱 submitted on 2019-11-28 10:09:31
Is there a possibility to get the filename, e.g. xyz.com/blafoo/showall.html, if you work with urllib or httplib? So that I can save the file under the filename on the server? If you go to sites like xyz.com/blafoo/, you can't see the filename. Thank you.

To get the filename from the response HTTP headers:

    import cgi
    response = urllib2.urlopen(URL)
    _, params = cgi.parse_header(response.headers.get('Content-Disposition', ''))
    filename = params['filename']

To get the filename from the URL:

    import posixpath
    import urlparse
    path = urlparse.urlsplit(URL).path
    filename = posixpath.basename(path)

Does not make much
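The same URL-based approach in Python 3, where urlparse moved into urllib.parse (the example URL is the one from the question):

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last component of a URL's path, or '' if it ends in '/'."""
    return posixpath.basename(urlsplit(url).path)
```

filename_from_url("http://xyz.com/blafoo/showall.html") gives "showall.html", while a URL ending in "/" yields an empty string, which is exactly when the Content-Disposition header approach is the fallback.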

Crawler study notes, day 2 (the urllib library)

烈酒焚心 submitted on 2019-11-28 08:20:56
1. The urllib library (Python's built-in HTTP request library) has four modules: request (simulates sending requests); error (exception handling); parse (a utility module for processing URLs); robotparser (parses a site's robots.txt file).

1.1 Official manual: https://docs.python.org/3/library/urllib.request.html. The request module's urlopen() method fetches a site and returns an object of type HTTPResponse, which has methods such as read(), readinto(), getheader(name), getheaders(), and fileno(), and attributes such as msg, version, status, reason, debuglevel, and closed. The urlopen API: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None). The Request class returns an object of type Request: class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
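The notes above can be exercised without any network call by inspecting a Request object (the URL is the manual page the notes link to; the header is an illustration):

```python
import urllib.request

req = urllib.request.Request(
    "https://docs.python.org/3/library/urllib.request.html",
    headers={"User-Agent": "Mozilla/5.0"},
    method="GET",
)
# urllib.request.urlopen(req) would return an HTTPResponse whose
# status, reason, getheaders() and read() match the attributes listed above.
```

Passing a Request instead of a bare URL is what allows setting headers, the method, and the body (data) before the request is sent.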

How can I un-shorten a URL using python?

别等时光非礼了梦想. submitted on 2019-11-28 07:04:35
I have seen this thread already - How can I unshorten a URL? My issue with the accepted answer there (which uses the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is so heavily used, almost 90% of the results come back with captchas, which I am unable to resolve. So far I am stuck with using:

    def unshorten_url(url):
        resolvedURL = urllib2.urlopen(url)
        print resolvedURL.url

    #t = Test()
    #c = pycurl.Curl()
    #c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
    #c.setopt(c.WRITEFUNCTION, t.body_callback)
    #c.perform()
    #c.close()
    #dom = xml.dom.minidom
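urlopen already follows HTTP redirects, so the final URL is available locally without any third-party API; a Python 3 sketch of the same idea as the snippet above:

```python
import urllib.request

def unshorten(url):
    """Follow HTTP redirects and return the final resolved URL."""
    with urllib.request.urlopen(url) as resp:
        return resp.geturl()
```

For dead or looping short links, urlopen raises HTTPError or URLError, which callers should catch; since everything happens client-side, there is no captcha to hit.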

How to extract tables from websites in Python

杀马特。学长 韩版系。学妹 submitted on 2019-11-28 05:35:26
Here, http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500, there is a table. My goal is to extract the table and save it to a csv file. I wrote this code:

    import urllib
    import os
    web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
    s = web.read()
    web.close()
    ff = open(r"D:\ex\python_ex\urllib\output.txt", "w")
    ff.write(s)
    ff.close()

I'm lost from here. Can anyone help with this? Thanks!

So essentially you want to parse the HTML file to get elements out of it. You can use BeautifulSoup or lxml for this
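A minimal sketch of the parse-then-write-csv step using only the stdlib (the sample table below is a stand-in for the downloaded page; real pages like the FFIEC report need the same idea applied to their actual markup, where BeautifulSoup or lxml is more robust):

```python
import csv
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect rows of cell text from <tr>/<td>/<th> tags."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

p = TableParser()
p.feed("<table><tr><th>Year</th><th>State</th></tr>"
       "<tr><td>2011</td><td>01</td></tr></table>")
with open("output.csv", "w", newline="") as f:
    csv.writer(f).writerows(p.rows)
```

The rows land in p.rows as lists of strings, so the csv module can write them directly instead of dumping the raw HTML to a txt file.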