url | 易学教程

Batch-downloading PDF documents with a Python crawler

Update: the earlier code was written for Python 2. For Python 3 code, see this blogger's post: https://blog.csdn.net/baidu_28479651/article/details/76158051

The code is as follows:

    # -*- coding: utf-8 -*-
    # Crawl Li Dongfeng's PDF documents from:
    # http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/index.htm
    import urllib.request
    import re
    import os

    # Open the URL and read the raw page bytes.
    def getHtml(url):
        page = urllib.request.urlopen(url)
        html = page.read()
        page.close()
        return html

    # Compile the regular expression and find all the PDF links we need.
    def getUrl(html):
        reg = r'(?:href|HREF)="?((?:http://)?.+?\.pdf)'
        url_re = re.compile(reg)
        url_lst = url_re.findall(html.decode('gb2312'))
        return url_lst

    def …
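The excerpt breaks off before the download step. What follows is a minimal sketch of how the remaining work might look, not the author's code: the names `extract_pdf_urls` and `download_all` and the `pdf_download` directory are hypothetical, and the link pattern is a simplified variant of the one in the excerpt. It resolves relative links against the index page with `urljoin` and saves each file with `urllib.request.urlretrieve`.

```python
import os
import re
import urllib.request
from urllib.parse import urljoin

# Simplified link pattern (an assumption): href value ending in .pdf,
# quoted or unquoted, matched case-insensitively.
PDF_LINK = re.compile(r'href="?([^"\s>]+?\.pdf)', re.IGNORECASE)


def extract_pdf_urls(html_text, base_url):
    """Return absolute URLs for every .pdf link, in page order, without duplicates."""
    urls = []
    for link in PDF_LINK.findall(html_text):
        absolute = urljoin(base_url, link)  # resolves relative links
        if absolute not in urls:
            urls.append(absolute)
    return urls


def download_all(urls, target_dir="pdf_download"):
    """Download each URL into target_dir (hypothetical directory name)."""
    os.makedirs(target_dir, exist_ok=True)
    for url in urls:
        filename = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        urllib.request.urlretrieve(url, filename)
```

In use, the page bytes returned by `getHtml` would be decoded (the excerpt uses `gb2312`) and passed to `extract_pdf_urls` together with the index URL, and the resulting list handed to `download_all`.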

A bug caused by how Tomcat responds to HTTPS requests

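Problem description: I was given a simple requirement: redirect HTTPS requests to one address and HTTP requests to another. The code was written quickly and testing began.

1. The bug appears
To debug, I logged the request URL:

    String url = request.getRequestURL().toString();
    logger.info("url:" + url);

For HTTPS requests, the logged URL had lost its port number and "https" was gone too; it showed an http request instead. So it was back to bug fixing.

2. Locating the problem
The problem was fairly easy to locate. A request reaches the server only after being forwarded by Nginx, and the URL had changed, so the Nginx forwarding had to be at fault. I asked ops to look at the Nginx configuration. Investigation confirmed this was the cause, so the configuration file was changed.

After the fix I wanted to test locally, so I started Tomcat on my machine and sent an HTTPS request. There was no response at all, while HTTP requests were answered normally. What was going on? A quick search made it clear: Tomcat needs a certificate configured before it can answer HTTPS requests. Thinking back to the original problem, everything fell into place. Ops would not normally install certificates on each server's Tomcat, because servers sometimes need to be scaled out and a certificate could easily be forgotten. In other words, Nginx handles HTTPS and forwards plain HTTP to Tomcat, which is why Tomcat sees an http URL. I checked with ops, and they confirmed the guess. Another piece of supporting evidence: the HTTPS requests I sent never had a certificate added on the client side, and no warning ever appeared.

3. Knowledge summary
Differences between HTTP and HTTPS

Source: CSDN Author: 腊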
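The excerpt does not say which Nginx setting was changed. One common arrangement, assumed here and not taken from the article, is for the proxy to pass the original scheme in an `X-Forwarded-Proto` header (in Nginx, `proxy_set_header X-Forwarded-Proto $scheme;`) so the backend can tell HTTPS from HTTP even though it only ever receives plain HTTP. The sketch below shows that decision in Python; `original_scheme`, `redirect_target`, and the example URLs are hypothetical names.

```python
def original_scheme(headers, wire_scheme):
    """Return the scheme the client used, preferring X-Forwarded-Proto.

    headers: mapping of request header names to values.
    wire_scheme: the scheme the backend itself saw (usually "http" behind a proxy).
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-proto", "")
    # With chained proxies the header may hold a comma-separated list;
    # the first entry is the one closest to the client.
    first = forwarded.split(",")[0].strip().lower()
    return first if first in ("http", "https") else wire_scheme


def redirect_target(headers, wire_scheme, https_target, http_target):
    """Pick the redirect address based on the client's original scheme."""
    if original_scheme(headers, wire_scheme) == "https":
        return https_target
    return http_target
```

The header is only trustworthy when the backend is reachable solely through the proxy; otherwise a client could set it directly.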

Understanding the modules of a Python crawler

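URL manager:
Manages the URLs waiting to be crawled and the URLs already crawled, preventing duplicate crawling and circular crawling. The five minimal functions of a URL manager:
    1. Add a URL to the container
    2. Get a URL
    3. Check whether a URL is already in the container
    4. Check whether any URLs remain to be crawled
    5. Move a URL from the to-crawl set to the crawled set

Web page downloader:
The downloader is a core component of a crawler. It saves the web page behind a URL to local storage as HTML. There are currently two downloaders: 1. urllib2 (part of the Python 2 standard library; in Python 3 its functionality lives in urllib.request, which the code here uses) 2. requests (a third-party library)
Three ways to download a page with urllib2:
    1. Simple download
    2. Add data and an HTTP header
    3. Add handlers for special scenarios

    import http.cookiejar
    import urllib.request

    url = "http://www.baidu.com"

    print("one")
    response1 = urllib.request.urlopen(url)
    print(response1.getcode())
    print(len(response1.read()))

    print("two")
    request = urllib.request.Request(url)
    request.add_header("user-agent", "Mozilla/5.0" …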
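The five URL-manager functions listed in this excerpt can be sketched with two sets. This is an illustrative sketch only; the class and method names (`UrlManager`, `add_new_url`, and so on) are hypothetical and do not come from the article. Functions 2 and 5 are combined here: fetching a URL also moves it to the crawled set.

```python
class UrlManager:
    """Tracks URLs still to crawl and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()  # waiting to be crawled
        self.old_urls = set()  # already crawled

    def contains(self, url):
        """3. Is the URL already known, crawled or not?"""
        return url in self.new_urls or url in self.old_urls

    def add_new_url(self, url):
        """1. Add a URL unless it is empty or already known."""
        if url and not self.contains(url):
            self.new_urls.add(url)

    def has_new_url(self):
        """4. Are there URLs left to crawl?"""
        return bool(self.new_urls)

    def get_new_url(self):
        """2 and 5. Take one pending URL and mark it as crawled."""
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Using sets keeps the membership check cheap and makes duplicates impossible, which is what prevents repeated and circular crawling; the trade-off is that crawl order is not preserved.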

Front-end data exchange: JSON & Ajax

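1. json
JSON stands for JavaScript Object Notation. Here, JSON refers to a data format that resembles JavaScript objects.
Purpose of JSON: passing data between different system platforms or different programming languages.

1.1 JSON syntax
A JSON object resembles a JavaScript object, but the values of its keys cannot be functions or methods. A value can be a plain value such as a string or a number, undefined is not supported, and a value can also be an array or another JSON object.

    // JSON object format:
    { "name":"tom", "age":18 }
    // JSON array format:
    ["tom",18,"programmer"]

More complex JSON data can combine the object and array forms.

    {
        "name":"小明",
        "age":200,
        "fav":["code","eat","swim","read"],
        "son":{
            "name":"小小明",
            "age":100
        }
    }

    // An array structure can also be used as JSON to transmit data.

JSON data can be saved in a .json file, which generally contains just one JSON object.

Summary:
1. The file extension for JSON is .json
2. A JSON file generally stores a single JSON data object
3.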
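The syntax rules above can be checked with a strict parser. The sketch below uses Python's standard `json` module rather than JavaScript, purely as an illustration; `is_valid_json` and `can_serialize` are hypothetical helper names. It parses the nested example, shows that a trailing comma after the last member is rejected, and shows that a function cannot be stored as a value.

```python
import json

# The nested example from the text, written as strict JSON.
TEXT = """
{
    "name": "小明",
    "age": 200,
    "fav": ["code", "eat", "swim", "read"],
    "son": {"name": "小小明", "age": 100}
}
"""

data = json.loads(TEXT)  # objects become dicts, arrays become lists


def is_valid_json(text):
    """Return True if text parses as strict JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def can_serialize(value):
    """Return True if value can be written out as JSON."""
    try:
        json.dumps(value)
    except TypeError:
        return False
    return True
```

Here `is_valid_json('{"age": 100,}')` is false because of the trailing comma, and `can_serialize({"f": len})` is false because a function is not a JSON value. The `//` comment lines in the listings above are annotations for the reader; a strict parser rejects them as well.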

PHP if string contains URL isolate it

Question
In PHP, I need to be able to figure out if a string contains a URL. If there is a URL, I need to isolate it as another separate string. For example: "SESAC showin the Love! http://twitpic.com/1uk7fi" I need to be able to isolate the URL in that string into a new string. At the same time the URL needs to be kept intact in the original string. Follow? I know this is probably really simple but it's killing me.

Answer 1:
Something like

    preg_match('/[a-zA-Z]+:\/\/[0-9a-zA-Z;.\/?:@=_#&%~,+$]+/', $string
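For comparison, here is the same pattern from Answer 1 applied with Python's `re` module instead of PHP's `preg_match`. This is a sketch for illustration; `first_url` is a hypothetical helper name. Searching does not modify the input, so the URL stays intact in the original string while a copy is returned.

```python
import re

# Same character class as the PHP answer: scheme, "://", then URL characters.
URL_PATTERN = re.compile(r'[a-zA-Z]+://[0-9a-zA-Z;./?:@=_#&%~,+$]+')


def first_url(text):
    """Return the first URL found in text, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None
```

Called on the question's example string, `first_url` returns `http://twitpic.com/1uk7fi`. Note that the character class contains no hyphen, so a URL such as `http://my-site.example/` would be cut off at the first `-`.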

Crawl workflow for the overseas e-commerce site Snapdeal

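Home page crawl
1. From the home page, get the URL of each category, for example one of the URLs under all_offers: https://www.snapdeal.com/products/men-apparel-shirts?sort=plrty
2. Visit that URL to get bcrumbLabelId. It is rendered by simple JS and can be found in id="labelId" value="(.*?) . If no value can be extracted, the page is a coupon page or some other kind of page.
3. Reassemble the URL as http://www.snapdeal.com/acors/json/product/get/search/{bcrumbLabelId}/0/20 , where 0 is the start offset and 20 is the number of items to fetch (fixed).
4. This returns a product page, not JSON. In it, <div class="jsNumberFound hidden">(.*?)</div> gives the total number of items. If the start offset < total and the start offset + 20 > total, the page returns total - start offset items. If the start offset > total, one tag in the returned HTML page has the value
5. The detail-page URLs can be extracted with regular expressions or XPath.
6. Visit a detail-page URL. The data returned matches what the page displays.

Keyword search
1. Visit https://www.snapdeal.com/search?keyword={search term}
2. Batch http://www.snapdeal.com/acors/json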
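The offset arithmetic in steps 2 to 4 of the home page crawl can be sketched as follows. This is an illustration under stated assumptions, not a tested Snapdeal client: the function names are hypothetical, the closing quote in the `labelId` pattern is assumed (the excerpt's pattern is cut off before it), and the total is assumed to be a plain integer. No request is made; the code only builds the URLs and the number of items each page should hold.

```python
import re

SEARCH_TEMPLATE = (
    "http://www.snapdeal.com/acors/json/product/get/search/"
    "{label_id}/{start}/{count}"
)
LABEL_ID = re.compile(r'id="labelId" value="(.*?)"')  # closing quote assumed
TOTAL = re.compile(r'<div class="jsNumberFound hidden">(.*?)</div>')
PAGE_SIZE = 20  # fixed, per step 3


def extract_label_id(html):
    """Step 2: return bcrumbLabelId, or None for coupon and other pages."""
    match = LABEL_ID.search(html)
    return match.group(1) if match and match.group(1) else None


def extract_total(html):
    """Step 4: return the total item count, or None if the tag is missing."""
    match = TOTAL.search(html)
    return int(match.group(1).strip()) if match else None


def page_requests(label_id, total, page_size=PAGE_SIZE):
    """Return (url, expected_item_count) for every page of results."""
    pages = []
    for start in range(0, total, page_size):
        expected = min(page_size, total - start)  # last page may be short
        url = SEARCH_TEMPLATE.format(
            label_id=label_id, start=start, count=page_size
        )
        pages.append((url, expected))
    return pages
```

Stopping the loop before the start offset reaches the total avoids the out-of-range case described at the end of step 4.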

Running a thread with parameters

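In multi-threaded or single-threaded tasks, passing arguments into a thread has always been a nuisance. One common approach is to pass them through the fields of a class or object. That approach is easy to understand but awkward in some situations, so it is not covered here. Instead, we introduce a way of running a thread with a parameter that was added in .NET 2.0. The sample program is as follows:

    ParameterizedThreadStart ParStart = new ParameterizedThreadStart(ThreadMethod);
    Thread myThread = new Thread(ParStart);
    object o = "hello";
    myThread.Start(o);

ThreadMethod is as follows:

    public void ThreadMethod(object ParObject)
    {
        // program code
    }

For multiple parameters, pack them into an array, a dynamic list, or a similar container, pass that as the object, and cast it back when it is used. Isn't that much simpler?

-----------------------------------------------------------------------------------
[Repost] Personally, I think it is still better to create a separate class for the thread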
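As a point of comparison outside .NET, Python's standard `threading.Thread` takes the arguments for the thread function through its `args` tuple. The sketch below mirrors the two cases in the text: a single "hello" argument, and several values packed into one list. The names `thread_method` and `run_demo` are hypothetical, and this is a Python analogue rather than the .NET API itself.

```python
import threading


def thread_method(par_object, sink):
    """Thread body: record what was received."""
    sink.append("received: " + str(par_object))


def run_demo():
    """Run one thread with a single value and one with a packed list."""
    sink = []
    single = threading.Thread(target=thread_method, args=("hello", sink))
    packed = threading.Thread(
        target=thread_method, args=(["a", 1, 2.5], sink)
    )
    # Start and join one at a time so the output order is deterministic.
    single.start()
    single.join()
    packed.start()
    packed.join()
    return sink
```

Because `args` accepts any number of values, packing into a container is optional in Python; it is shown here only to parallel the single-object signature of `ParameterizedThreadStart`.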

In ASP.NET, why is there UrlEncode() AND UrlPathEncode()?

Question
In a recent project, I had the pleasure of troubleshooting a bug that involved images not loading when spaces were in the filename. I thought "What a simple issue, I'll UrlEncode() it!" But, NAY! Simply using UrlEncode() didn't resolve the problem. The new problem was the HttpUtility.UrlEncode() method switched spaces ( ) to plusses ( + ) instead of %20 like the browser wanted. So file+image+name.jpg would return not-found while file%20image%20name.jpg was found correctly. Thankfully, a
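The same distinction between path encoding and form/query encoding exists in other standard libraries. As an illustration in Python (not the ASP.NET API), `urllib.parse.quote` produces `%20` for spaces, which is what a URL path needs, while `urllib.parse.quote_plus` produces `+`, which is only meaningful in form-encoded query strings. The wrapper names below are hypothetical.

```python
from urllib.parse import quote, quote_plus


def encode_path_segment(name):
    """Percent-encode one path segment; spaces become %20."""
    return quote(name, safe="")


def encode_query_value(value):
    """Form-encode a query-string value; spaces become +."""
    return quote_plus(value)
```

For the filename in the question, `encode_path_segment("file image name.jpg")` gives `file%20image%20name.jpg` and `encode_query_value("file image name.jpg")` gives `file+image+name.jpg`. A server resolving a file path treats `+` as a literal plus sign, which is why the second form returns not-found.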

Learning PhantomJS

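PhantomJS quick start

This article briefly introduces the basics of PhantomJS, mainly covering what PhantomJS is, downloading and installing it, a HelloWorld program, and an overview of its core modules. My knowledge is limited, so omissions are inevitable; corrections and discussion are welcome.

1. What is PhantomJS?
PhantomJS is a WebKit-based headless browser driven through a JavaScript API. It uses QtWebKit for its core browser functionality and uses WebKit to compile, interpret, and execute JavaScript code. Anything you can do in a WebKit-based browser, it can do as well. It is not only an invisible browser that provides CSS selectors, support for web standards, DOM manipulation, JSON, HTML5, Canvas, SVG, and so on; it also provides file I/O operations, so you can read and write files on the operating system. PhantomJS has a very wide range of uses, such as network monitoring, web page screenshots, web testing without a browser, and automated page access.

PhantomJS official site: http://phantomjs.org/
PhantomJS official API: http://phantomjs.org/api/
PhantomJS official examples: http://phantomjs.org/examples/
PhantomJS GitHub: https://github.com/ariya/phantomjs/

2. Downloading and installing PhantomJS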

URL encode variable in Jmeter

Question
I need to encode a variable in JMeter, but it isn't a parameter. For example:

URL path: /folder/guest/id;token=${token}/profile?details=yes

I want to encode the ${token} variable, and only the token variable. I know that you can select encode in the parameters section, but this isn't a parameter. Does anyone know how to do this?

Answer 1:
JMeter as of version 2.10 now includes a urlencode function.

    ${__urlencode(${token})}

See http://jmeter.apache.org/usermanual/functions.html

Answer 2:
The best way I
