scrapy

Improving Scrapy Crawling Efficiency

回眸只為那壹抹淺笑 submitted on 2020-01-18 00:25:38
1. Increase concurrency: by default Scrapy's maximum concurrency (CONCURRENT_REQUESTS) is 16 requests, which can be raised as appropriate. In the settings file, set CONCURRENT_REQUESTS = 100 to allow 100 concurrent requests.
2. Lower the log level: running Scrapy produces a large amount of log output; to reduce CPU usage, restrict logging to INFO or ERROR messages. In the settings file: LOG_LEVEL = 'INFO'
3. Disable cookies: if you do not actually need cookies, disable them during the crawl to cut CPU usage and speed up crawling. In the settings file: COOKIES_ENABLED = False
4. Disable retries: re-requesting (retrying) failed HTTP requests slows the crawl down, so retries can be turned off. In the settings file: RETRY_ENABLED = False
5. Reduce the download timeout: when crawling very slow links, a shorter download timeout lets stuck requests be dropped quickly, improving efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 sets a 10-second timeout. All five settings are collected in the sketch below.

Source: https://www.cnblogs.com/yzg-14/p/12207888.html
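A minimal settings.py sketch collecting the five options above; all names are Scrapy's documented settings, and the values are the ones suggested in the post:

# settings.py -- tuning options from the post above
CONCURRENT_REQUESTS = 100   # raise concurrency from the default of 16
LOG_LEVEL = 'INFO'          # cut log noise ('ERROR' for even less)
COOKIES_ENABLED = False     # skip cookie handling when it is not needed
RETRY_ENABLED = False       # do not re-request failed pages
DOWNLOAD_TIMEOUT = 10       # give up on a download after 10 seconds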

Simulating Login in Scrapy

北战南征 submitted on 2020-01-18 00:07:24
1. Why simulate a login? # To obtain cookies, so that pages behind the login can be crawled.
2. Review: how does requests simulate a login? # 1. Request the page carrying cookies directly. # 2. Find the endpoint, send a POST request, and store the cookies.
3. How does selenium simulate a login? # Locate the matching input tags, type in the text, and click Login.
4. Scrapy, then, likewise has two ways to simulate a login: # 1. Carry the cookies directly. # 2. Find the URL that the POST request goes to and send the request with the login data (see the sketch after this list).

Source: https://www.cnblogs.com/yzg-14/p/12207953.html
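A minimal sketch of the second approach; the login URL and form field names below are placeholders, not from the original post:

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_demo'
    start_urls = ['https://example.com/login']  # hypothetical login page

    def parse(self, response):
        # POST the login form; the field names are assumptions.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'pass'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy keeps the session cookies automatically, so requests made
        # from here on are sent as the logged-in user.
        yield scrapy.Request('https://example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        pass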

Launching One or More Scrapy Spiders through the Core API

让人想犯罪 __ submitted on 2020-01-17 23:55:18
1. Instead of the typical way of running Scrapy with scrapy crawl, you can run it from a script through the API. Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor; two APIs are available for running one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.

2. The first utility for launching spiders is scrapy.crawler.CrawlerProcess. This class starts the Twisted reactor for you, configures logging, and sets up shutdown handlers; it is the class that all Scrapy commands use.

Example of running a single spider:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()  # the script will block here
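For the second API, a sketch along the lines of Scrapy's documentation; CrawlerRunner leaves reactor and logging management to the calling script (the spider below is a placeholder):

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['https://example.com']  # placeholder URL

    def parse(self, response):
        pass

# Unlike CrawlerProcess, CrawlerRunner does not start the reactor or
# configure logging itself, so the script must do both.
configure_logging()
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl ends
reactor.run()  # blocks here until the crawl finishes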

CSV files are empty even though items are scraped from the site

…衆ロ難τιáo~ submitted on 2020-01-17 20:06:46
Question: My requirement is to dump scraped items to two different CSV files. I'm able to scrape the data, but the CSV files are empty. Could anyone please help in this regard? Below is the code for the pipeline.py file and the console logs.

Code for pipeline.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import CsvItemExporter
from scrapy
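The question is cut off above, but a minimal sketch of a pipeline exporting items to two CSV files with CsvItemExporter could look like the following; the file names and the routing rule are assumptions, not taken from the question:

from scrapy.exporters import CsvItemExporter

class TwoCsvPipeline:
    def open_spider(self, spider):
        # One exporter per output file; without start_exporting()
        # (or with the pipeline missing from ITEM_PIPELINES) the
        # CSV files come out empty.
        self.file_a = open('items_a.csv', 'wb')
        self.file_b = open('items_b.csv', 'wb')
        self.exporter_a = CsvItemExporter(self.file_a)
        self.exporter_b = CsvItemExporter(self.file_b)
        self.exporter_a.start_exporting()
        self.exporter_b.start_exporting()

    def process_item(self, item, spider):
        # Hypothetical routing rule: items whose 'category' field is 'a'
        # go to the first file, everything else to the second.
        if item.get('category') == 'a':
            self.exporter_a.export_item(item)
        else:
            self.exporter_b.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter_a.finish_exporting()
        self.exporter_b.finish_exporting()
        self.file_a.close()
        self.file_b.close()

Registering the pipeline in settings, e.g. ITEM_PIPELINES = {'myproject.pipelines.TwoCsvPipeline': 300}, is the other step that is easy to miss.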

Pro-Football-Reference Team Stats XPath

血红的双手。 submitted on 2020-01-17 05:20:10
Question: I am using the Scrapy shell on this page (Pittsburgh Steelers at New England Patriots - September 10th, 2015) to pull individual team stats. For example, I want to pull the away team's total yards (464); inspecting the element and copying the XPath yields //*[@id="team_stats"]/tbody/tr[5]/td[1], but when I run response.xpath('//*[@id="team_stats"]/tbody/tr[5]/td[1]') nothing is returned. I noticed that this table is in a separate div from the initial data, so I'm not sure if I need
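The question breaks off here. Two common causes of empty results on pages like this are worth noting as possibilities: the tbody in a browser-copied XPath is often inserted by the browser and absent from the raw HTML, and stats tables on this site are frequently wrapped in HTML comments. A sketch of how to check both, with the selector details assumed from the truncated question rather than confirmed:

# First try the XPath without the browser-added tbody:
response.xpath('//*[@id="team_stats"]//tr[5]/td[1]/text()').get()

# If the table is hidden inside an HTML comment, re-parse the comment text:
from scrapy.selector import Selector

for c in response.xpath('//comment()').getall():
    if 'team_stats' in c:
        # Strip the comment markers and select inside the revealed table.
        inner = Selector(text=c.replace('<!--', '').replace('-->', ''))
        total_yards = inner.xpath('//*[@id="team_stats"]//tr[5]/td[1]/text()').get()
        break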

Scrapy Simulated Login to Zhihu -- Scraping Trending Topics

这一生的挚爱 submitted on 2020-01-17 05:17:34
Preparation

Before starting, make sure scrapy is installed correctly and keep a lean yet powerful browser at hand; if you also use Postman, even better.

scrapy genspider zhihu www.zhihu.com

Generate the Zhihu spider with the command above; the code is as follows:

# -*- coding: utf-8 -*-
import scrapy

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    def parse(self, response):
        pass

One thing to keep firmly in mind: do not forget to enable cookies:

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

Simulated login

The process goes as follows: open the login page and capture the Header and Cookie information. Complete Header information disguises the crawler as well as possible, while valid Cookie information fools the Zhihu server into treating the current login as a returning session; without a valid Cookie you will be met with a CAPTCHA. Before scraping any data, log in to Zhihu in your browser first, so that
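A sketch of carrying the captured Header and Cookie information into the spider's first request by overriding start_requests; the header values and the cookie name below are placeholders that would be replaced with values copied from a logged-in browser session:

import scrapy

class ZhihuLoginSpider(scrapy.Spider):
    name = 'zhihu_login'
    allowed_domains = ['www.zhihu.com']

    # Placeholder values; copy the real ones from your browser.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Referer': 'https://www.zhihu.com/',
    }
    cookies = {'session': '<your-session-token>'}  # hypothetical cookie name

    def start_requests(self):
        # Attach the session cookie so the server sees a returning login.
        yield scrapy.Request(
            'https://www.zhihu.com/',
            headers=self.headers,
            cookies=self.cookies,
            callback=self.parse,
        )

    def parse(self, response):
        pass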

Scrapy

只愿长相守 submitted on 2020-01-17 00:53:40
1. What is Scrapy?

Scrapy is an application framework written for crawling websites and extracting structured data; it is very well known and very powerful. A framework is a highly reusable project template that already integrates a whole range of capabilities (high-performance asynchronous downloading, queuing, distribution, parsing, persistence, and so on). When learning a framework, the key is to learn its characteristics and how to use each of its features.

1.1 Workflow of the five core components

2. Installation

Linux:
    pip3 install scrapy

Windows:
    a. pip3 install wheel
    b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    c. Change into the download directory and run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    d. pip3 install pywin32
    e. pip3 install scrapy

3. Basic usage

1. Create a project: scrapy startproject <project-name>

# C:\Users\yangzaigang>scrapy startproject pachong
# New Scrapy project 'pachong', using template directory 'c:\users\yangzaigang\appdata
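The post is cut off above. For completeness, a sketch of the usual commands that follow startproject, reusing the post's project name pachong (the spider name demo and the domain are placeholders):

cd pachong                          # enter the newly created project
scrapy genspider demo example.com   # generate a spider named 'demo'
scrapy crawl demo                   # run the spider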