scrapy

Python Scrapy: collected beginner problems (Part 1)

纵然是瞬间 Submitted on 2020-03-06 01:44:33
1. AttributeError: 'FeedExporter' object has no attribute 'slot'. Solution: the file Scrapy needs to write to is locked by another program, so nothing can be written to it. Close the open CSV file.
2. One of the reasons no data gets scraped. Solution: the request is not disguised as a browser because the User-Agent header is missing; copy a user-agent string from your browser. Steps: 1. Open your usual browser. 2. Press F12, then refresh the page (F5) and click the Network tab. 3. Paste the copied value into your own header, e.g. header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"}
3. UnicodeEncodeError: 'charmap' codec can't encode characters in position xx: character maps to <undefined> appears while Scrapy is scraping. Solution: # connect to the database self.db_conn = MySQLdb.connect(db=db_name, host=host, user=
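For the fixes above, here is a minimal sketch (the connection parameters are placeholders, not the article's full code): setting a browser User-Agent and connecting to MySQL with an explicit utf8mb4 charset, which usually avoids the 'charmap' encoding error when storing scraped text.

```python
# Minimal sketch of the two fixes described above; db_name, host, user and
# password are placeholders, not values from the original article.
import MySQLdb

# A browser User-Agent so the target site does not reject the request as a bot;
# it can be passed per-request via headers= or set globally in settings.py.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
    )
}

def connect_db(db_name, host, user, password):
    # An explicit utf8mb4 charset is the usual way to avoid
    # encoding errors when writing scraped text to MySQL.
    return MySQLdb.connect(
        db=db_name,
        host=host,
        user=user,
        passwd=password,
        charset="utf8mb4",
        use_unicode=True,
    )
```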

python3 scrapy

情到浓时终转凉″ Submitted on 2020-03-05 16:21:39
Scrapy basic tutorial 1. Getting to know Scrapy. Here is a diagram of the Scrapy workflow (the image was downloaded from Baidu). What each part of Scrapy does:
  1. Scrapy Engine: handles the communication, signals and data transfer between the Spider, Item Pipeline, Downloader and Scheduler.
  2. Scheduler: receives the requests sent over by the engine, organizes and enqueues them in a defined order, and hands them back when the engine asks for them.
  3. Downloader: downloads every request sent by the Scrapy Engine and returns the responses it obtains to the engine, which passes them on to the Spider for processing.
  4. Spider: processes all responses, analyses and extracts data from them, fills in the fields the item needs, and submits the URLs that still need to be followed back to the engine so they re-enter the Scheduler.
  5. Item Pipeline: processes the items produced by the Spider and performs the post-processing (detailed analysis, filtering, storage, etc.).
  6. Downloader Middlewares: a component you can think of as a customizable extension of the downloading functionality.
  7. Spider Middlewares
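A minimal, hypothetical spider illustrating that flow (quotes.toscrape.com is a common practice site, not part of the original tutorial): start_urls become requests scheduled by the engine, and parse() receives each response, yielding items for the Item Pipeline plus follow-up requests that re-enter the Scheduler.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Each yielded dict is an item that flows on to the Item Pipeline.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # A yielded request goes back through the engine to the Scheduler,
        # then to the Downloader, and its response returns to parse().
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```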

How to grab red packets scientifically: a new way to get rich at year's end, write a program to grab red packets

为君一笑 Submitted on 2020-03-05 12:08:23
0×00 Background Today I read "How to grab red packets scientifically: a new way to get rich at year's end, write a program to grab red packets" from IDF Lab. I have been studying crawlers recently and have some understanding of the Scrapy framework, so I added Scrapy on top of that code and used it to rewrite the article's section "0×04 Crawling the red-packet list".
0×01 The Scrapy framework Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl web sites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring and automated testing. What makes Scrapy attractive is that it is a framework anyone can easily adapt to their needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and the latest version adds support for web 2.0 crawling. "Scratch" means to scrape; this Python crawler framework is called Scrapy, presumably with the same meaning, so let's just call it "the little scraper". In one sentence: with Scrapy you can write a crawler very easily.
0×02 Weibo login, red-packet availability check, and targeted red-packet grabbing modules I put these modules in a separate weibo class to make the later Scrapy calls and analysis easier. For the Weibo login part, you can refer to http://www.tuicool.com/articles/ziyQFrb, which records the whole Weibo login process in detail. The code was copied from an expert and slightly modified on that basis: the requests library is used to make the page requests. #-
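The article's login code is not reproduced in this excerpt; the sketch below only illustrates the division of labour it describes (requests for the Weibo login, Scrapy for the crawling), with placeholder URLs and form fields rather than the real Weibo login flow.

```python
# Hypothetical sketch: log in with requests, then reuse the cookies in Scrapy.
# The URLs and form fields are placeholders, not Weibo's real endpoints.
import requests
import scrapy

def weibo_login(username, password):
    session = requests.Session()
    # The real login involves pre-login requests and password encryption;
    # see the tuicool article referenced above for the full procedure.
    session.post("https://example.com/login",
                 data={"username": username, "password": password})
    return session.cookies.get_dict()

class RedPacketSpider(scrapy.Spider):
    name = "redpacket"

    def __init__(self, cookies=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cookies = cookies or {}

    def start_requests(self):
        # Hand the logged-in cookies to Scrapy so its requests are authenticated.
        yield scrapy.Request("https://example.com/hongbao/list",
                             cookies=self.cookies, callback=self.parse)

    def parse(self, response):
        # The red-packet list would be extracted here (selectors omitted).
        pass
```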

Missing data with Export CSV for multiple spiders / Scrapy / Pipeline

こ雲淡風輕ζ Submitted on 2020-03-05 05:35:48
Question I implemented a pipeline based on some example from around here. I'm trying to export all the information for multiple spiders (launched by a single file, not from the command line) into a single CSV file. However, it appears that some data (around 10%) shown in the shell isn't recorded in the CSV. Is this because the spiders are writing at the same time? How could I fix my script so that it collects all the data in a single CSV? I'm using CrawlerProcess to launch the spiders. from scrapy import signals from
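A common way to avoid concurrently running spiders clobbering each other's output is to give each spider its own exporter and file handle. The sketch below assumes that approach and is not the asker's original pipeline:

```python
# Hedged sketch: one CsvItemExporter per spider, opened in open_spider and
# flushed/closed in close_spider, so spiders never share a writer.
from scrapy.exporters import CsvItemExporter

class PerSpiderCsvPipeline:
    def open_spider(self, spider):
        self.file = open(f"{spider.name}.csv", "wb")
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # finish_exporting + close guarantee every buffered row reaches disk.
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```

The per-spider files can be concatenated into one CSV after the CrawlerProcess has finished, which sidesteps the concurrency problem entirely.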

How to navigate through js/ajax(href=“#”) based pagination with Scrapy?

你。 Submitted on 2020-03-05 05:01:50
Question I want to iterate through all the category urls and scrape the content from each page. Although urls = [response.xpath('//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()[0]] in this code only fetches the first category url, my goal is to fetch all the urls and the content inside each url. I'm using the scrapy_selenium library. The Selenium page source is not being passed to the 'scrap_it' function. Please review my code and let me know if there's anything wrong in it. I'm new
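For reference, a minimal sketch of how looping over every category URL might look with scrapy_selenium; the start URL and the parse_category callback are placeholders, not the asker's code:

```python
# Hedged sketch: issue SeleniumRequest for every category link so each
# callback receives the Selenium-rendered page source.
from scrapy import Spider
from scrapy_selenium import SeleniumRequest

class CategorySpider(Spider):
    name = "categories"

    def start_requests(self):
        yield SeleniumRequest(
            url="https://example.com/categories",  # placeholder URL
            callback=self.parse,
            wait_time=5,
        )

    def parse(self, response):
        # Loop over every category link instead of taking only the first one.
        urls = response.xpath(
            '//ul[@class="flexboxesmain categorieslist"]/li/a/@href').extract()
        for url in urls:
            yield SeleniumRequest(url=response.urljoin(url),
                                  callback=self.parse_category,
                                  wait_time=5)

    def parse_category(self, response):
        # response.text here is the rendered source, so JS-inserted content
        # is available to the selectors.
        ...
```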

Scrapy using CSS to extract data and excel export everything into one cell

喜欢而已 Submitted on 2020-03-05 03:16:11
Question Here is the spider:
import scrapy
import re
from ..items import HomedepotSpiderItem

class HomedepotcrawlSpider(scrapy.Spider):
    name = 'homeDepotCrawl'
    allowed_domains = ['homedepot.com']
    start_urls = ['https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default']

    def parse(self, response):
        items = HomedepotSpiderItem()
        # get model
        productName = response.css('.pod-plp__description.js-podclick
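A hedged sketch of the usual fix when everything lands in one cell: iterate over each product block and yield one item per product with plain string fields (via .get()) instead of one item whose fields are whole lists. The pod selector below is a placeholder, not taken from the real page markup.

```python
import scrapy

class HomedepotcrawlSpider(scrapy.Spider):
    name = "homeDepotCrawl"
    allowed_domains = ["homedepot.com"]
    start_urls = ["https://www.homedepot.com/b/ZLINE-Kitchen-and-Bath/N-5yc1vZhsy/Ntk-ProductInfoMatch/Ntt-zline?NCNI-5&storeSelection=3304,3313,3311,3310,8560&experienceName=default"]

    def parse(self, response):
        # One yielded dict per product = one CSV row per product.
        for pod in response.css("div.pod-plp__container"):  # placeholder selector
            yield {
                # .get() returns a single string, so each cell holds one value.
                "name": pod.css(".pod-plp__description::text").get(default="").strip(),
                "price": pod.css(".price__numbers::text").get(default="").strip(),
            }
```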

scrapy output item as 1 list element per row

别来无恙 Submitted on 2020-03-05 01:33:10
Question New to scrapy, and I have looked everywhere over the past week or more for a solution to my problem. I am trying to scrape tabular data for UFC 1 at http://ufcstats.com/event-details/6420efac0578988b. My spider is working fine and it returns each item as a list of strings. For example: 'winner': ['Royce Gracie', 'Jason DeLucia', 'Royce Gracie', 'Gerard Gordeau', 'Ken Shamrock', 'Royce Gracie', 'Kevin Rosier', 'Gerard Gordeau']} When I output to csv, the event winners/losers/other stats are
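A hedged sketch of the usual fix: pair the parallel lists up and yield one item per fight, so the CSV exporter writes one row per element. The selectors and field names below are placeholders, not the asker's item definition.

```python
import scrapy

class UfcEventSpider(scrapy.Spider):
    name = "ufc_event"
    start_urls = ["http://ufcstats.com/event-details/6420efac0578988b"]

    def parse(self, response):
        # Placeholder selectors: the real ones depend on the table markup.
        winners = response.css("td.winner a::text").getall()
        losers = response.css("td.loser a::text").getall()
        # zip() pairs the parallel lists so each fight becomes its own item,
        # and therefore its own row in the exported CSV.
        for winner, loser in zip(winners, losers):
            yield {"winner": winner.strip(), "loser": loser.strip()}
```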

json file gets damaged while putting it into a zip archive with python

泄露秘密 Submitted on 2020-03-04 11:04:31
Question After crawling a site with scrapy, I am creating a zip archive in the closing method and pulling pictures into it. Then I add a valid json file to the archive. After unzipping (on mac os x or ubuntu), the json file shows up damaged: the last item is missing. End of decompressed file: ..a46.jpg"]}, Original file: a46.jpg"]}] Code:
# create zip archive with all images inside
filename = '../zip/' + datetime.datetime.now().strftime("%Y%m%d-%H%M") + '_' + name
imagefolder = 'full'
imagepath =
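One likely cause is zipping the JSON feed before Scrapy has flushed and closed it, so the archive captures a truncated file. Below is a hedged sketch of archiving only after the crawl has fully finished; the paths and names are placeholders, not the asker's exact code.

```python
import datetime
import zipfile
from pathlib import Path

def build_archive(name, json_feed="items.json", imagefolder="full"):
    # Same naming scheme as the snippet above, with placeholder inputs.
    filename = "../zip/" + datetime.datetime.now().strftime("%Y%m%d-%H%M") + "_" + name
    with zipfile.ZipFile(filename + ".zip", "w", zipfile.ZIP_DEFLATED) as archive:
        for image in Path(imagefolder).glob("*.jpg"):
            archive.write(image, arcname=image.name)
        # The feed file must already be flushed and closed at this point,
        # e.g. by calling this function after CrawlerProcess.start() returns
        # rather than inside the spider's closed() method.
        archive.write(json_feed, arcname=Path(json_feed).name)
```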