scrapy

Scrapy: create a folder structure for downloaded images based on the URL from which the images are downloaded

Submitted by 时间秒杀一切 on 2020-01-01 03:39:06
Question: I have an array of links that defines the structure of a website. While downloading images from these links, I want to simultaneously place the downloaded images in a folder structure that mirrors the website structure, not just rename them (as answered in "Scrapy image download how to use custom filename"). My code for this is like this:

    class MyImagesPipeline(ImagesPipeline):
        """Custom image pipeline to rename images as they are being downloaded"""
        page_url = None

        def image_key(self, url):
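For reference, newer Scrapy versions replaced image_key with a file_path method on ImagesPipeline. Below is a minimal sketch, assuming the desired folder layout can be derived from each image URL's path; the class name is illustrative, and the pipeline would still need to be enabled through ITEM_PIPELINES with IMAGES_STORE pointing at the root download directory.

```python
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline


class FolderStructureImagesPipeline(ImagesPipeline):
    """Store each image under IMAGES_STORE using the URL path as subfolders,
    e.g. http://example.com/gallery/cats/01.jpg -> gallery/cats/01.jpg."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # Reuse the path portion of the image URL as the relative target path.
        return urlparse(request.url).path.lstrip("/")
```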

Scraping multiple pages with Scrapy

Submitted by 醉酒当歌 on 2020-01-01 03:28:11
Question: I am trying to use Scrapy to scrape a website that has several pages of information. My code is:

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from tcgplayer1.items import Tcgplayer1Item

    class MySpider(BaseSpider):
        name = "tcg"
        allowed_domains = ["http://www.tcgplayer.com/"]
        start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

        def parse(self, response):
            hxs = Selector(response)
            titles = hxs.xpath("//div[@class='magicCard']")
            for title in
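For pagination like this, one common pattern is to extract the items on each page and then follow the link to the next page from parse. A rough sketch, with the next-page selector being an assumption about the site's markup:

```python
import scrapy


class TcgSpider(scrapy.Spider):
    name = "tcg_pages"
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        # Scrape the cards on the current page.
        for card in response.xpath("//div[@class='magicCard']"):
            yield {"title": card.xpath("normalize-space(.)").get()}

        # Follow the "next page" link if one exists (selector is assumed).
        next_page = response.css("a.nextPage::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```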

Scrapy: breaking through anti-crawler restrictions

Submitted by 末鹿安然 on 2020-01-01 03:24:00
7-1 The crawler vs. anti-crawler arms race and strategies

Basic concepts:
- Crawler: a program that automatically collects website data; the key point is collecting it in bulk.
- Anti-crawler: technical measures used to block crawler programs.
- False positives: an anti-crawling technique mistaking ordinary users for crawlers; if the false-positive rate is too high, the technique cannot be used no matter how effective it is. Blanket IP bans are therefore unlikely to be used.
- Cost: the human and machine resources anti-crawling requires.
- Interception: successfully blocking crawlers; generally, the higher the interception rate, the higher the false-positive rate.
- Naive crawlers: crude and aggressive, ignore server load, and can easily bring a site down.
- Why sites protect their data: runaway crawlers (crawlers that someone forgot about or cannot shut down) and commercial competitors.
- The back-and-forth between crawlers and anti-crawlers is quite an interesting process.

7-2 Scrapy architecture and source-code analysis
- Scrapy Engine: the engine, responsible for the communication, signals, and data transfer between the Spiders, Item Pipeline, Downloader, and Scheduler (rather like a human body, isn't it?).
- Scheduler: accepts the requests sent over by the engine, organizes and queues them in a certain order, and hands them back when the Scrapy Engine asks for them.
- Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them on to the Spiders for processing.
- Spiders: process all Responses, analyzing and extracting data from them to obtain the data needed for the Item fields.
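As a small illustration of the pieces described above (the notes themselves contain no code), here is a sketch of a downloader middleware that rotates the User-Agent header, one of the most basic counter-measures in the crawler/anti-crawler back-and-forth. The class name and agent strings are assumptions, and the middleware would be enabled through DOWNLOADER_MIDDLEWARES in settings.py.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware: rewrites each outgoing Request's User-Agent
    before the Downloader fetches it (it sits between Engine and Downloader)."""

    # Illustrative pool; a real project would load a much larger list.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # continue processing the request normally
```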

HTTP 403 Responses when using Python Scrapy

Submitted by 巧了我就是萌 on 2020-01-01 03:11:31
Question: I am using Python.org version 2.7, 64-bit, on Windows Vista 64-bit. I have been testing the following Scrapy code to recursively scrape all the pages at the site www.whoscored.com, which is for football statistics:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.item import Item
    from scrapy.spider import BaseSpider
    from scrapy import log
    from scrapy.cmdline import execute
    from
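A 403 here is often the server rejecting Scrapy's default User-Agent rather than a bug in the spider. A minimal first thing to try, sketched below as settings.py overrides (the header string is only an example, and a site may still block scraping by policy):

```python
# settings.py

# Present a browser-like User-Agent instead of Scrapy's default "Scrapy/x.y".
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
)

# Be gentler with the server and retry transient 403/429 responses.
DOWNLOAD_DELAY = 1.0
RETRY_HTTP_CODES = [403, 429, 500, 502, 503, 504]
```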

The Scrapy development workflow (Notes 10)

Submitted by 爷，独闯天下 on 2020-01-01 01:22:36
1. Create the project: scrapy startproject <project name>
2. Create the spider file: scrapy genspider <spider name> <website> (placeholder).
3. Open the spider file and put the URLs to crawl into start_urls. start_urls holds the starting URLs: once Scrapy starts, it downloads the URLs in start_urls, and each downloaded response is handed to the parse method for processing.
4. Change the configuration in settings.py (a settings sketch follows this list):
   1. Scrapy obeys robots.txt by default; switch that off with ROBOTSTXT_OBEY = False.
   2. Set the download request headers: 'User-Agent': '', 'Accept': '', 'Accept-Language': ''.
   3. If you add a cookie to the request headers and want it to take effect, you must also set COOKIES_ENABLED = False. This setting turns off Scrapy's own cookie handling during downloads, so user-defined cookies can be used.
5. In the parse method, verify that the response contains data:
       def parse(self, response):
           print(response.text)
6. Define the fields to scrape in items.py.
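A minimal settings.py sketch of the changes described in step 4 (the empty header values are placeholders to be filled in for the target site):

```python
# settings.py

# Step 4.1: Scrapy obeys robots.txt by default; turn that off.
ROBOTSTXT_OBEY = False

# Step 4.2: headers sent with every request (values are placeholders).
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "",
    "Accept": "",
    "Accept-Language": "",
}

# Step 4.3: disable Scrapy's own cookie handling so a manually supplied
# Cookie header is sent unchanged.
COOKIES_ENABLED = False
```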

Using phantomjs for dynamic content with scrapy and selenium: possible race condition

Submitted by 笑着哭i on 2019-12-31 22:57:14
Question: First off, this is a follow-up to this question: Change number of running spiders scrapyd. I used PhantomJS and Selenium to create a downloader middleware for my Scrapy project. It works well and hasn't really slowed things down when I run my spiders one at a time locally. But just recently I put a scrapyd server up on AWS, and I noticed a possible race condition that seems to be causing errors and performance issues when more than one spider is running at once. I feel like the problem stems
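The excerpt stops before the diagnosis, but one plausible source of such a race is spiders sharing a single browser/driver instance. A rough sketch, under that assumption, of a downloader middleware that gives each spider its own driver and tears it down on close (PhantomJS as in the question, though any headless driver fits the same shape):

```python
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumDownloaderMiddleware:
    """One headless browser per spider, created on spider_opened and closed
    on spider_closed, so concurrent spiders never share a driver."""

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        self.driver = webdriver.PhantomJS()  # as in the question; now deprecated

    def spider_closed(self, spider):
        self.driver.quit()

    def process_request(self, request, spider):
        # Render the page in the browser and hand the HTML back to Scrapy.
        self.driver.get(request.url)
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```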

Scrapy + splash: can't select element

Submitted by 99封情书 on 2019-12-31 17:25:15
Question: I'm learning to use Scrapy with Splash. As an exercise, I'm trying to visit https://www.ubereats.com/stores/, click on the address text box, enter a location, and then press the Enter button to move to the next page, which lists the restaurants available for that location. I have the following Lua code:

    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(5))
        local element = splash:select('.base_29SQWm')
        local bounds = element:bounds()
        assert(element:mouseclick
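For context on how a Lua script like this is wired into a spider, here is a sketch of the Python side using scrapy-splash's SplashRequest and the execute endpoint. It assumes scrapy-splash is installed and configured (SPLASH_URL plus its middlewares in settings.py), and the script body is abbreviated to the part shown in the question.

```python
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(5))
    -- ... element selection and mouse/keyboard interaction as in the question ...
    return {html = splash:html()}
end
"""


class UberEatsSpider(scrapy.Spider):
    name = "ubereats"

    def start_requests(self):
        yield SplashRequest(
            "https://www.ubereats.com/stores/",
            callback=self.parse,
            endpoint="execute",  # run the Lua script above
            args={"lua_source": LUA_SCRIPT, "timeout": 90},
        )

    def parse(self, response):
        # response.text is the HTML returned by the script's splash:html()
        self.logger.info("Rendered page length: %d", len(response.text))
```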

python-scrapy: how to fetch a URL (not via following links) inside a spider?

Submitted by ▼魔方 西西 on 2019-12-31 07:07:20
Question: How can I have something inside my spider that fetches a given URL, so I can extract something from that page via HtmlXPathSelector? The URL is a string I want to supply in the code, not a link to follow. I tried something like this:

    req = urllib2.Request('http://www.example.com/' + some_string + '/')
    req.add_header('User-Agent', 'Mozilla/5.0')
    response = urllib2.urlopen(req)
    hxs = HtmlXPathSelector(response)

but at this point it throws an exception with: [Failure instance:
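The idiomatic fix for this kind of problem is usually to let Scrapy download the extra URL instead of calling urllib2, by yielding a Request built from the string and parsing it in a callback. A minimal sketch (the example.com URL, the string, and the spider name are placeholders):

```python
import scrapy


class FetchByStringSpider(scrapy.Spider):
    name = "fetch_by_string"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        some_string = "some-page"  # placeholder for the string built in code
        url = "http://www.example.com/" + some_string + "/"
        # Let Scrapy schedule and download the URL; the callback receives a
        # response that can be queried directly, replacing HtmlXPathSelector.
        yield scrapy.Request(url, callback=self.parse_extra)

    def parse_extra(self, response):
        yield {"title": response.xpath("//title/text()").get()}
```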

How to target data attribute with Scrapy

Submitted by 自作多情 on 2019-12-31 04:17:05
Question: I'm using the Scrapy library to crawl a webpage, but I have a problem: I do not know how to target a data attribute. I have a link with a data attribute and an href as follows:

    <a data-item-name="detail-page-link" href="this-is-some-link">

What I want is the value of href. If it had a class, I could do it as follows:

    response.css('.some-class::attr(href)')

But the problem is that I do not know how to target the data-item-name attribute. Any advice?

Answer 1: Using a Scrapy CSS selector, you can do: response.css('a
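The answer excerpt is cut off above; the attribute-selector syntax it appears to be heading toward looks like the sketch below (with the equivalent XPath form included for comparison):

```python
# CSS attribute selector: match the <a> by its data attribute, then take href.
href = response.css('a[data-item-name="detail-page-link"]::attr(href)').get()

# Equivalent XPath, if preferred.
href = response.xpath('//a[@data-item-name="detail-page-link"]/@href').get()
```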