scrapy

Scrapy: create a folder structure for downloaded images based on the URL from which the images are downloaded

Submitted by 时间秒杀一切 on 2020-01-01 03:39:06
Question: I have an array of links that defines the structure of a website. While downloading images from these links, I want to simultaneously place the downloaded images in a folder structure that mirrors the website structure, not just rename them (as answered in "Scrapy image download how to use custom filename"). My code for this is like this:

    class MyImagesPipeline(ImagesPipeline):
        """Custom image pipeline to rename images as they are being downloaded"""
        page_url = None

        def image_key(self, url):
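For reference, newer Scrapy versions replaced image_key with a file_path method on ImagesPipeline. Below is a minimal sketch, assuming the desired folder layout can be derived from each image URL's path; the class name is illustrative, and the pipeline would still need to be enabled through ITEM_PIPELINES with IMAGES_STORE pointing at the root download directory.

```python
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline


class FolderStructureImagesPipeline(ImagesPipeline):
    """Store each image under IMAGES_STORE using the URL path as subfolders,
    e.g. http://example.com/gallery/cats/01.jpg -> gallery/cats/01.jpg."""

    def file_path(self, request, response=None, info=None, *, item=None):
        # Reuse the path portion of the image URL as the relative target path.
        return urlparse(request.url).path.lstrip("/")
```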

Scraping multiple pages with Scrapy

Submitted by 醉酒当歌 on 2020-01-01 03:28:11
Question: I am trying to use Scrapy to scrape a website that has several pages of information. My code is:

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from tcgplayer1.items import Tcgplayer1Item

    class MySpider(BaseSpider):
        name = "tcg"
        allowed_domains = ["http://www.tcgplayer.com/"]
        start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

        def parse(self, response):
            hxs = Selector(response)
            titles = hxs.xpath("//div[@class='magicCard']")
            for title in
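For pagination like this, one common pattern is to extract the items on each page and then follow the link to the next page from parse. A rough sketch, with the next-page selector being an assumption about the site's markup:

```python
import scrapy


class TcgSpider(scrapy.Spider):
    name = "tcg_pages"
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        # Scrape the cards on the current page.
        for card in response.xpath("//div[@class='magicCard']"):
            yield {"title": card.xpath("normalize-space(.)").get()}

        # Follow the "next page" link if one exists (selector is assumed).
        next_page = response.css("a.nextPage::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```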

Scrapy: breaking through anti-crawler restrictions

Submitted by 末鹿安然 on 2020-01-01 03:24:00
7-1 The crawler vs. anti-crawler arms race and strategies

Basic concepts:
- Crawler: a program that automatically collects website data; the key point is collecting it in bulk.
- Anti-crawler: technical measures used to block crawler programs.
- False positives: an anti-crawling technique mistaking ordinary users for crawlers; if the false-positive rate is too high, the technique cannot be used no matter how effective it is. Blanket IP bans are therefore unlikely to be used.
- Cost: the human and machine resources anti-crawling requires.
- Interception: successfully blocking crawlers; generally, the higher the interception rate, the higher the false-positive rate.
- Naive crawlers: crude and aggressive, ignore server load, and can easily bring a site down.
- Why sites protect their data: runaway crawlers (crawlers that someone forgot about or cannot shut down) and commercial competitors.
- The back-and-forth between crawlers and anti-crawlers is quite an interesting process.

7-2 Scrapy architecture and source-code analysis
- Scrapy Engine: the engine, responsible for the communication, signals, and data transfer between the Spiders, Item Pipeline, Downloader, and Scheduler (rather like a human body, isn't it?).
- Scheduler: accepts the requests sent over by the engine, organizes and queues them in a certain order, and hands them back when the Scrapy Engine asks for them.
- Downloader: downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them on to the Spiders for processing.
- Spiders: process all Responses, analyzing and extracting data from them to obtain the data needed for the Item fields.
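As a small illustration of the pieces described above (the notes themselves contain no code), here is a sketch of a downloader middleware that rotates the User-Agent header, one of the most basic counter-measures in the crawler/anti-crawler back-and-forth. The class name and agent strings are assumptions, and the middleware would be enabled through DOWNLOADER_MIDDLEWARES in settings.py.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware: rewrites each outgoing Request's User-Agent
    before the Downloader fetches it (it sits between Engine and Downloader)."""

    # Illustrative pool; a real project would load a much larger list.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # continue processing the request normally
```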

HTTP 403 Responses when using Python Scrapy

Submitted by 巧了我就是萌 on 2020-01-01 03:11:31
Question: I am using Python.org version 2.7, 64-bit, on Windows Vista 64-bit. I have been testing the following Scrapy code to recursively scrape all the pages at the site www.whoscored.com, which is for football statistics:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.item import Item
    from scrapy.spider import BaseSpider
    from scrapy import log
    from scrapy.cmdline import execute
    from
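A 403 here is often the server rejecting Scrapy's default User-Agent rather than a bug in the spider. A minimal first thing to try, sketched below as settings.py overrides (the header string is only an example, and a site may still block scraping by policy):

```python
# settings.py

# Present a browser-like User-Agent instead of Scrapy's default "Scrapy/x.y".
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
)

# Be gentler with the server and retry transient 403/429 responses.
DOWNLOAD_DELAY = 1.0
RETRY_HTTP_CODES = [403, 429, 500, 502, 503, 504]
```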

The Scrapy development workflow (Notes 10)

Submitted by 爷，独闯天下 on 2020-01-01 01:22:36
1. Create the project: scrapy startproject <project name>
2. Create the spider file: scrapy genspider <spider name> <website> (placeholder).
3. Open the spider file and put the URLs to crawl into start_urls. start_urls holds the starting URLs: once Scrapy starts, it downloads the URLs in start_urls, and each downloaded response is handed to the parse method for processing.
4. Change the configuration in settings.py (a settings sketch follows this list):
   1. Scrapy obeys robots.txt by default; switch that off with ROBOTSTXT_OBEY = False.
   2. Set the download request headers: 'User-Agent': '', 'Accept': '', 'Accept-Language': ''.
   3. If you add a cookie to the request headers and want it to take effect, you must also set COOKIES_ENABLED = False. This setting turns off Scrapy's own cookie handling during downloads, so user-defined cookies can be used.
5. In the parse method, verify that the response contains data:
       def parse(self, response):
           print(response.text)
6. Define the fields to scrape in items.py.
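A minimal settings.py sketch of the changes described in step 4 (the empty header values are placeholders to be filled in for the target site):

```python
# settings.py

# Step 4.1: Scrapy obeys robots.txt by default; turn that off.
ROBOTSTXT_OBEY = False

# Step 4.2: headers sent with every request (values are placeholders).
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "",
    "Accept": "",
    "Accept-Language": "",
}

# Step 4.3: disable Scrapy's own cookie handling so a manually supplied
# Cookie header is sent unchanged.
COOKIES_ENABLED = False
```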

Using phantomjs for dynamic content with scrapy and selenium: possible race condition

Submitted by 笑着哭i on 2019-12-31 22:57:14
Question: First off, this is a follow-up to this question: Change number of running spiders scrapyd. I used PhantomJS and Selenium to create a downloader middleware for my Scrapy project. It works well and hasn't really slowed things down when I run my spiders one at a time locally. But just recently I put a scrapyd server up on AWS, and I noticed a possible race condition that seems to be causing errors and performance issues when more than one spider is running at once. I feel like the problem stems
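The excerpt stops before the diagnosis, but one plausible source of such a race is spiders sharing a single browser/driver instance. A rough sketch, under that assumption, of a downloader middleware that gives each spider its own driver and tears it down on close (PhantomJS as in the question, though any headless driver fits the same shape):

```python
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumDownloaderMiddleware:
    """One headless browser per spider, created on spider_opened and closed
    on spider_closed, so concurrent spiders never share a driver."""

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_opened(self, spider):
        self.driver = webdriver.PhantomJS()  # as in the question; now deprecated

    def spider_closed(self, spider):
        self.driver.quit()

    def process_request(self, request, spider):
        # Render the page in the browser and hand the HTML back to Scrapy.
        self.driver.get(request.url)
        return HtmlResponse(
            url=request.url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )
```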

Scrapy + splash: can't select element

Submitted by 99封情书 on 2019-12-31 17:25:15
Question: I'm learning to use Scrapy with Splash. As an exercise, I'm trying to visit https://www.ubereats.com/stores/, click on the address text box, enter a location, and then press the Enter button to move to the next page, which lists the restaurants available for that location. I have the following Lua code:

    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(5))
        local element = splash:select('.base_29SQWm')
        local bounds = element:bounds()
        assert(element:mouseclick
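For context on how a Lua script like this is wired into a spider, here is a sketch of the Python side using scrapy-splash's SplashRequest and the execute endpoint. It assumes scrapy-splash is installed and configured (SPLASH_URL plus its middlewares in settings.py), and the script body is abbreviated to the part shown in the question.

```python
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(5))
    -- ... element selection and mouse/keyboard interaction as in the question ...
    return {html = splash:html()}
end
"""


class UberEatsSpider(scrapy.Spider):
    name = "ubereats"

    def start_requests(self):
        yield SplashRequest(
            "https://www.ubereats.com/stores/",
            callback=self.parse,
            endpoint="execute",  # run the Lua script above
            args={"lua_source": LUA_SCRIPT, "timeout": 90},
        )

    def parse(self, response):
        # response.text is the HTML returned by the script's splash:html()
        self.logger.info("Rendered page length: %d", len(response.text))
```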

python-scrapy: how to fetch a URL (not via following links) inside a spider?

Submitted by ▼魔方 西西 on 2019-12-31 07:07:20
Question: How can I have something inside my spider that fetches a given URL, so I can extract something from that page via HtmlXPathSelector? The URL is a string I want to supply in the code, not a link to follow. I tried something like this:

    req = urllib2.Request('http://www.example.com/' + some_string + '/')
    req.add_header('User-Agent', 'Mozilla/5.0')
    response = urllib2.urlopen(req)
    hxs = HtmlXPathSelector(response)

but at this point it throws an exception with: [Failure instance:
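The idiomatic fix for this kind of problem is usually to let Scrapy download the extra URL instead of calling urllib2, by yielding a Request built from the string and parsing it in a callback. A minimal sketch (the example.com URL, the string, and the spider name are placeholders):

```python
import scrapy


class FetchByStringSpider(scrapy.Spider):
    name = "fetch_by_string"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        some_string = "some-page"  # placeholder for the string built in code
        url = "http://www.example.com/" + some_string + "/"
        # Let Scrapy schedule and download the URL; the callback receives a
        # response that can be queried directly, replacing HtmlXPathSelector.
        yield scrapy.Request(url, callback=self.parse_extra)

    def parse_extra(self, response):
        yield {"title": response.xpath("//title/text()").get()}
```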

How to target data attribute with Scrapy

Submitted by 自作多情 on 2019-12-31 04:17:05
Question: I'm using the Scrapy library to crawl a webpage, but I have a problem: I do not know how to target a data attribute. I have a link with a data attribute and an href as follows:

    <a data-item-name="detail-page-link" href="this-is-some-link">

What I want is the value of href. If it had a class, I could do it as follows:

    response.css('.some-class::attr(href)')

But the problem is that I do not know how to target the data-item-name attribute. Any advice?

Answer 1: Using a Scrapy CSS selector, you can do: response.css('a
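The answer excerpt is cut off above; the attribute-selector syntax it appears to be heading toward looks like the sketch below (with the equivalent XPath form included for comparison):

```python
# CSS attribute selector: match the <a> by its data attribute, then take href.
href = response.css('a[data-item-name="detail-page-link"]::attr(href)').get()

# Equivalent XPath, if preferred.
href = response.xpath('//a[@data-item-name="detail-page-link"]/@href').get()
```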