scrapy

How to increase Scrapy crawling speed?

Submitted by 可紊 on 2020-01-14 14:08:05
Question: I am using Scrapy to crawl websites and extract data to a JSON file, but I've found that for some sites the crawler takes ages to crawl the complete website. My question is: how can I minimize the time taken to crawl?

Answer 1: Try tuning CONCURRENT_ITEMS, CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and other settings. For the full list of settings, see http://doc.scrapy.org/en/latest/topics/settings.html

Source: https://stackoverflow.com/questions/19109871/how-to-increase-scrapy-crawling
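As an illustration of the answer's advice, here is a minimal settings.py sketch; the values are illustrative starting points, not recommendations from the answer, and the defaults in the comments come from the Scrapy documentation:

```python
# settings.py -- illustrative values for the settings the answer names;
# tune against the target site's capacity and your politeness requirements.
CONCURRENT_REQUESTS = 32             # max concurrent requests overall (default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # max concurrent requests per domain (default: 8)
CONCURRENT_ITEMS = 200               # items processed in parallel per response (default: 100)
DOWNLOAD_DELAY = 0                   # any per-request delay slows the crawl (default: 0)
```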

Scrapy spider_idle signal not received in my extension

Submitted by ╄→гoц情女王★ on 2020-01-14 10:43:11
Question: I have common behaviour shared between several spiders when the spider_idle signal is received, and I would like to move this behaviour into an extension. My extension already listens for the spider_opened and spider_closed signals successfully. However, the spider_idle signal is not received. Here is my extension (edited for brevity):

```python
import logging
import MySQLdb
import MySQLdb.cursors
from scrapy import signals

logger = logging.getLogger(__name__)

class MyExtension(object):
    def __init__(self, settings,
```
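The snippet is cut off before the signal wiring, so the actual cause is not visible here. As a sketch of the usual pattern (assumed, not taken from the asker's code), spider_idle must get its own connect() call in from_crawler, exactly like the other signals:

```python
from scrapy import signals

class MyExtension(object):
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # Listening for the two signals above does not subscribe the
        # extension to spider_idle; it needs an explicit connect() too.
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_opened(self, spider):
        spider.logger.info("opened %s", spider.name)

    def spider_closed(self, spider):
        spider.logger.info("closed %s", spider.name)

    def spider_idle(self, spider):
        # Raise scrapy.exceptions.DontCloseSpider here to keep the crawl alive.
        spider.logger.info("idle signal received for %s", spider.name)
```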

Python Crawler for Beginners (39): Getting Started with the JavaScript Rendering Service scrapy-splash

Submitted by 拈花ヽ惹草 on 2020-01-14 09:54:21
Life is short; I use Python. Previous posts in this series:

- Python Crawler for Beginners (1): Introduction
- Python Crawler for Beginners (2): Preparation (Part 1): Installing the Basic Libraries
- Python Crawler for Beginners (3): Preparation (Part 2): Linux Basics
- Python Crawler for Beginners (4): Preparation (Part 3): Docker Basics
- Python Crawler for Beginners (5): Preparation (Part 4): Database Basics
- Python Crawler for Beginners (6): Preparation (Part 5): Installing the Crawler Frameworks
- Python Crawler for Beginners (7): HTTP Basics
- Python Crawler for Beginners (8): Web Page Basics
- Python Crawler for Beginners (9): Crawler Basics
- Python Crawler for Beginners (10): Sessions and Cookies
- Python Crawler for Beginners (11): urllib Basics (Part 1)
- Python Crawler for Beginners (12): urllib Basics (Part 2)
- Python Crawler for Beginners (13): urllib Basics (Part 3)
- Python Crawler for Beginners (14): urllib Basics (Part 4)
- Python Crawler for Beginners (15): urllib Basics (Part 5)
- Python Crawler for Beginners (16): urllib in Practice: Scraping Meizitu
- Python Crawler for Beginners (17): Requests Basics
- Python Crawler for Beginners (18): Advanced Requests Usage
- Python Crawler for Beginners (19): XPath Basics
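The captured body above is only the series index; the post itself is a scrapy-splash primer. As a quick sketch of what first contact with scrapy-splash looks like (assuming a Splash instance at localhost:8050 and SPLASH_URL plus the downloader middlewares configured per the scrapy-splash README; the spider name and target URL are illustrative):

```python
import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class JsSpider(scrapy.Spider):
    name = "js_demo"

    def start_requests(self):
        # SplashRequest routes the page through Splash so that JavaScript
        # is executed before the response reaches the spider.
        yield SplashRequest(
            "http://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 1},  # give the page one second to render
        )

    def parse(self, response):
        for quote in response.css("div.quote span.text::text").getall():
            yield {"quote": quote}
```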

Scrapy is following and scraping non-allowed links

Submitted by 痞子三分冷 on 2020-01-14 08:58:54
Question: I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the links to each issue follow this URL scheme: http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number. I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider seems determined to follow non-allowed links
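The asker's own rules are cut off above. As a hedged sketch of rules that extract only issues 928 and up (the spider name and callback are illustrative, and the regex is just one way to express the constraint; links that don't match `allow` are simply never extracted):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MagazineSpider(CrawlSpider):
    name = "magazine"  # illustrative name
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        # 92[89] matches 928-929, 9[3-9]\d matches 930-999,
        # \d{4} matches any four-digit issue number.
        Rule(
            LinkExtractor(allow=(r"/\d{4}/(?:92[89]|9[3-9]\d|\d{4})/index\.htm$",)),
            callback="parse_issue",
            follow=False,
        ),
    )

    def parse_issue(self, response):
        yield {"url": response.url}
```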

Scrapy: enabling files pipeline for absolute and relative paths?

Submitted by 人盡茶涼 on 2020-01-14 05:54:48
Question: What am I missing in my code (see the "Current Code" section below) that would enable me to download files from both absolute and relative paths using Scrapy? I appreciate the help. I'm feeling lost as to how all of these components work together and how to get the desired behavior. Background: I've used a combination of poring over the Scrapy docs, finding comparable examples on GitHub, and trawling StackOverflow for answers, but I can't get the Scrapy files pipeline to work in the way
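The asker's code is not included above, so the actual gap is unknown. One common approach (an assumption, not the asker's confirmed fix) is to normalize every extracted path with response.urljoin() before handing it to the files pipeline, since FilesPipeline expects absolute URLs in file_urls; the spider name and start URL here are illustrative:

```python
import scrapy

class FilesSpider(scrapy.Spider):
    name = "files_demo"
    start_urls = ["http://example.com/downloads/"]

    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "downloaded_files",
    }

    def parse(self, response):
        hrefs = response.css("a::attr(href)").getall()
        # urljoin() resolves relative paths against the page URL and
        # leaves already-absolute URLs unchanged, covering both cases.
        yield {"file_urls": [response.urljoin(href) for href in hrefs]}
```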

How to add attributes to a request in a scrapy contract

Submitted by 假装没事ソ on 2020-01-14 03:12:32
Question: A Scrapy contract fails if we instantiate an Item or ItemLoader with the meta attribute or the Request() object passed from a previous parse method. I was thinking of maybe overriding ScrapesContract to preprocess the request and load some dummy values into request.meta, though I'm not sure whether that is good practice. I have seen the pre_process method in the docs (illustrated in the HasHeaderContract at the bottom) used to get attributes from the request object, but I'm not sure if it can be used to
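The text cuts off before the asker's conclusion. One hedged way to express the "dummy values" idea is a custom contract whose adjust_request_args() seeds request.meta before the test request is built; the contract name and meta keys below are illustrative:

```python
from scrapy.contracts import Contract

class WithMetaContract(Contract):
    """Use as `@with_meta` in a callback docstring to run `scrapy check`
    with dummy meta values in place of what a previous parse would pass."""

    name = "with_meta"

    def adjust_request_args(self, args):
        # args are the keyword arguments Scrapy uses to build the test
        # Request; seeding meta here makes it visible to the callback.
        meta = args.setdefault("meta", {})
        meta.setdefault("item", {"dummy": "value"})
        return args
```

Such a contract would be enabled through the SPIDER_CONTRACTS setting, e.g. `SPIDER_CONTRACTS = {"myproject.contracts.WithMetaContract": 10}`.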