scrapy

How to increase Scrapy crawling speed?

Submitted by 可紊 on 2020-01-14 14:08:05
Question: I am using Scrapy to crawl websites and extract data to a JSON file, but I've found that for some sites the crawler takes ages to crawl the complete website. My question is: how can I minimize the time taken to crawl?

Answer 1: Try tuning CONCURRENT_ITEMS, CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and other settings. For the full list of settings, see http://doc.scrapy.org/en/latest/topics/settings.html

Source: https://stackoverflow.com/questions/19109871/how-to-increase-scrapy-crawling
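As an illustration of the answer's advice, here is a minimal settings.py sketch; the values are illustrative starting points, not recommendations from the answer, and the defaults in the comments come from the Scrapy documentation:

```python
# settings.py -- illustrative values for the settings the answer names;
# tune against the target site's capacity and your politeness requirements.
CONCURRENT_REQUESTS = 32             # max concurrent requests overall (default: 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # max concurrent requests per domain (default: 8)
CONCURRENT_ITEMS = 200               # items processed in parallel per response (default: 100)
DOWNLOAD_DELAY = 0                   # any per-request delay slows the crawl (default: 0)
```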

Scrapy spider_idle signal not received in my extension

Submitted by ╄→гoц情女王★ on 2020-01-14 10:43:11
Question: I have common behaviour shared between several spiders when the spider_idle signal is received, and I would like to move this behaviour into an extension. My extension already listens for the spider_opened and spider_closed signals successfully. However, the spider_idle signal is not received. Here is my extension (edited for brevity):

```python
import logging
import MySQLdb
import MySQLdb.cursors
from scrapy import signals

logger = logging.getLogger(__name__)

class MyExtension(object):
    def __init__(self, settings,
```
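The snippet is cut off before the signal wiring, so the actual cause is not visible here. As a sketch of the usual pattern (assumed, not taken from the asker's code), spider_idle must get its own connect() call in from_crawler, exactly like the other signals:

```python
from scrapy import signals

class MyExtension(object):
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        # Listening for the two signals above does not subscribe the
        # extension to spider_idle; it needs an explicit connect() too.
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_opened(self, spider):
        spider.logger.info("opened %s", spider.name)

    def spider_closed(self, spider):
        spider.logger.info("closed %s", spider.name)

    def spider_idle(self, spider):
        # Raise scrapy.exceptions.DontCloseSpider here to keep the crawl alive.
        spider.logger.info("idle signal received for %s", spider.name)
```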

Python Crawler for Beginners (39): Getting Started with the JavaScript Rendering Service scrapy-splash

Submitted by 拈花ヽ惹草 on 2020-01-14 09:54:21
Life is short; I use Python. Previous posts in this series:

- Python Crawler for Beginners (1): Introduction
- Python Crawler for Beginners (2): Preparation (Part 1): Installing the Basic Libraries
- Python Crawler for Beginners (3): Preparation (Part 2): Linux Basics
- Python Crawler for Beginners (4): Preparation (Part 3): Docker Basics
- Python Crawler for Beginners (5): Preparation (Part 4): Database Basics
- Python Crawler for Beginners (6): Preparation (Part 5): Installing the Crawler Frameworks
- Python Crawler for Beginners (7): HTTP Basics
- Python Crawler for Beginners (8): Web Page Basics
- Python Crawler for Beginners (9): Crawler Basics
- Python Crawler for Beginners (10): Sessions and Cookies
- Python Crawler for Beginners (11): urllib Basics (Part 1)
- Python Crawler for Beginners (12): urllib Basics (Part 2)
- Python Crawler for Beginners (13): urllib Basics (Part 3)
- Python Crawler for Beginners (14): urllib Basics (Part 4)
- Python Crawler for Beginners (15): urllib Basics (Part 5)
- Python Crawler for Beginners (16): urllib in Practice: Scraping Meizitu
- Python Crawler for Beginners (17): Requests Basics
- Python Crawler for Beginners (18): Advanced Requests Usage
- Python Crawler for Beginners (19): XPath Basics
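The captured body above is only the series index; the post itself is a scrapy-splash primer. As a quick sketch of what first contact with scrapy-splash looks like (assuming a Splash instance at localhost:8050 and SPLASH_URL plus the downloader middlewares configured per the scrapy-splash README; the spider name and target URL are illustrative):

```python
import scrapy
from scrapy_splash import SplashRequest  # pip install scrapy-splash

class JsSpider(scrapy.Spider):
    name = "js_demo"

    def start_requests(self):
        # SplashRequest routes the page through Splash so that JavaScript
        # is executed before the response reaches the spider.
        yield SplashRequest(
            "http://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 1},  # give the page one second to render
        )

    def parse(self, response):
        for quote in response.css("div.quote span.text::text").getall():
            yield {"quote": quote}
```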

Scrapy is following and scraping non-allowed links

Submitted by 痞子三分冷 on 2020-01-14 08:58:54
Question: I have a CrawlSpider set up to follow certain links and scrape a news magazine, where the links to each issue follow this URL scheme: http://example.com/YYYY/DDDD/index.htm, where YYYY is the year and DDDD is the three- or four-digit issue number. I only want issues 928 onwards, and have my rules below. I don't have any problem connecting to the site, crawling links, or extracting items (so I didn't include the rest of my code). The spider seems determined to follow non-allowed links
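The asker's own rules are cut off above. As a hedged sketch of rules that extract only issues 928 and up (the spider name and callback are illustrative, and the regex is just one way to express the constraint; links that don't match `allow` are simply never extracted):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MagazineSpider(CrawlSpider):
    name = "magazine"  # illustrative name
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/"]

    rules = (
        # 92[89] matches 928-929, 9[3-9]\d matches 930-999,
        # \d{4} matches any four-digit issue number.
        Rule(
            LinkExtractor(allow=(r"/\d{4}/(?:92[89]|9[3-9]\d|\d{4})/index\.htm$",)),
            callback="parse_issue",
            follow=False,
        ),
    )

    def parse_issue(self, response):
        yield {"url": response.url}
```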

Scrapy: enabling files pipeline for absolute and relative paths?

Submitted by 人盡茶涼 on 2020-01-14 05:54:48
Question: What am I missing in my code (see the "Current Code" section below) that would enable me to download files from both absolute and relative paths using Scrapy? I appreciate the help. I'm feeling lost as to how all of these components work together and how to get the desired behavior. Background: I've used a combination of poring over the Scrapy docs, finding comparable examples on GitHub, and trawling StackOverflow for answers, but I can't get the Scrapy files pipeline to work in the way
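The asker's code is not included above, so the actual gap is unknown. One common approach (an assumption, not the asker's confirmed fix) is to normalize every extracted path with response.urljoin() before handing it to the files pipeline, since FilesPipeline expects absolute URLs in file_urls; the spider name and start URL here are illustrative:

```python
import scrapy

class FilesSpider(scrapy.Spider):
    name = "files_demo"
    start_urls = ["http://example.com/downloads/"]

    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "downloaded_files",
    }

    def parse(self, response):
        hrefs = response.css("a::attr(href)").getall()
        # urljoin() resolves relative paths against the page URL and
        # leaves already-absolute URLs unchanged, covering both cases.
        yield {"file_urls": [response.urljoin(href) for href in hrefs]}
```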

How to add attributes to a request in a scrapy contract

Submitted by 假装没事ソ on 2020-01-14 03:12:32
Question: A Scrapy contract fails if we instantiate an Item or ItemLoader with the meta attribute or the Request() object passed from a previous parse method. I was thinking of maybe overriding ScrapesContract to preprocess the request and load some dummy values into request.meta, though I'm not sure whether that is good practice. I have seen the pre_process method in the docs (illustrated in the HasHeaderContract at the bottom) used to get attributes from the request object, but I'm not sure if it can be used to
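The text cuts off before the asker's conclusion. One hedged way to express the "dummy values" idea is a custom contract whose adjust_request_args() seeds request.meta before the test request is built; the contract name and meta keys below are illustrative:

```python
from scrapy.contracts import Contract

class WithMetaContract(Contract):
    """Use as `@with_meta` in a callback docstring to run `scrapy check`
    with dummy meta values in place of what a previous parse would pass."""

    name = "with_meta"

    def adjust_request_args(self, args):
        # args are the keyword arguments Scrapy uses to build the test
        # Request; seeding meta here makes it visible to the callback.
        meta = args.setdefault("meta", {})
        meta.setdefault("item", {"dummy": "value"})
        return args
```

Such a contract would be enabled through the SPIDER_CONTRACTS setting, e.g. `SPIDER_CONTRACTS = {"myproject.contracts.WithMetaContract": 10}`.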