scrapy-spider

Items vs item loaders in Scrapy

Submitted by ぃ、小莉子 on 2019-12-03 03:54:56
Question: I'm pretty new to Scrapy. I know that items are used to hold scraped data, but I can't understand the difference between items and item loaders. I tried to read some example code; it used item loaders instead of items to store the data, and I can't understand why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when item loaders are used and what additional facilities they provide over items?

Answer 1: I really like the official …
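
Since the answer excerpt is cut off here, a minimal sketch of the plain-Item side of the comparison, with a hypothetical ProductItem and illustrative selectors:

    import scrapy

    class ProductItem(scrapy.Item):
        # A plain Item only declares fields; whatever you assign is stored as-is.
        name = scrapy.Field()
        price = scrapy.Field()

    class ProductSpider(scrapy.Spider):
        name = "products"
        start_urls = ["http://example.com/product"]

        def parse(self, response):
            # With plain Items, every value is extracted and cleaned by hand.
            item = ProductItem()
            item['name'] = (response.css('h1::text').get() or '').strip()
            item['price'] = response.css('.price::text').get()
            yield item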

Logging to a specific error log file in Scrapy

Submitted by 喜你入骨 on 2019-12-03 03:54:00
I am writing a Scrapy log like this:

    from scrapy import log

    class MySpider(BaseSpider):
        name = "myspider"

        def __init__(self, name=None, **kwargs):
            LOG_FILE = "logs/spider.log"
            log.log.defaultObserver = log.log.DefaultObserver()
            log.log.defaultObserver.start()
            log.started = False
            log.start(LOG_FILE, loglevel=log.INFO)
            super(MySpider, self).__init__(name, **kwargs)

        def parse(self, response):
            ....
            raise Exception("Something went wrong!")
            log.msg('Something went wrong!', log.ERROR)
            # Somehow write to a separate error log here.

Then I run the spider with scrapy crawl myspider. This would …
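
The scrapy.log / Twisted-observer API shown above was later deprecated in favour of the standard logging module; a sketch, under that assumption, of one way to get errors into their own file (file names are illustrative):

    import logging

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            # Scrapy routes its messages through the stdlib logging tree, so an
            # extra ERROR-level handler gives errors their own file alongside
            # whatever LOG_FILE in settings.py already captures.
            error_handler = logging.FileHandler("logs/spider_errors.log")
            error_handler.setLevel(logging.ERROR)
            logging.getLogger().addHandler(error_handler)

        def parse(self, response):
            self.logger.error("Something went wrong!")  # also lands in spider_errors.log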

Multiple inheritance in Scrapy spiders

Submitted by 喜夏-厌秋 on 2019-12-03 03:31:32
Is it possible to create a spider which inherits functionality from two base spiders, namely SitemapSpider and CrawlSpider? I have been trying to scrape data from various sites and realized that not all sites list every page of the website, hence the need to use CrawlSpider. But CrawlSpider goes through a lot of junk pages and is somewhat overkill. What I would like to do is something like this: start my spider, which is a subclass of SitemapSpider, and pass regex-matched responses to the parse_products method to extract useful information; then go to links matching the regex /reviews/ …
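
A sketch of the naive shape such a spider could take; because both base classes define start_requests(), the subclass has to reconcile them explicitly, and the URLs, patterns, and parse_products body below are placeholders:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule, SitemapSpider

    class HybridSpider(SitemapSpider, CrawlSpider):
        name = "hybrid"

        # Sitemap side: product pages listed in the sitemap go to parse_products.
        sitemap_urls = ["http://www.example.com/sitemap.xml"]
        sitemap_rules = [(r"/products/", "parse_products")]

        # CrawlSpider side: rule-based crawling from start_urls picks up pages
        # (e.g. /reviews/) that the sitemap does not list.
        start_urls = ["http://www.example.com/"]
        rules = (
            Rule(LinkExtractor(allow=(r"/reviews/",)), callback="parse_products", follow=True),
        )

        def start_requests(self):
            # Both bases define start_requests(); call each explicitly so sitemap
            # crawling and rule-based crawling run side by side.
            for request in SitemapSpider.start_requests(self):
                yield request
            for request in CrawlSpider.start_requests(self):
                yield request

        def parse_products(self, response):
            # Extraction logic would go here.
            pass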

How to prevent getting blacklisted while scraping Amazon [closed]

Submitted by ε祈祈猫儿з on 2019-12-03 00:46:06
I'm trying to scrape Amazon with Scrapy, but I get this error:

    DEBUG: Retrying <GET http://www.amazon.fr/Amuses-bouche-Peuvent-b%C3%A9n%C3%A9ficier-dAmazon-Premium-Epicerie/s?ie=UTF8&page=1&rh=n%3A6356734031%2Cp_76%3A437878031> (failed 1 times): 503 Service Unavailable

I think that's because Amazon is very good at detecting bots. How can I prevent this? I used time.sleep(6) before every request. I don't want to use their API. I have tried using Tor and Polipo.

Answer: You have to be very careful with Amazon and follow the Amazon Terms of Use and policies related to web scraping. Amazon is quite good at …
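
Throttling in Scrapy is normally configured through settings rather than time.sleep(); a sketch of the relevant settings.py entries with illustrative values (none of which changes the Terms-of-Use point made in the answer):

    # settings.py
    DOWNLOAD_DELAY = 6                   # Scrapy-aware equivalent of time.sleep(6)
    RANDOMIZE_DOWNLOAD_DELAY = True      # vary the delay so requests look less mechanical
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain
    AUTOTHROTTLE_ENABLED = True          # back off automatically when responses slow down
    AUTOTHROTTLE_START_DELAY = 5
    COOKIES_ENABLED = False
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # a realistic browser UA string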

Get a Scrapy spider to crawl an entire site

Submitted by 烂漫一生 on 2019-12-03 00:04:40
I am using Scrapy to crawl old sites that I own, and I am using the code below as my spider. I don't mind having a file output for each webpage, or a database with all the content in it. But I do need the spider to crawl the whole site without me having to enter every single URL, as I currently have to do.

    import scrapy

    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        allowed_domains = ["www.example.com"]
        start_urls = [
            "http://www.example.com/contactus"
        ]

        def parse(self, response):
            filename = response.url.split("/")[-2] + '.html'
            with open(filename, 'wb') as f:
                f …
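
A sketch of how the same spider could become a CrawlSpider that follows every in-domain link, so no URL list has to be maintained by hand; the rule and callback name are illustrative:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class DmozSpider(CrawlSpider):
        name = "dmoz"
        allowed_domains = ["www.example.com"]
        start_urls = ["http://www.example.com/"]

        # An empty LinkExtractor matches every link; allowed_domains keeps the
        # crawl on the site, and follow=True keeps it going page after page.
        rules = (
            Rule(LinkExtractor(), callback="parse_page", follow=True),
        )

        def parse_page(self, response):
            # Same per-page handling as the original parse(): one file per URL.
            filename = response.url.split("/")[-2] + ".html"
            with open(filename, "wb") as f:
                f.write(response.body)

Note that CrawlSpider reserves the parse() method for its own rule handling, which is why the callback is named parse_page here.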

Scrapy spider_idle signal - need to add requests with parse item callback

Submitted by 风流意气都作罢 on 2019-12-02 22:32:02
Question: In my Scrapy spider I have overridden the start_requests() method in order to retrieve some additional URLs from a database, representing items potentially missed in the crawl (orphaned items). This should happen at the end of the crawling process. Something like (pseudo code):

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, dont_filter=True)

        # attempt to crawl orphaned items
        db = MySQLdb.connect(host=self.settings['AWS_RDS_HOST'],
                             port=self.settings['AWS_RDS_PORT' …
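
A sketch of the spider_idle approach the title refers to: connect the signal, queue the extra requests in the handler, and raise DontCloseSpider so the spider stays open long enough to crawl them. get_orphaned_urls() stands in for the MySQL query above, parse_item() is a hypothetical item callback, and the two-argument engine.crawl() call matches Scrapy versions of that era:

    import scrapy
    from scrapy import Request, signals
    from scrapy.exceptions import DontCloseSpider

    class MySpider(scrapy.Spider):
        name = "myspider"
        start_urls = ["http://www.example.com/"]

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            return spider

        def spider_idle(self, spider):
            # Fires when the scheduler runs dry: queue the orphaned URLs and keep
            # the spider alive so they get crawled with the item callback.
            urls = self.get_orphaned_urls()
            if urls:
                for url in urls:
                    self.crawler.engine.crawl(Request(url, callback=self.parse_item), self)
                raise DontCloseSpider

        def get_orphaned_urls(self):
            # Placeholder for the database lookup sketched in the question.
            return []

        def parse_item(self, response):
            pass  # item extraction for the orphaned URLs

        def parse(self, response):
            pass  # normal crawl callback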

Items vs item loaders in Scrapy

Submitted by 前提是你 on 2019-12-02 19:33:13
I'm pretty new to Scrapy. I know that items are used to hold scraped data, but I can't understand the difference between items and item loaders. I tried to read some example code; it used item loaders instead of items to store the data, and I can't understand why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when item loaders are used and what additional facilities they provide over items?

Answer: I really like the official explanation in the docs: Item Loaders provide a convenient mechanism for populating scraped Items. Even though …
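
A sketch of how the same kind of extraction looks with an ItemLoader; the fields, processors, and selectors are illustrative, and the import path matches Scrapy 1.x:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()

    class ProductLoader(ItemLoader):
        # Input/output processors centralise the cleanup that would otherwise be
        # repeated in every callback that fills a ProductItem by hand.
        default_output_processor = TakeFirst()
        name_in = MapCompose(lambda value: value.strip())
        price_in = MapCompose(lambda value: value.replace('$', ''))

    class ProductSpider(scrapy.Spider):
        name = "products_loader"
        start_urls = ["http://example.com/product"]

        def parse(self, response):
            loader = ProductLoader(item=ProductItem(), response=response)
            loader.add_css('name', 'h1::text')
            loader.add_css('price', '.price::text')
            yield loader.load_item()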

ImportError: No module named win32api while using Scrapy

Submitted by 梦想的初衷 on 2019-12-02 19:14:05
I am new to Scrapy. I installed Python 2.7 and all the other components needed. Then I tried to build a Scrapy project following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html. In the crawling step, after I typed scrapy crawl dmoz, it generated this error message:

    ImportError: No module named win32api
    [twisted] CRITICAL: Unhandled error in deferred

I am using Windows.

Answer (Alfan Dinda Rahmawan): Try this: pip install pypiwin32. If you search a bit around the internet you will find the following documentation, which describes what you have to do to …

Scrapy: AttributeError: 'list' object has no attribute 'iteritems'

Submitted by 时间秒杀一切 on 2019-12-02 19:10:37
This is my first question on Stack Overflow. Recently I wanted to use linked-in-scraper, so I downloaded it, ran scrapy crawl linkedin.com, and got the error message below. For your information, I use Anaconda 2.3.0 and Python 2.7.11. All the related packages, including scrapy and six, were updated by pip before executing the program.

    Traceback (most recent call last):
      File "/Users/byeongsuyu/anaconda/bin/scrapy", line 11, in <module>
        sys.exit(execute())
      File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/cmdline.py", line 108, in execute
        settings = get_project_settings() …
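
The excerpt stops before the answer, but one common cause of this exact 'iteritems' failure inside get_project_settings() is a setting that newer Scrapy expects to be a dict, typically ITEM_PIPELINES, still being written as a list in an older project such as linked-in-scraper. A sketch of the change, with a hypothetical pipeline path:

    # settings.py
    # Old list style, rejected by newer Scrapy:
    # ITEM_PIPELINES = ['linkedin.pipelines.LinkedinPipeline']

    # Dict style expected by Scrapy >= 1.0; the value is the pipeline's order:
    ITEM_PIPELINES = {
        'linkedin.pipelines.LinkedinPipeline': 300,
    }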

Trouble renaming downloaded images in a customized manner through pipelines

Submitted by こ雲淡風輕ζ on 2019-12-02 18:39:52
Question: I've created a script using Python's scrapy module to download and rename movie images from a torrent site and store them in a folder within the Scrapy project. When I run my script as it is, it downloads the images into that folder without errors. At the moment the script renames those images using a convenient portion of request.url, through pipelines.py. How can I rename those downloaded images through pipelines.py using their movie names, taken from the variable movie defined within …
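
A sketch of how a custom ImagesPipeline can name files from item data instead of the URL; the image_urls and movie field names are assumptions taken from the question, and the method signatures match Scrapy versions before the item argument was added to file_path():

    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class MovieImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Carry the movie name along with each image request so that
            # file_path() can use it when choosing the filename.
            for url in item.get('image_urls', []):
                yield scrapy.Request(url, meta={'movie': item.get('movie')})

        def file_path(self, request, response=None, info=None):
            # The default names files after a hash of the URL; use the movie
            # name passed through request.meta instead.
            movie = request.meta.get('movie') or 'unknown'
            return 'full/%s.jpg' % movie

The custom pipeline would then replace the stock ImagesPipeline in the ITEM_PIPELINES setting.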