scrapy

Scrapy items are not JSON serializable while storing them to CouchDB

别等时光非礼了梦想. Submitted on 2020-01-02 08:01:49
Question: items.py classes:

    import scrapy
    from scrapy.item import Item, Field
    import json

    class Attributes(scrapy.Item):
        description = Field()
        pages = Field()
        author = Field()

    class Vendor(scrapy.Item):
        title = Field()
        order_url = Field()

    class bookItem(scrapy.Item):
        title = Field()
        url = Field()
        marketprice = Field()
        images = Field()
        price = Field()
        attributes = Field()
        vendor = Field()
        time_scraped = Field()

My scraper:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import
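A scrapy.Item is not directly JSON serializable, but it implements Python's Mapping interface, so it can be converted to plain dicts before being handed to CouchDB or json.dumps. A minimal, framework-free sketch (it only relies on the Mapping interface, so the same function flattens nested items such as Attributes and Vendor; the sample data stands in for a bookItem):

```python
import json
from collections.abc import Mapping

def to_plain(obj):
    """Recursively convert Mapping-like objects (including scrapy.Item,
    which implements the Mapping interface) into plain JSON-ready dicts."""
    if isinstance(obj, Mapping):
        return {key: to_plain(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(value) for value in obj]
    return obj

# Stand-in for a bookItem holding a nested Attributes item:
book = {"title": "Example", "attributes": {"author": "A. Writer", "pages": 123}}
print(json.dumps(to_plain(book), sort_keys=True))
# → {"attributes": {"author": "A. Writer", "pages": 123}, "title": "Example"}
```

When items are flat, dict(item) alone is enough; the recursion only matters because fields like attributes and vendor hold nested Items.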

Running multiple spiders using scrapyd

会有一股神秘感。 Submitted on 2020-01-02 07:24:07
Question: I had multiple spiders in my project, so I decided to run them by uploading them to a scrapyd server. I uploaded my project successfully, and I can see all the spiders when I run the command curl http://localhost:6800/listspiders.json?project=myproject. When I run the following command: curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2 only one spider runs, because only one spider is given, but I want to run multiple spiders here, so is the following command right for
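scrapyd's schedule.json starts exactly one spider per request, so the usual approach is simply one call per spider name. A sketch using only the standard library (the localhost endpoint and spider names follow the question; schedule_requests/schedule_all are helper names invented here):

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD = "http://localhost:6800"

def schedule_requests(project, spiders):
    """Build one schedule.json POST per spider, since scrapyd
    runs a single spider per schedule call."""
    return [
        Request(f"{SCRAPYD}/schedule.json",
                data=urlencode({"project": project, "spider": s}).encode())
        for s in spiders
    ]

def schedule_all(project, spiders):
    for req in schedule_requests(project, spiders):
        urlopen(req)  # fires the job; scrapyd answers with a jobid

# schedule_all("myproject", ["spider1", "spider2"])
```

The same loop can of course be done from the shell by repeating the curl command once per spider.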

scrapy: how to make my own scheduler middleware

自古美人都是妖i Submitted on 2020-01-02 06:54:33
Question: I am using Python 2.7 with Scrapy 0.20. My question: how do I build my own scheduler? What I have tried: I read around on the internet and found that I have to make my own Python class, assign it in the settings using SCHEDULER_MIDDLEWARES, and create that class, which maybe inherits from scrapy.core.scheduler. But I couldn't find any example on the internet, nor any official documentation. Answer 1: You can set the SCHEDULER setting: SCHEDULER = 'myproject.schedulers.MyScheduler' and copy the code from
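Note that SCHEDULER_MIDDLEWARES is not a real Scrapy setting; as the answer says, the scheduler is swapped wholesale via SCHEDULER. A framework-free sketch of the method names Scrapy's crawl engine expects from that class (a plain FIFO queue with no dupe filtering or disk queues; method names follow the scrapy.core.scheduler contract):

```python
from collections import deque

class MyScheduler:
    """Minimal sketch of the interface behind the SCHEDULER setting."""

    def __init__(self):
        self.queue = deque()

    @classmethod
    def from_crawler(cls, crawler):
        # Real schedulers read settings and dupefilter config from `crawler`
        return cls()

    def open(self, spider):
        self.spider = spider

    def close(self, reason):
        self.queue.clear()

    def enqueue_request(self, request):
        self.queue.append(request)  # no duplicate filtering in this sketch
        return True                 # False would mean "request dropped"

    def next_request(self):
        return self.queue.popleft() if self.queue else None

    def has_pending_requests(self):
        return len(self.queue) > 0

    def __len__(self):
        return len(self.queue)
```

With this saved as myproject/schedulers.py, the SCHEDULER setting from the answer points straight at it.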

Why does scrapyd throw: “'FeedExporter' object has no attribute 'slot'” exception?

雨燕双飞 Submitted on 2020-01-02 06:24:18
Question: I came across a situation where my scrapy code works fine when used from the command line, but when I use the same spider after deploying (scrapy-deploy) and scheduling it with the scrapyd API, it throws errors in the "scrapy.extensions.feedexport.FeedExporter" class: first while handling the "open_spider" signal, second while handling the "item_scraped" signal, and last while handling the "close_spider" signal. 1. "open_spider" signal error:

    2016-05-14 12:09:38 [scrapy] INFO: Spider opened
    2016-05-14 12:09:38
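A commonly reported trigger for this AttributeError is the feed export never being fully initialized when the spider runs under scrapyd, for example because no feed URI is available in that environment. One hedged thing to try, assuming the spider is meant to write a feed: pass the feed settings explicitly when scheduling (scrapyd's schedule.json accepts a setting parameter; the output path below is a placeholder):

```shell
# Schedule the spider and hand it an explicit feed URI/format,
# so FeedExporter can open its slot (path is a placeholder):
curl http://localhost:6800/schedule.json \
    -d project=myproject \
    -d spider=myspider \
    -d setting=FEED_URI=/tmp/myspider-items.json \
    -d setting=FEED_FORMAT=json
```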

Scrapy not crawling all the pages

房东的猫 Submitted on 2020-01-02 05:59:17
Question: I am trying to crawl sites in a very basic manner, but Scrapy isn't crawling all the links. I will explain the scenario as follows:

    main_page.html -> contains links to a_page.html, b_page.html, c_page.html
    a_page.html -> contains links to a1_page.html, a2_page.html
    b_page.html -> contains links to b1_page.html, b2_page.html
    c_page.html -> contains links to c1_page.html, c2_page.html
    a1_page.html -> contains link to b_page.html
    a2_page.html -> contains link to c_page.html
    b1_page.html ->
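One thing worth ruling out before blaming the spider: Scrapy's built-in duplicate filter visits each URL only once, so pages reachable by several routes (a1_page links back to b_page) are crawled a single time, which can look like "missing" crawls in per-page logs. A framework-free sketch of the same link graph shows that a breadth-first crawl with a dupefilter-style seen set still covers every page exactly once:

```python
from collections import deque

# The link graph from the question (page -> pages it links to)
LINKS = {
    "main_page": ["a_page", "b_page", "c_page"],
    "a_page": ["a1_page", "a2_page"],
    "b_page": ["b1_page", "b2_page"],
    "c_page": ["c1_page", "c2_page"],
    "a1_page": ["b_page"],
    "a2_page": ["c_page"],
}

def crawl(start):
    """Breadth-first crawl with a seen set, mirroring how Scrapy
    visits each URL once and silently drops duplicate requests."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:  # a1_page -> b_page is filtered, not re-crawled
                seen.add(link)
                queue.append(link)
    return order

print(crawl("main_page"))
```

If pages are genuinely never visited, the usual Scrapy suspects are rules without follow=True, an allowed_domains mismatch, or links the extractor's defaults exclude.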

Scrapy shell Error

☆樱花仙子☆ Submitted on 2020-01-02 05:28:35
Question: I am a newbie to Scrapy and going through the tutorials. I ran this command and got an error: C:\Users\Sandra\Anaconda>scrapy shell 'http://scrapy.org' In particular, what is this URLError: <urlopen error [Errno 10051] A socket operation was attempted to an unreachable network>? Full error message:

    2015-08-20 23:35:08 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
    2015-08-20 23:35:08 [scrapy] INFO: Optional features available: ssl, http11, boto
    2015-08-20 23:35:08 [scrapy] INFO:
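One Windows-specific gotcha in this exact command: cmd.exe does not strip single quotes, so Scrapy receives 'http://scrapy.org' with the literal quote characters, which is not a valid URL; scrapy shell "http://scrapy.org" with double quotes avoids that (the network-unreachable errno can of course also come from a proxy or firewall). A small demonstration of why the single-quoted argument breaks:

```python
from urllib.parse import urlparse

# What Scrapy sees on Windows cmd when single quotes are used:
windows_arg = "'http://scrapy.org'"
print(repr(urlparse(windows_arg).scheme))  # the leading quote invalidates the scheme

# With double quotes (which cmd strips), Scrapy gets a clean URL:
clean_arg = "http://scrapy.org"
print(repr(urlparse(clean_arg).scheme))    # 'http'
```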

twisted critical unhandled error on scrapy tutorial

南楼画角 Submitted on 2020-01-02 04:57:06
Question: I'm new to programming and I'm trying to learn Scrapy, using the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html So I ran the "scrapy crawl dmoz" command and got this error:

    2015-07-14 16:11:02 [scrapy] INFO: Scrapy 1.0.1 started (bot: tutorial)
    2015-07-14 16:11:02 [scrapy] INFO: Optional features available: ssl, http11
    2015-07-14 16:11:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}

How to get scraped items from main script using scrapy?

陌路散爱 Submitted on 2020-01-02 02:51:30
Question: I hope to get a list of scraped items in the main script instead of using the scrapy shell. I know there is a method parse in the class FooSpider I define, and this method returns a list of Item. The Scrapy framework calls this method. But how can I get this returned list myself? I found many posts about that, but I didn't understand what they were saying. For context, I put the official example code here:

    import scrapy
    from tutorial.items import DmozItem

    class DmozSpider(scrapy.Spider):
        name = "dmoz"

error in deploying a project using scrapyd

╄→尐↘猪︶ㄣ Submitted on 2020-01-02 02:31:46
Question: I had multiple spiders in my project folder and wanted to run all the spiders at once, so I decided to run them using the scrapyd service. I started doing this after reading the guide here. First of all, I am in the current project folder. I opened the scrapy.cfg file and uncommented the url line after [deploy]. I ran the scrapy server command, which works fine, and the scrapyd server runs. I tried this command: scrapy deploy -l Result: default http://localhost:6800/ When I tried this command: scrapy deploy -L scrapyd
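For what it's worth, in later versions the deploy command moved out of Scrapy core into the separate scrapyd-client package, so scrapy deploy no longer exists there. A hedged sketch of the equivalent modern flow (the "default" target comes from scrapy.cfg; the project name is a placeholder):

```shell
pip install scrapyd-client           # provides the scrapyd-deploy command
scrapyd-deploy -l                    # list deploy targets configured in scrapy.cfg
scrapyd-deploy default -p myproject  # build an egg and upload it to the scrapyd server
```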

port web scraper, scrapy 0.24, to Python 3, or scrap scrapy for something better

荒凉一梦 Submitted on 2020-01-01 19:20:13
Question: I'm trying to use scrapy to make a web scraper, but I'm running into many problems since it uses Python 2. Is it possible to run the 2to3 command on all the files in the tarball simultaneously? Would that cause unforeseen errors? Is there an alternative web scraper framework, more up to date and more functional, that might be recommended instead? I say that because there doesn't seem to be much recent activity on forums about the problems inherent in running version 0.24 of scrapy, i.e. the
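On the mechanical question: 2to3 (bundled with CPython up to 3.12) accepts a directory and rewrites every .py file under it recursively, so unpacking the tarball first is enough. That said, Scrapy itself gained Python 3 support from version 1.1 onward, so upgrading is usually saner than converting 0.24. A sketch, with a hypothetical tarball name:

```shell
tar xzf Scrapy-0.24.6.tar.gz   # unpack the source tarball (name assumed)
2to3 -w -n Scrapy-0.24.6/      # -w writes fixes in place, -n skips .bak backups; recurses over the tree
```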