XPath Error - Spider error processing

Submitted anonymously (unverified) on 2019-12-03 09:05:37

Question:

So I am building this spider, and the XPath side seems fine: I can open the Scrapy shell, step through the HTML page, and test my XPath queries there.

I am not sure what I am doing wrong; any help would be appreciated. I have reinstalled Twisted, but that changed nothing.

My spider looks like this:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from spider_scrap.items import spiderItem

    class spider(BaseSpider):
        name = "spider1"
        #allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com"
        ]

    def parse(self, response):
        items = []
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*[@id="search_results"]/div[1]/div')

        for site in sites:
            item = spiderItem()
            item['title'] = site.select('div[2]/h2/a/text()').extract
            item['author'] = site.select('div[2]/span/a/text()').extract
            item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
            items.append(item)
        return items

When I run the spider with scrapy crawl Spider1, I get the following error:

    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Enabled item pipelines:
    2012-09-25 17:56:12-0400 [Spider1] INFO: Spider opened
    2012-09-25 17:56:12-0400 [Spider1] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2012-09-25 17:56:12-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2012-09-25 17:56:15-0400 [Spider1] DEBUG: Crawled (200) <GET http://www.example.com> (referer: None)
    2012-09-25 17:56:15-0400 [Spider1] ERROR: Spider error processing <GET http://www.example.com>
        Traceback (most recent call last):
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1178, in mainLoop
            self.runUntilCurrent()
          File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 800, in runUntilCurrent
            call.func(*call.args, **call.kw)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 368, in callback
            self._startRunCallbacks(result)
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 464, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 551, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
            raise NotImplementedError
        exceptions.NotImplementedError:
    2012-09-25 17:56:15-0400 [Spider1] INFO: Closing spider (finished)
    2012-09-25 17:56:15-0400 [Spider1] INFO: Dumping spider stats:
        {'downloader/request_bytes': 231,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 186965,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 9, 25, 21, 56, 15, 326000),
         'scheduler/memory_enqueued': 1,
         'spider_exceptions/NotImplementedError': 1,
         'start_time': datetime.datetime(2012, 9, 25, 21, 56, 12, 157000)}
    2012-09-25 17:56:15-0400 [Spider1] INFO: Spider closed (finished)
    2012-09-25 17:56:15-0400 [scrapy] INFO: Dumping global stats:
        {}
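The telling frame is the last one: NotImplementedError is raised inside Scrapy's own spider.py, in a method named parse. In other words, Scrapy invoked the base class's default callback rather than the parse defined above, which happens whenever a BaseSpider subclass doesn't actually provide one. Judging from the traceback, the default in that Scrapy version is essentially a stub along these lines (a paraphrase, not the actual source):

    # roughly what scrapy/spider.py line 62 does (paraphrased from the traceback)
    class BaseSpider(object):
        def parse(self, response):
            # subclasses are expected to override this callback
            raise NotImplementedError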

Answer 1:

For everyone who runs into this problem: make sure you didn't rename the parse() method, like I did:

    class CakeSpider(CrawlSpider):
        name            = "cakes"
        allowed_domains = ["cakes.com"]
        start_urls      = ["http://www.cakes.com/catalog"]

        def parse(self, response):  # this should be 'parse' and nothing else
            #yourcode#

Otherwise it throws the same error:

    ...
    File "C:\Python27\lib\site-packages\scrapy\spider.py", line 62, in parse
        raise NotImplementedError
    exceptions.NotImplementedError:

I spent about three hours trying to figure that out. -.-
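If you genuinely need a differently named callback, you can wire it up explicitly instead of relying on the default. A minimal sketch against the same BaseSpider-era API; the callback name parse_catalog is just an illustrative choice, not anything Scrapy requires:

    from scrapy.spider import BaseSpider
    from scrapy.http import Request

    class CakeSpider(BaseSpider):
        name = "cakes"
        allowed_domains = ["cakes.com"]
        start_urls = ["http://www.cakes.com/catalog"]

        def start_requests(self):
            # overriding start_requests lets every request name its own
            # callback, so the default parse() is never consulted
            for url in self.start_urls:
                yield Request(url, callback=self.parse_catalog)

        def parse_catalog(self, response):  # hypothetical custom name
            pass  # your parsing code here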



Answer 2:

Leo is right: the indentation is not correct. You probably have tabs and spaces mixed together in your script, because you pasted some code and typed the rest yourself, and your editor allowed both tabs and spaces in the same file. Convert all tabs to spaces so the file looks more like this:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from spider_scrap.items import spiderItem

    class spider(BaseSpider):
        name = "spider1"
        start_urls = ["http://www.example.com"]

        def parse(self, response):
            items = []
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//*[@id="search_results"]/div[1]/div')

            for site in sites:
                item = spiderItem()
                # note: extract() has to be called; without the parentheses
                # you store the bound method instead of the selected text
                item['title'] = site.select('div[2]/h2/a/text()').extract()
                item['author'] = site.select('div[2]/span/a/text()').extract()
                item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
                items.append(item)

            return items
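Python 2 can also catch this for you: running the script with python -tt turns inconsistent tab usage into an error. Alternatively, a tiny hand-rolled check along these lines (the filename spider.py is a placeholder) points at the offending lines:

    # flag lines whose leading whitespace mixes tabs and spaces
    with open('spider.py') as f:  # placeholder filename
        for lineno, line in enumerate(f, 1):
            indent = line[:len(line) - len(line.lstrip())]
            if '\t' in indent and ' ' in indent:
                print('mixed tabs/spaces on line %d' % lineno)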


Answer 3:

Your parse method is outside the class body; use the code below instead:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from spider_scrap.items import spiderItem

    class spider(BaseSpider):
        name = "spider1"
        allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com"
        ]

        def parse(self, response):
            items = []
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//*[@id="search_results"]/div[1]/div')

            for site in sites:
                item = spiderItem()
                item['title'] = site.select('div[2]/h2/a/text()').extract()
                item['author'] = site.select('div[2]/span/a/text()').extract()
                item['price'] = site.select('div[3]/div[1]/div[1]/div/b/text()').extract()
                items.append(item)
            return items
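After moving parse() back inside the class, the Scrapy shell the asker already used is a quick way to confirm the selectors still match; in Scrapy versions of that era the shell pre-binds an hxs selector for the fetched page:

    $ scrapy shell http://www.example.com
    >>> hxs.select('//*[@id="search_results"]/div[1]/div')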

