scrapy

parse xpath from xml file should contain '

痞子三分冷 submitted on 2021-02-11 15:55:21
问题 This is my XML file:

```xml
<Item name="Date" xpath='p[@class="date"]/text()' defaultValue="Date Not Found"></Item>
```

I parse it like this:

```python
self.doc = etree.parse(xmlFile)
masterItemsFromXML = self.doc.findall('MasterPage/MasterItems/Item')
for oneItem in masterItemsFromXML:
    print 'master item xpath = {0}'.format(oneItem.attrib['xpath'])
```

and I can see the result printed in the cmd like this:

```
master item xpath = p[@class="date"]/text()
```

My problem: the xpath is not valid because it should start with ' and
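Worth noting here: the single quotes around the xpath attribute are XML delimiters, not part of the attribute value, so lxml is returning the value correctly and the parsed string is already a usable XPath expression. A minimal sketch demonstrating this, using a hypothetical inline document:

```python
from lxml import etree

# Hypothetical document mirroring the structure in the question.
xml = b'''<MasterPage><MasterItems>
  <Item name="Date" xpath='p[@class="date"]/text()' defaultValue="Date Not Found"/>
</MasterItems></MasterPage>'''

doc = etree.fromstring(xml)
for item in doc.findall('MasterItems/Item'):
    xpath = item.attrib['xpath']
    print('xpath value as parsed: %r' % xpath)  # p[@class="date"]/text()

# The parsed value works as a relative XPath as-is, no extra quotes needed:
div = etree.fromstring('<div><p class="date">2021-02-11</p></div>')
print(div.xpath(xpath))  # ['2021-02-11']
```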

CrawlSpider seems not to follow rules

柔情痞子 submitted on 2021-02-11 14:32:22
问题 Here's my code. I followed the example in "Recursively Scraping Web Pages With Scrapy", and it seems I have made a mistake somewhere. Can someone help me find it, please? It's driving me crazy: I only want all the results from all the result pages, but instead it gives me only the results from page 1. Here's my code:

```python
import scrapy
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors.sgml
```
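A common cause of this symptom, for reference: CrawlSpider dispatches its rules from the built-in parse() method, so defining your own parse() disables rule-following entirely. A minimal sketch of a rule-driven pagination spider, with a hypothetical domain and selectors:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # current replacement for the old sgml extractor

class ResultsSpider(CrawlSpider):
    name = 'results'
    allowed_domains = ['example.com']                    # hypothetical domain
    start_urls = ['https://example.com/results?page=1']  # hypothetical listing URL

    # follow=True tells CrawlSpider to keep applying the rule to every page
    # it visits, so pagination links found on page 2 lead to page 3, and so on.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next-page"]'),
             callback='parse_page', follow=True),
    )

    # Note the callback name: CrawlSpider implements parse() itself to drive
    # the rules, so overriding parse() silently turns the rules off.
    def parse_page(self, response):
        for row in response.xpath('//div[@class="result"]'):
            yield {'title': row.xpath('.//h2/text()').get()}
```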

Unicode issue in Scrapy (Python)

…衆ロ難τιáo~ submitted on 2021-02-11 14:16:48
问题 For two hours I have been searching for this topic, and I have tried a lot of solutions, but none worked in my case. Here's the code first:

```python
import scrapy

class HamburgSpider(scrapy.Spider):
    name = 'hamburg'
    #allowed_domains = ['https://www.hamburg.de']
    start_urls = ['https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/']
    custom_settings = {
        'FEED_EXPORT_FORMAT': 'utf-8'
    }

    def parse(self, response):
        #response = response.body.encode('utf-8')
        items = response.xpath("//div[starts-with(@class, 'item')]")
```
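One detail stands out: 'FEED_EXPORT_FORMAT' is not the setting that controls encoding; Scrapy's feed exporter reads FEED_EXPORT_ENCODING. If the goal is unescaped non-ASCII characters in the exported file, a sketch along these lines (the yielded field is a hypothetical example):

```python
import scrapy

class HamburgSpider(scrapy.Spider):
    name = 'hamburg'
    start_urls = ['https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/']

    # FEED_EXPORT_ENCODING is the setting the feed exporter actually reads;
    # without it, JSON feeds escape non-ASCII characters as \uXXXX sequences.
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
    }

    def parse(self, response):
        for item in response.xpath("//div[starts-with(@class, 'item')]"):
            # hypothetical field extraction
            yield {'name': item.xpath('.//h3//text()').get()}
```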

How to yield in Scrapy without a request?

梦想的初衷 submitted on 2021-02-11 13:39:00
问题 I am trying to crawl a defined list of URLs with Scrapy 2.4, where each of those URLs can have up to 5 paginated URLs that I want to follow. The system works, but there is one extra request I want to get rid of: these pages are exactly the same but have different URLs:

```
example.html
example.html?pn=1
```

Somewhere in my code I make this extra request, and I cannot figure out how to suppress it. This is the working code. Define a bunch of URLs to scrape:

```python
start_urls = [
    'https://example...',
```
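Since the base URL and ?pn=1 serve the same page, one way to drop the duplicate is to parse the start page's items directly and only request pages 2 and up. A minimal sketch with a hypothetical URL and selectors:

```python
import scrapy

class PagedSpider(scrapy.Spider):
    name = 'paged'
    start_urls = ['https://example.com/list.html']  # hypothetical URL

    def parse(self, response):
        # The start URL already serves page 1, so extract its items here
        # instead of requesting the identical ...?pn=1 again.
        yield from self.parse_items(response)

        # Request only the genuinely new pages (2..5).
        for pn in range(2, 6):
            yield response.follow(f'{response.url}?pn={pn}',
                                  callback=self.parse_items)

    def parse_items(self, response):
        for row in response.css('div.entry'):  # hypothetical selector
            yield {'title': row.css('h2::text').get()}
```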

Scrapy not giving any output

旧时模样 submitted on 2021-02-11 06:22:02
问题 I was following this link, and I was able to run a BaseSpider successfully. However, when I tried the same with a CrawlSpider, I was not getting any output. My spider is as follows:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import request
from scrapy.selector import HtmlXPathSelector
from medsynergies.items import MedsynergiesItem

class medsynergiesspider(CrawlSpider):
    name="medsynergies"
    allowed
```
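Besides the parse() naming pitfall noted under the previous question, the scrapy.contrib.* import paths used here were deprecated and later removed. A sketch of the same skeleton on current imports; the rule pattern and yielded fields are guesses, not the original code:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MedsynergiesSpider(CrawlSpider):
    name = 'medsynergies'
    allowed_domains = ['medsynergies.com']          # hypothetical
    start_urls = ['https://www.medsynergies.com/']  # hypothetical

    rules = (
        # The callback must not be named 'parse': CrawlSpider defines parse()
        # internally, and overriding it is the classic cause of a CrawlSpider
        # that runs but yields no output.
        Rule(LinkExtractor(allow=r'/careers/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```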

Implementing your own scrapyd service

社会主义新天地 submitted on 2021-02-10 23:53:33
问题 I want to create my own service for the scrapyd API, which should return a little more information about the running crawler. I am stuck at the very beginning: where should I place the module that will contain that service? If we look at the default "scrapyd.conf", it has a section called services:

```ini
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions
```
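In general the module only needs to be importable by the interpreter that runs scrapyd, e.g. an installed package or a module on PYTHONPATH. A minimal sketch of a custom endpoint, assuming scrapyd's WsResource base class; the attribute names on self.root mirror those used by scrapyd's built-in services and may vary between scrapyd versions:

```python
# mysite/webservices.py -- hypothetical module name
from scrapyd.webservice import WsResource

class RunningDetails(WsResource):
    def render_GET(self, txrequest):
        # self.root is scrapyd's application root; its launcher tracks the
        # crawl processes that are currently running.
        running = [
            {'project': p.project, 'spider': p.spider, 'job': p.job}
            for p in self.root.launcher.processes.values()
        ]
        return {'status': 'ok', 'running': running}
```

It is then registered under [services] in your scrapyd.conf, whose entries are read on top of the defaults:

```ini
[services]
runningdetails.json = mysite.webservices.RunningDetails
```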

Completely Understanding Scrapy Middleware (Part 3)

我怕爱的太早我们不能终老 submitted on 2021-02-10 18:48:03
The previous two articles introduced the use of downloader middleware; this article covers spider middleware.

Spider middleware

Spider middleware is used in much the same way as downloader middleware; the two differ only in what they act on. Downloader middleware acts on requests and responses; spider middleware acts on the spiders themselves, that is, on the files written under the spiders folder. The relationship between the two is easy to see in Scrapy's data-flow diagram, where steps 4 and 5 are the downloader middleware and steps 6 and 7 are the spider middleware. Spider middleware is invoked in the following situations:

- When execution reaches yield scrapy.Request() or yield item, the spider middleware's process_spider_output() method is called.
- When the spider's own code raises an Exception, the spider middleware's process_spider_exception() method is called.
- Before any of the spider's callbacks parse_xxx() is invoked, the spider middleware's process_spider_input() method is called.
- When execution reaches start_requests(), the spider middleware's process_start_requests() method is called.

Handling the spider's own exceptions in middleware

Exceptions raised by the spider itself can be handled inside spider middleware. For example, write a spider that crawls the UA practice page http:/
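To make the exception hook concrete, a minimal sketch of a spider middleware that logs and swallows callback exceptions; this is an illustration, not the article's own code:

```python
# middlewares.py -- a sketch of the process_spider_exception hook
class ExceptionHandlingMiddleware:
    def process_spider_exception(self, response, exception, spider):
        # Called when a parse_xxx() callback raises. Returning an iterable of
        # requests/items swallows the exception; returning None passes it on
        # to the next spider middleware.
        spider.logger.warning('callback failed on %s: %r',
                              response.url, exception)
        return []  # swallow the exception and let the crawl continue
```

It is enabled like any other spider middleware, via the settings (module path hypothetical):

```python
# settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.ExceptionHandlingMiddleware': 543,
}
```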

Scrapy - Creating nested JSON Object

二次信任 submitted on 2021-02-10 18:16:13
问题 I'm learning how to work with Scrapy while refreshing my knowledge of Python/coding from school. Currently I'm playing around with the IMDb top 250 list, but I'm struggling with the JSON output file. My current code is:

```python
# -*- coding: utf-8 -*-
import scrapy
from top250imdb.items import Top250ImdbItem

class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse
```
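On the nesting question itself: Scrapy's JSON feed exporter serializes whatever structure the spider yields, so putting lists and dicts inside an item value is enough to get nested JSON objects. A minimal sketch with hypothetical selectors and field names (the chart markup may have changed since the question was asked):

```python
import scrapy

class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['imdb.com']
    start_urls = ['http://www.imdb.com/chart/top']

    def parse(self, response):
        for row in response.css('td.titleColumn'):
            # Plain dicts and lists nest naturally: the JSON exporter writes
            # them out as nested objects and arrays.
            credits = row.css('a::attr(title)').get() or ''
            yield {
                'title': row.css('a::text').get(),
                'cast': [{'name': name.strip()} for name in credits.split(',')],
            }
```

Running `scrapy crawl movies -o movies.json` would then produce one nested object per movie, e.g. {"title": "...", "cast": [{"name": "..."}, ...]}.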