scrapy

parse xpath from xml file should contain '

痞子三分冷 submitted on 2021-02-11 15:55:21
问题 This is my XML file:

```xml
<Item name="Date" xpath='p[@class="date"]/text()' defaultValue="Date Not Found"></Item>
```

I parse it like this:

```python
self.doc = etree.parse(xmlFile)
masterItemsFromXML = self.doc.findall('MasterPage/MasterItems/Item')
for oneItem in masterItemsFromXML:
    print 'master item xpath = {0}'.format(oneItem.attrib['xpath'])
```

and I can see the result printed in the cmd like this:

```
master item xpath = p[@class="date"]/text()
```

My problem: the xpath is not valid because it should start with ' and
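Worth noting here: the single quotes around the xpath attribute are XML delimiters, not part of the attribute value, so lxml is returning the value correctly and the parsed string is already a usable XPath expression. A minimal sketch demonstrating this, using a hypothetical inline document:

```python
from lxml import etree

# Hypothetical document mirroring the structure in the question.
xml = b'''<MasterPage><MasterItems>
  <Item name="Date" xpath='p[@class="date"]/text()' defaultValue="Date Not Found"/>
</MasterItems></MasterPage>'''

doc = etree.fromstring(xml)
for item in doc.findall('MasterItems/Item'):
    xpath = item.attrib['xpath']
    print('xpath value as parsed: %r' % xpath)  # p[@class="date"]/text()

# The parsed value works as a relative XPath as-is, no extra quotes needed:
div = etree.fromstring('<div><p class="date">2021-02-11</p></div>')
print(div.xpath(xpath))  # ['2021-02-11']
```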

CrawlSpider seems not to follow rules

柔情痞子 submitted on 2021-02-11 14:32:22
问题 Here's my code. I followed the example in "Recursively Scraping Web Pages With Scrapy", and it seems I have made a mistake somewhere. Can someone help me find it, please? It's driving me crazy: I only want all the results from all the result pages, but instead it gives me only the results from page 1. Here's my code:

```python
import scrapy
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.request import Request
from scrapy.contrib.linkextractors.sgml
```
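A common cause of this symptom, for reference: CrawlSpider dispatches its rules from the built-in parse() method, so defining your own parse() disables rule-following entirely. A minimal sketch of a rule-driven pagination spider, with a hypothetical domain and selectors:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # current replacement for the old sgml extractor

class ResultsSpider(CrawlSpider):
    name = 'results'
    allowed_domains = ['example.com']                    # hypothetical domain
    start_urls = ['https://example.com/results?page=1']  # hypothetical listing URL

    # follow=True tells CrawlSpider to keep applying the rule to every page
    # it visits, so pagination links found on page 2 lead to page 3, and so on.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next-page"]'),
             callback='parse_page', follow=True),
    )

    # Note the callback name: CrawlSpider implements parse() itself to drive
    # the rules, so overriding parse() silently turns the rules off.
    def parse_page(self, response):
        for row in response.xpath('//div[@class="result"]'):
            yield {'title': row.xpath('.//h2/text()').get()}
```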

Unicode issue in Scrapy (Python)

…衆ロ難τιáo~ submitted on 2021-02-11 14:16:48
问题 For two hours I have been searching for this topic, and I have tried a lot of solutions, but none worked in my case. Here's the code first:

```python
import scrapy

class HamburgSpider(scrapy.Spider):
    name = 'hamburg'
    #allowed_domains = ['https://www.hamburg.de']
    start_urls = ['https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/']
    custom_settings = {
        'FEED_EXPORT_FORMAT': 'utf-8'
    }

    def parse(self, response):
        #response = response.body.encode('utf-8')
        items = response.xpath("//div[starts-with(@class, 'item')]")
```
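One detail stands out: 'FEED_EXPORT_FORMAT' is not the setting that controls encoding; Scrapy's feed exporter reads FEED_EXPORT_ENCODING. If the goal is unescaped non-ASCII characters in the exported file, a sketch along these lines (the yielded field is a hypothetical example):

```python
import scrapy

class HamburgSpider(scrapy.Spider):
    name = 'hamburg'
    start_urls = ['https://www.hamburg.de/branchenbuch/hamburg/10239785/n0/']

    # FEED_EXPORT_ENCODING is the setting the feed exporter actually reads;
    # without it, JSON feeds escape non-ASCII characters as \uXXXX sequences.
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
    }

    def parse(self, response):
        for item in response.xpath("//div[starts-with(@class, 'item')]"):
            # hypothetical field extraction
            yield {'name': item.xpath('.//h3//text()').get()}
```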

How to yield in Scrapy without a request?

梦想的初衷 submitted on 2021-02-11 13:39:00
问题 I am trying to crawl a defined list of URLs with Scrapy 2.4, where each of those URLs can have up to 5 paginated URLs that I want to follow. The system works, but there is one extra request I want to get rid of: these pages are exactly the same but have different URLs:

```
example.html
example.html?pn=1
```

Somewhere in my code I make this extra request, and I cannot figure out how to suppress it. This is the working code. Define a bunch of URLs to scrape:

```python
start_urls = [
    'https://example...',
```
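Since the base URL and ?pn=1 serve the same page, one way to drop the duplicate is to parse the start page's items directly and only request pages 2 and up. A minimal sketch with a hypothetical URL and selectors:

```python
import scrapy

class PagedSpider(scrapy.Spider):
    name = 'paged'
    start_urls = ['https://example.com/list.html']  # hypothetical URL

    def parse(self, response):
        # The start URL already serves page 1, so extract its items here
        # instead of requesting the identical ...?pn=1 again.
        yield from self.parse_items(response)

        # Request only the genuinely new pages (2..5).
        for pn in range(2, 6):
            yield response.follow(f'{response.url}?pn={pn}',
                                  callback=self.parse_items)

    def parse_items(self, response):
        for row in response.css('div.entry'):  # hypothetical selector
            yield {'title': row.css('h2::text').get()}
```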

Scrapy not giving any output

旧时模样 submitted on 2021-02-11 06:22:02
问题 I was following this link, and I was able to run a BaseSpider successfully. However, when I tried the same with a CrawlSpider, I was not getting any output. My spider is as follows:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import request
from scrapy.selector import HtmlXPathSelector
from medsynergies.items import MedsynergiesItem

class medsynergiesspider(CrawlSpider):
    name="medsynergies"
    allowed
```
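Besides the parse() naming pitfall noted under the previous question, the scrapy.contrib.* import paths used here were deprecated and later removed. A sketch of the same skeleton on current imports; the rule pattern and yielded fields are guesses, not the original code:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MedsynergiesSpider(CrawlSpider):
    name = 'medsynergies'
    allowed_domains = ['medsynergies.com']          # hypothetical
    start_urls = ['https://www.medsynergies.com/']  # hypothetical

    rules = (
        # The callback must not be named 'parse': CrawlSpider defines parse()
        # internally, and overriding it is the classic cause of a CrawlSpider
        # that runs but yields no output.
        Rule(LinkExtractor(allow=r'/careers/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```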

Implementing your own scrapyd service

社会主义新天地 submitted on 2021-02-10 23:53:33
问题 I want to create my own service for the scrapyd API, which should return a little more information about the running crawler. I am stuck at the very beginning: where should I place the module that will contain that service? If we look at the default "scrapyd.conf", it has a section called services:

```ini
[services]
schedule.json = scrapyd.webservice.Schedule
cancel.json = scrapyd.webservice.Cancel
addversion.json = scrapyd.webservice.AddVersion
listprojects.json = scrapyd.webservice.ListProjects
listversions
```
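In general the module only needs to be importable by the interpreter that runs scrapyd, e.g. an installed package or a module on PYTHONPATH. A minimal sketch of a custom endpoint, assuming scrapyd's WsResource base class; the attribute names on self.root mirror those used by scrapyd's built-in services and may vary between scrapyd versions:

```python
# mysite/webservices.py -- hypothetical module name
from scrapyd.webservice import WsResource

class RunningDetails(WsResource):
    def render_GET(self, txrequest):
        # self.root is scrapyd's application root; its launcher tracks the
        # crawl processes that are currently running.
        running = [
            {'project': p.project, 'spider': p.spider, 'job': p.job}
            for p in self.root.launcher.processes.values()
        ]
        return {'status': 'ok', 'running': running}
```

It is then registered under [services] in your scrapyd.conf, whose entries are read on top of the defaults:

```ini
[services]
runningdetails.json = mysite.webservices.RunningDetails
```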

Completely Understanding Scrapy Middleware (Part 3)

我怕爱的太早我们不能终老 submitted on 2021-02-10 18:48:03
The previous two articles introduced the use of downloader middleware; this article covers spider middleware.

Spider middleware

Spider middleware is used in much the same way as downloader middleware; the two differ only in what they act on. Downloader middleware acts on requests and responses; spider middleware acts on the spiders themselves, that is, on the files written under the spiders folder. The relationship between the two is easy to see in Scrapy's data-flow diagram, where steps 4 and 5 are the downloader middleware and steps 6 and 7 are the spider middleware. Spider middleware is invoked in the following situations:

- When execution reaches yield scrapy.Request() or yield item, the spider middleware's process_spider_output() method is called.
- When the spider's own code raises an Exception, the spider middleware's process_spider_exception() method is called.
- Before any of the spider's callbacks parse_xxx() is invoked, the spider middleware's process_spider_input() method is called.
- When execution reaches start_requests(), the spider middleware's process_start_requests() method is called.

Handling the spider's own exceptions in middleware

Exceptions raised by the spider itself can be handled inside spider middleware. For example, write a spider that crawls the UA practice page http:/
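To make the exception hook concrete, a minimal sketch of a spider middleware that logs and swallows callback exceptions; this is an illustration, not the article's own code:

```python
# middlewares.py -- a sketch of the process_spider_exception hook
class ExceptionHandlingMiddleware:
    def process_spider_exception(self, response, exception, spider):
        # Called when a parse_xxx() callback raises. Returning an iterable of
        # requests/items swallows the exception; returning None passes it on
        # to the next spider middleware.
        spider.logger.warning('callback failed on %s: %r',
                              response.url, exception)
        return []  # swallow the exception and let the crawl continue
```

It is enabled like any other spider middleware, via the settings (module path hypothetical):

```python
# settings.py
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.ExceptionHandlingMiddleware': 543,
}
```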

Scrapy - Creating nested JSON Object

二次信任 submitted on 2021-02-10 18:16:13
问题 I'm learning how to work with Scrapy while refreshing my knowledge of Python/coding from school. Currently I'm playing around with the IMDb top 250 list, but I'm struggling with the JSON output file. My current code is:

```python
# -*- coding: utf-8 -*-
import scrapy
from top250imdb.items import Top250ImdbItem

class ActorsSpider(scrapy.Spider):
    name = "actors"
    allowed_domains = ["imdb.com"]
    start_urls = ['http://www.imdb.com/chart/top']

    # Parsing each movie and preparing the url for the actors list
    def parse
```
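On the nesting question itself: Scrapy's JSON feed exporter serializes whatever structure the spider yields, so putting lists and dicts inside an item value is enough to get nested JSON objects. A minimal sketch with hypothetical selectors and field names (the chart markup may have changed since the question was asked):

```python
import scrapy

class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['imdb.com']
    start_urls = ['http://www.imdb.com/chart/top']

    def parse(self, response):
        for row in response.css('td.titleColumn'):
            # Plain dicts and lists nest naturally: the JSON exporter writes
            # them out as nested objects and arrays.
            credits = row.css('a::attr(title)').get() or ''
            yield {
                'title': row.css('a::text').get(),
                'cast': [{'name': name.strip()} for name in credits.split(',')],
            }
```

Running `scrapy crawl movies -o movies.json` would then produce one nested object per movie, e.g. {"title": "...", "cast": [{"name": "..."}, ...]}.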