scrapy-spider

Use Scrapy to crawl local XML file - Start URL local file address

余生颓废 submitted on 2019-12-06 10:50:46
Question: I want to crawl a local XML file located in my Downloads folder with Scrapy and use XPath to extract the relevant information. Using the Scrapy intro as a guide:

2016-01-24 12:38:53 [scrapy] DEBUG: Retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 2 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml'
2016-01-24 12:38:53 [scrapy] DEBUG: Gave up retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 3 times): [Errno 2] No
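The error hints that the URL is missing a slash after file://, so "home" is read as a host and the path collapses to '/sayth/...'. A minimal sketch, assuming the fix is simply the three-slash form of the file URL (the XPath shown is a placeholder, not from the question):

```python
import scrapy


class LocalXmlSpider(scrapy.Spider):
    name = "local_xml"
    # file:// URLs need three slashes: the scheme, then the absolute path
    start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]

    def parse(self, response):
        # placeholder XPath: yield every non-empty text node to prove the file parses
        for text in response.xpath("//*/text()").getall():
            if text.strip():
                yield {"text": text.strip()}
```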

Scrapy FormRequest, trying to send a POST request (FormRequest) with currency change formdata

妖精的绣舞 submitted on 2019-12-06 10:09:04
Question: I've been trying to scrape the following website, but with the currency changed to 'SAR' from the upper-left settings form. I tried sending a Scrapy request like this:

r = Request(url='https://www.mooda.com/en/', cookies=[{'name': 'currency', 'value': 'SAR', 'domain': '.www.mooda.com', 'path': '/'}, {'name': 'country', 'value': 'SA', 'domain': '.www.mooda.com', 'path': '/'}], dont_filter=True)

and I still get the price as EG:

In [10]: response.css('.price').xpath('text()').extract()
Out[10]: [u'1,957
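Since the question title mentions FormRequest, one approach worth sketching is actually submitting the currency settings form so the server sets the session itself, rather than hand-crafting cookies. This is a hedged sketch: the form field name "currency" and the use of the first form on the page are assumptions, not taken from mooda.com's real markup.

```python
import scrapy
from scrapy import FormRequest


class MoodaSpider(scrapy.Spider):
    name = "mooda"
    start_urls = ["https://www.mooda.com/en/"]

    def parse(self, response):
        # Submit the currency form first; the cookies it sets should make
        # subsequent pages render prices in SAR (field name is an assumption).
        yield FormRequest.from_response(
            response,
            formdata={"currency": "SAR"},
            callback=self.after_currency,
            dont_filter=True,
        )

    def after_currency(self, response):
        yield {"prices": response.css(".price::text").getall()}
```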

How to recursively crawl subpages with Scrapy

一曲冷凌霜 submitted on 2019-12-06 05:58:49
So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

Category 1 name
  Subcategory 1 name
    data from this subcategory's page
  Subcategory n name
    data from this page
Category n name
  Subcategory 1 name
    data from subcategory n's page

etc. Eventually I want to be able to use this data with ElasticSearch. I
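A common pattern for this kind of nested crawl is to carry the partially built record through request.meta and only yield it at the deepest level. A rough sketch, where the start URL and the selectors (.category a, .subcategory a, .content) are placeholders standing in for whatever the real pages use:

```python
import scrapy


class CategorySpider(scrapy.Spider):
    name = "categories"
    start_urls = ["http://example.com/categories"]  # placeholder URL

    def parse(self, response):
        for link in response.css(".category a"):  # placeholder selector
            yield response.follow(
                link,
                callback=self.parse_category,
                meta={"category": link.css("::text").get()},
            )

    def parse_category(self, response):
        for link in response.css(".subcategory a"):  # placeholder selector
            yield response.follow(
                link,
                callback=self.parse_subcategory,
                meta={
                    "category": response.meta["category"],
                    "subcategory": link.css("::text").get(),
                },
            )

    def parse_subcategory(self, response):
        # yield one record per subcategory page; export with -o output.json
        yield {
            "category": response.meta["category"],
            "subcategory": response.meta["subcategory"],
            "data": " ".join(response.css(".content ::text").getall()),
        }
```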

Why does my CrawlerProcess not have the function “crawl”?

扶醉桌前 submitted on 2019-12-06 05:25:00
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import BackpageItem, CityvibeItem
from scrapy.shell import inspect_response
import re
import time
import sys

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']
    # Set last_page to decide how many pages are crawled
    last_page = 10
    start_urls = ['http://www.example.com/washington/?page=%s' % page for page in xrange(1,last_page)]
    rules = (
        # Follow all links inside <div class="cat"> and calls
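The excerpt mixes the long-removed scrapy.contrib import paths (and Python 2's xrange) with CrawlerProcess, which suggests a version mismatch. In current Scrapy releases, CrawlerProcess.crawl(spider_class) is the supported way to run a spider from a script. A minimal sketch with modern import paths; the spider body is a stand-in, not the questioner's full code:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule        # modern path for CrawlSpider
from scrapy.linkextractors import LinkExtractor      # modern path for LinkExtractor


class MySpider(CrawlSpider):
    name = "example"
    allowed_domains = ["www.example.com"]
    last_page = 10
    # range() instead of the Python 2-only xrange()
    start_urls = ["http://www.example.com/washington/?page=%s" % page
                  for page in range(1, last_page)]
    rules = (
        Rule(LinkExtractor(restrict_css="div.cat"), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {"url": response.url}


if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(MySpider)   # pass the spider class, not an instance
    process.start()
```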

Write functions for all scrapy spiders

三世轮回 submitted on 2019-12-06 04:38:13
So I'm trying to write functions that can be called upon from all Scrapy spiders. Is there one place in my project where I can just define these functions, or do I need to import them in each spider? Thanks

You can't implicitly import code (at least not without hacking around) in Python; after all, explicit is better than implicit, so it's not a good idea. However, in Scrapy it's very common to have a base Spider class with common functions and methods. Let's assume you have this tree:

├── myproject
│   ├── __init__.py
│   ├── spiders
│   │   ├── __init__.py
│   │   ├── spider1.py
│   │   ├── spider2.py
├── scrapy
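Building on the answer's suggestion, one way to lay this out is a base.py next to the spiders holding the shared methods, which every spider then inherits. A small sketch; the file names and the parse_price helper are illustrative, not from the question:

```python
# myproject/spiders/base.py
import scrapy


class BaseSpider(scrapy.Spider):
    """Common helpers shared by every spider in the project."""

    def parse_price(self, text):
        # example of a shared utility available to all subclasses
        return float(text.strip().replace(",", ""))


# myproject/spiders/spider1.py
from myproject.spiders.base import BaseSpider


class Spider1(BaseSpider):
    name = "spider1"
    start_urls = ["http://example.com"]  # placeholder

    def parse(self, response):
        yield {"price": self.parse_price(response.css(".price::text").get())}
```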

Scrapy + Splash + ScrapyJS

倖福魔咒の submitted on 2019-12-06 02:19:24
Question: I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1 and I'm still not able to render JavaScript with a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf I am still getting the page without the phone number rendered:

class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        script = """
        function main
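Revealing the phone number usually means clicking the element inside a Splash Lua script and returning the HTML after a wait. A hedged sketch using the scrapy-splash package's execute endpoint with a reasonably recent Splash; the CSS selectors for the ad links and the reveal button are guesses, not taken from olx.pt's actual markup:

```python
import scrapy
from scrapy_splash import SplashRequest  # assumes scrapy-splash is installed and configured

LUA_CLICK = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)
    -- click the element that reveals the phone number (selector is a guess)
    local button = splash:select('.spoiler')
    if button then
        button:mouse_click()
        splash:wait(1)
    end
    return splash:html()
end
"""


class OlxSpider(scrapy.Spider):
    name = "olx"
    allowed_domains = ["olx.pt"]
    start_urls = ["https://olx.pt/imoveis/"]

    def parse(self, response):
        for href in response.css("a.detailsLink::attr(href)").getall():  # selector is a guess
            yield SplashRequest(href, self.parse_ad,
                                endpoint="execute",
                                args={"lua_source": LUA_CLICK})

    def parse_ad(self, response):
        # the returned HTML should now contain the revealed phone number
        yield {"html_length": len(response.text)}
```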

Pass extra values along with urls to scrapy spider

女生的网名这么多〃 submitted on 2019-12-06 01:26:54
Question: I have a list of tuples in the form (id, url). I need to crawl a product from a list of URLs, and when those products are crawled I need to store them in a database under their id. The problem is I can't understand how to pass the id to the parse function so that I can store the crawled item under its id.

Answer 1: Initialize the start URLs in start_requests() and pass the id in meta:

class MySpider(Spider):
    mapping = [(1, 'my_url1'), (2, 'my_url2')]
    ...

    def start_requests(self):
        for id, url in self.mapping:
            yield Request
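The answer's snippet is cut off; a sketch of the complete pattern it describes is below, with placeholder URLs and a placeholder CSS selector. The id travels in request.meta and is read back from response.meta in the callback:

```python
from scrapy import Spider, Request


class MySpider(Spider):
    name = "products"
    mapping = [(1, "http://example.com/product1"),   # (id, url) pairs; URLs are placeholders
               (2, "http://example.com/product2")]

    def start_requests(self):
        for id_, url in self.mapping:
            # carry the id alongside the request so parse() can see it
            yield Request(url, callback=self.parse, meta={"id": id_})

    def parse(self, response):
        yield {
            "id": response.meta["id"],                # the id arrives with the response
            "name": response.css("h1::text").get(),   # placeholder selector
        }
```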

CrawlSpider with Splash getting stuck after first URL

主宰稳场 submitted on 2019-12-05 21:31:37
I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_url responses to feed my crawl spider. Unfortunately, my crawl spider stops after rendering the first response. Any idea what is going wrong?

class VideoSpider(CrawlSpider):
    start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_items', process_request="use_splash",),
    )

    def use_splash(self, request):
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {
                'wait': 0.5,
            }
        }
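As far as the truncated snippet shows, use_splash modifies request.meta but never returns the request. A Rule's process_request callable must return a Request; if it returns None, the extracted links are dropped, which by itself would make the crawl stop after the start URL. A hedged sketch of that part, assuming the Splash middleware is already configured in settings:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class VideoSpider(CrawlSpider):
    name = "video"
    start_urls = ["https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2"]

    rules = (
        Rule(LinkExtractor(allow=()),
             callback="parse_items",
             process_request="use_splash"),
    )

    # note: in Scrapy >= 2.0 this callable also receives the response as a second argument
    def use_splash(self, request):
        request.meta["splash"] = {
            "endpoint": "render.html",
            "args": {"wait": 0.5},
        }
        return request  # must return the request, otherwise it is silently dropped

    def parse_items(self, response):
        yield {"url": response.url}
```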

Difference between BaseSpider and CrawlSpider

我的未来我决定 submitted on 2019-12-05 20:16:23
Question: I have been trying to understand the concept of using BaseSpider and CrawlSpider in web scraping. I have read the docs, but there is no mention of BaseSpider. It would be really helpful if someone explained the differences between BaseSpider and CrawlSpider.

Answer 1: BaseSpider is something that existed before and is now deprecated (since 0.22) - use scrapy.Spider instead:

import scrapy

class MySpider(scrapy.Spider):
    # ...

scrapy.Spider is the simplest spider that would, basically, visit the
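To illustrate the distinction the answer is drawing: scrapy.Spider (the replacement for BaseSpider) only visits what you explicitly request, while CrawlSpider adds rule-based link following on top. A small contrast sketch with placeholder URLs and link patterns:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PlainSpider(scrapy.Spider):
    """Visits only start_urls, plus whatever requests parse() yields itself."""
    name = "plain"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


class RuleSpider(CrawlSpider):
    """Extracts and follows links automatically according to its rules."""
    name = "rules"
    start_urls = ["http://example.com/"]
    rules = (
        Rule(LinkExtractor(allow=r"/items/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}
```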

Scrapy - How to extract all blog posts from a category?

拈花ヽ惹草 submitted on 2019-12-05 19:42:28
I am using Scrapy to extract all the posts of my blog. The problem is I cannot figure out how to create a rule that reads all the posts in any given blog category. Example: on my blog, the category "Environment setup" has 17 posts. So in the Scrapy code I can hard-code it as given, but this is not a very practical approach:

start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
            "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
            "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]

I have read
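Rather than hard-coding each page, a CrawlSpider rule can follow the category's /page/N/ links and hand individual posts to a callback. A sketch under the assumption that the WordPress pagination links are plain anchors on the category pages; the post URL pattern and title selector are assumptions:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BlogSpider(CrawlSpider):
    name = "blog"
    allowed_domains = ["edumine.wordpress.com"]
    start_urls = [
        "https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
    ]
    rules = (
        # follow the numbered pagination links inside this category
        Rule(LinkExtractor(
                allow=r"/category/ide-configuration/environment-setup/page/\d+/"),
             follow=True),
        # links that look like dated WordPress posts go to the item callback
        Rule(LinkExtractor(allow=r"/\d{4}/\d{2}/\d{2}/"), callback="parse_post"),
    )

    def parse_post(self, response):
        yield {
            "title": response.css("h1.entry-title::text").get(),  # selector is an assumption
            "url": response.url,
        }
```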