scrapy-spider

Use Scrapy to crawl local XML file - Start URL local file address

余生颓废 submitted on 2019-12-06 10:50:46
Question: I want to crawl a local XML file located in my Downloads folder with Scrapy and use XPath to extract the relevant information. Using the Scrapy intro as a guide:

2016-01-24 12:38:53 [scrapy] DEBUG: Retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 2 times): [Errno 2] No such file or directory: '/sayth/Downloads/20160123RAND0.xml'
2016-01-24 12:38:53 [scrapy] DEBUG: Gave up retrying <GET file://home/sayth/Downloads/20160123RAND0.xml> (failed 3 times): [Errno 2] No
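The error hints that the URL is missing a slash after file://, so "home" is read as a host and the path collapses to '/sayth/...'. A minimal sketch, assuming the fix is simply the three-slash form of the file URL (the XPath shown is a placeholder, not from the question):

```python
import scrapy


class LocalXmlSpider(scrapy.Spider):
    name = "local_xml"
    # file:// URLs need three slashes: the scheme, then the absolute path
    start_urls = ["file:///home/sayth/Downloads/20160123RAND0.xml"]

    def parse(self, response):
        # placeholder XPath: yield every non-empty text node to prove the file parses
        for text in response.xpath("//*/text()").getall():
            if text.strip():
                yield {"text": text.strip()}
```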

Scrapy FormRequest, trying to send a POST request (FormRequest) with currency change formdata

妖精的绣舞 submitted on 2019-12-06 10:09:04
Question: I've been trying to scrape the following website, but with the currency changed to 'SAR' from the upper-left settings form. I tried sending a Scrapy request like this:

r = Request(url='https://www.mooda.com/en/', cookies=[{'name': 'currency', 'value': 'SAR', 'domain': '.www.mooda.com', 'path': '/'}, {'name': 'country', 'value': 'SA', 'domain': '.www.mooda.com', 'path': '/'}], dont_filter=True)

and I still get the price as EG:

In [10]: response.css('.price').xpath('text()').extract()
Out[10]: [u'1,957
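Since the question title mentions FormRequest, one approach worth sketching is actually submitting the currency settings form so the server sets the session itself, rather than hand-crafting cookies. This is a hedged sketch: the form field name "currency" and the use of the first form on the page are assumptions, not taken from mooda.com's real markup.

```python
import scrapy
from scrapy import FormRequest


class MoodaSpider(scrapy.Spider):
    name = "mooda"
    start_urls = ["https://www.mooda.com/en/"]

    def parse(self, response):
        # Submit the currency form first; the cookies it sets should make
        # subsequent pages render prices in SAR (field name is an assumption).
        yield FormRequest.from_response(
            response,
            formdata={"currency": "SAR"},
            callback=self.after_currency,
            dont_filter=True,
        )

    def after_currency(self, response):
        yield {"prices": response.css(".price::text").getall()}
```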

How to recursively crawl subpages with Scrapy

一曲冷凌霜 submitted on 2019-12-06 05:58:49
So basically I am trying to crawl a page with a set of categories, scrape the name of each category, follow the sublink associated with each category to a page with a set of subcategories, scrape their names, and then follow each subcategory to its associated page and retrieve text data. At the end I want to output a JSON file formatted somewhat like:

Category 1 name
  Subcategory 1 name
    data from this subcategory's page
  Subcategory n name
    data from this page
Category n name
  Subcategory 1 name
    data from subcategory n's page

etc. Eventually I want to be able to use this data with ElasticSearch. I
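A common pattern for this kind of nested crawl is to carry the partially built record through request.meta and only yield it at the deepest level. A rough sketch, where the start URL and the selectors (.category a, .subcategory a, .content) are placeholders standing in for whatever the real pages use:

```python
import scrapy


class CategorySpider(scrapy.Spider):
    name = "categories"
    start_urls = ["http://example.com/categories"]  # placeholder URL

    def parse(self, response):
        for link in response.css(".category a"):  # placeholder selector
            yield response.follow(
                link,
                callback=self.parse_category,
                meta={"category": link.css("::text").get()},
            )

    def parse_category(self, response):
        for link in response.css(".subcategory a"):  # placeholder selector
            yield response.follow(
                link,
                callback=self.parse_subcategory,
                meta={
                    "category": response.meta["category"],
                    "subcategory": link.css("::text").get(),
                },
            )

    def parse_subcategory(self, response):
        # yield one record per subcategory page; export with -o output.json
        yield {
            "category": response.meta["category"],
            "subcategory": response.meta["subcategory"],
            "data": " ".join(response.css(".content ::text").getall()),
        }
```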

Why does my CrawlerProcess not have the function “crawl”?

扶醉桌前 submitted on 2019-12-06 05:25:00
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from items import BackpageItem, CityvibeItem
from scrapy.shell import inspect_response
import re
import time
import sys

class MySpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']
    # Set last_page to decide how many pages are crawled
    last_page = 10
    start_urls = ['http://www.example.com/washington/?page=%s' % page for page in xrange(1,last_page)]
    rules = (
        # Follow all links inside <div class="cat"> and calls
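The excerpt mixes the long-removed scrapy.contrib import paths (and Python 2's xrange) with CrawlerProcess, which suggests a version mismatch. In current Scrapy releases, CrawlerProcess.crawl(spider_class) is the supported way to run a spider from a script. A minimal sketch with modern import paths; the spider body is a stand-in, not the questioner's full code:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import CrawlSpider, Rule        # modern path for CrawlSpider
from scrapy.linkextractors import LinkExtractor      # modern path for LinkExtractor


class MySpider(CrawlSpider):
    name = "example"
    allowed_domains = ["www.example.com"]
    last_page = 10
    # range() instead of the Python 2-only xrange()
    start_urls = ["http://www.example.com/washington/?page=%s" % page
                  for page in range(1, last_page)]
    rules = (
        Rule(LinkExtractor(restrict_css="div.cat"), callback="parse_item"),
    )

    def parse_item(self, response):
        yield {"url": response.url}


if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(MySpider)   # pass the spider class, not an instance
    process.start()
```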

Write functions for all scrapy spiders

三世轮回 submitted on 2019-12-06 04:38:13
So I'm trying to write functions that can be called upon from all Scrapy spiders. Is there one place in my project where I can just define these functions, or do I need to import them in each spider? Thanks

You can't implicitly import code (at least not without hacking around) in Python; after all, explicit is better than implicit, so it's not a good idea. However, in Scrapy it's very common to have a base Spider class with common functions and methods. Let's assume you have this tree:

├── myproject
│   ├── __init__.py
│   ├── spiders
│   │   ├── __init__.py
│   │   ├── spider1.py
│   │   ├── spider2.py
├── scrapy
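Building on the answer's suggestion, one way to lay this out is a base.py next to the spiders holding the shared methods, which every spider then inherits. A small sketch; the file names and the parse_price helper are illustrative, not from the question:

```python
# myproject/spiders/base.py
import scrapy


class BaseSpider(scrapy.Spider):
    """Common helpers shared by every spider in the project."""

    def parse_price(self, text):
        # example of a shared utility available to all subclasses
        return float(text.strip().replace(",", ""))


# myproject/spiders/spider1.py
from myproject.spiders.base import BaseSpider


class Spider1(BaseSpider):
    name = "spider1"
    start_urls = ["http://example.com"]  # placeholder

    def parse(self, response):
        yield {"price": self.parse_price(response.css(".price::text").get())}
```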

Scrapy + Splash + ScrapyJS

倖福魔咒の submitted on 2019-12-06 02:19:24
Question: I am using Splash 2.0.2 + Scrapy 1.0.5 + Scrapyjs 0.1.1 and I'm still not able to render JavaScript with a click. Here is an example URL: https://olx.pt/anuncio/loja-nova-com-250m2-garagem-em-box-fechada-para-arrumos-IDyTzAT.html#c49d3d94cf I am still getting the page without the phone number rendered:

class OlxSpider(scrapy.Spider):
    name = "olx"
    rotate_user_agent = True
    allowed_domains = ["olx.pt"]
    start_urls = [
        "https://olx.pt/imoveis/"
    ]

    def parse(self, response):
        script = """
        function main
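Revealing the phone number usually means clicking the element inside a Splash Lua script and returning the HTML after a wait. A hedged sketch using the scrapy-splash package's execute endpoint with a reasonably recent Splash; the CSS selectors for the ad links and the reveal button are guesses, not taken from olx.pt's actual markup:

```python
import scrapy
from scrapy_splash import SplashRequest  # assumes scrapy-splash is installed and configured

LUA_CLICK = """
function main(splash)
    splash:go(splash.args.url)
    splash:wait(1)
    -- click the element that reveals the phone number (selector is a guess)
    local button = splash:select('.spoiler')
    if button then
        button:mouse_click()
        splash:wait(1)
    end
    return splash:html()
end
"""


class OlxSpider(scrapy.Spider):
    name = "olx"
    allowed_domains = ["olx.pt"]
    start_urls = ["https://olx.pt/imoveis/"]

    def parse(self, response):
        for href in response.css("a.detailsLink::attr(href)").getall():  # selector is a guess
            yield SplashRequest(href, self.parse_ad,
                                endpoint="execute",
                                args={"lua_source": LUA_CLICK})

    def parse_ad(self, response):
        # the returned HTML should now contain the revealed phone number
        yield {"html_length": len(response.text)}
```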

Pass extra values along with urls to scrapy spider

女生的网名这么多〃 submitted on 2019-12-06 01:26:54
Question: I have a list of tuples in the form (id, url). I need to crawl a product from a list of URLs, and when those products are crawled I need to store them in a database under their id. The problem is I can't understand how to pass the id to the parse function so that I can store the crawled item under its id.

Answer 1: Initialize the start URLs in start_requests() and pass the id in meta:

class MySpider(Spider):
    mapping = [(1, 'my_url1'), (2, 'my_url2')]
    ...

    def start_requests(self):
        for id, url in self.mapping:
            yield Request
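The answer's snippet is cut off; a sketch of the complete pattern it describes is below, with placeholder URLs and a placeholder CSS selector. The id travels in request.meta and is read back from response.meta in the callback:

```python
from scrapy import Spider, Request


class MySpider(Spider):
    name = "products"
    mapping = [(1, "http://example.com/product1"),   # (id, url) pairs; URLs are placeholders
               (2, "http://example.com/product2")]

    def start_requests(self):
        for id_, url in self.mapping:
            # carry the id alongside the request so parse() can see it
            yield Request(url, callback=self.parse, meta={"id": id_})

    def parse(self, response):
        yield {
            "id": response.meta["id"],                # the id arrives with the response
            "name": response.css("h1::text").get(),   # placeholder selector
        }
```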

CrawlSpider with Splash getting stuck after first URL

主宰稳场 submitted on 2019-12-05 21:31:37
I'm writing a Scrapy spider where I need to render some of the responses with Splash. My spider is based on CrawlSpider. I need to render my start_url responses to feed my crawl spider. Unfortunately, my crawl spider stops after rendering the first response. Any idea what is going wrong?

class VideoSpider(CrawlSpider):
    start_urls = ['https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2']
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_items', process_request="use_splash",),
    )

    def use_splash(self, request):
        request.meta['splash'] = {
            'endpoint': 'render.html',
            'args': {
                'wait': 0.5,
            }
        }
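As far as the truncated snippet shows, use_splash modifies request.meta but never returns the request. A Rule's process_request callable must return a Request; if it returns None, the extracted links are dropped, which by itself would make the crawl stop after the start URL. A hedged sketch of that part, assuming the Splash middleware is already configured in settings:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class VideoSpider(CrawlSpider):
    name = "video"
    start_urls = ["https://juke.com/de/de/search?q=1+Mord+f%C3%BCr+2"]

    rules = (
        Rule(LinkExtractor(allow=()),
             callback="parse_items",
             process_request="use_splash"),
    )

    # note: in Scrapy >= 2.0 this callable also receives the response as a second argument
    def use_splash(self, request):
        request.meta["splash"] = {
            "endpoint": "render.html",
            "args": {"wait": 0.5},
        }
        return request  # must return the request, otherwise it is silently dropped

    def parse_items(self, response):
        yield {"url": response.url}
```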

Difference between BaseSpider and CrawlSpider

我的未来我决定 submitted on 2019-12-05 20:16:23
Question: I have been trying to understand the concept of using BaseSpider and CrawlSpider in web scraping. I have read the docs, but there is no mention of BaseSpider. It would be really helpful if someone explained the differences between BaseSpider and CrawlSpider.

Answer 1: BaseSpider is something that existed before and is now deprecated (since 0.22) - use scrapy.Spider instead:

import scrapy

class MySpider(scrapy.Spider):
    # ...

scrapy.Spider is the simplest spider that would, basically, visit the
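To illustrate the distinction the answer is drawing: scrapy.Spider (the replacement for BaseSpider) only visits what you explicitly request, while CrawlSpider adds rule-based link following on top. A small contrast sketch with placeholder URLs and link patterns:

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class PlainSpider(scrapy.Spider):
    """Visits only start_urls, plus whatever requests parse() yields itself."""
    name = "plain"
    start_urls = ["http://example.com/"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


class RuleSpider(CrawlSpider):
    """Extracts and follows links automatically according to its rules."""
    name = "rules"
    start_urls = ["http://example.com/"]
    rules = (
        Rule(LinkExtractor(allow=r"/items/"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}
```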

Scrapy - How to extract all blog posts from a category?

拈花ヽ惹草 submitted on 2019-12-05 19:42:28
I am using Scrapy to extract all the posts of my blog. The problem is I cannot figure out how to create a rule that reads all the posts in any given blog category. Example: on my blog, the category "Environment setup" has 17 posts. So in the Scrapy code I can hard-code it as given, but this is not a very practical approach:

start_urls=["https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
            "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/2/",
            "https://edumine.wordpress.com/category/ide-configuration/environment-setup/page/3/"]

I have read
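Rather than hard-coding each page, a CrawlSpider rule can follow the category's /page/N/ links and hand individual posts to a callback. A sketch under the assumption that the WordPress pagination links are plain anchors on the category pages; the post URL pattern and title selector are assumptions:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BlogSpider(CrawlSpider):
    name = "blog"
    allowed_domains = ["edumine.wordpress.com"]
    start_urls = [
        "https://edumine.wordpress.com/category/ide-configuration/environment-setup/",
    ]
    rules = (
        # follow the numbered pagination links inside this category
        Rule(LinkExtractor(
                allow=r"/category/ide-configuration/environment-setup/page/\d+/"),
             follow=True),
        # links that look like dated WordPress posts go to the item callback
        Rule(LinkExtractor(allow=r"/\d{4}/\d{2}/\d{2}/"), callback="parse_post"),
    )

    def parse_post(self, response):
        yield {
            "title": response.css("h1.entry-title::text").get(),  # selector is an assumption
            "url": response.url,
        }
```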