scrapy

Relative URL to absolute URL Scrapy

瘦欲@ submitted on 2020-01-01 08:38:35
Question: I need help converting relative URLs to absolute URLs in a Scrapy spider. I need to convert the links on my start pages to absolute URLs to get the images of the crawled items, which are on the start pages. I unsuccessfully tried different ways to achieve this and I'm stuck. Any suggestion? class ExampleSpider(scrapy.Spider): name = "example" allowed_domains = ["example.com"] start_urls = [ "http://www.example.com/billboard", "http://www.example.com/billboard?page=1" ] def parse(self, response): …
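One common way to resolve relative links is response.urljoin(), which joins a relative href against the page's own URL (response.follow() does the join implicitly when yielding new requests). A minimal sketch follows; the XPath and the yielded field name are illustrative placeholders, not the asker's actual markup:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1",
    ]

    def parse(self, response):
        # Placeholder selector: adjust it to the real page structure.
        for src in response.xpath('//img/@src').getall():
            # urljoin() resolves a relative URL against response.url.
            yield {"image_url": response.urljoin(src)}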

Is it possible for Scrapy to get plain text from raw HTML data?

房东的猫 submitted on 2020-01-01 07:52:49
Question: For example: scrapy shell http://scrapy.org/ content = hxs.select('//*[@id="content"]').extract()[0] print content Then, I get the following raw HTML code: <div id="content"> <h2>Welcome to Scrapy</h2> <h3>What is Scrapy?</h3> <p>Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.</p> <h3>Features</h3> <dl> …
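One way to get plain text rather than markup is to select the text nodes themselves with //text() and join them; w3lib.html.remove_tags() is another option. A short sketch, assuming the modern response.xpath API instead of the deprecated HtmlXPathSelector from the question, run inside scrapy shell http://scrapy.org/ (the #content id comes from the question and may not exist on the current site):

text_pieces = response.xpath('//*[@id="content"]//text()').getall()
plain_text = ' '.join(piece.strip() for piece in text_pieces if piece.strip())
print(plain_text)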

Is it possible to pass a variable from start_requests() to parse() for each individual request?

末鹿安然 submitted on 2020-01-01 07:35:06
Question: I'm using a loop to generate my requests inside start_requests() and I'd like to pass the index to parse() so it can store it in the item. However, when I use self.i the output has the max value of i (from the last loop iteration) for every item. I can use response.url.re('regex to extract the index'), but I wonder if there is a clean way to pass a variable from start_requests() to parse(). Answer 1: You can use the scrapy.Request meta attribute: import scrapy class MySpider(scrapy.Spider): name = 'myspider' def start …
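A sketch of the meta-based pattern the answer describes, completing the idea rather than the truncated snippet above; the URLs and the yielded fields are placeholders (newer Scrapy versions also offer cb_kwargs for the same purpose):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/page1', 'http://example.com/page2']

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            # meta travels with the request and comes back on the response,
            # so each callback sees its own index instead of the last loop value.
            yield scrapy.Request(url, callback=self.parse, meta={'index': index})

    def parse(self, response):
        yield {'index': response.meta['index'], 'url': response.url}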

Passing selenium driver to scrapy

喜你入骨 submitted on 2020-01-01 07:22:25
Question: I've spent a long time trying to figure this out to no avail. I've read a lot about passing back an HtmlResponse and using Selenium middleware, but have struggled to understand how to structure the code and implement it in my solution. Here is my spider code: import scrapy from selenium import webdriver from selenium.webdriver.common.keys import Keys from time import sleep count = 0 class ContractSpider(scrapy.Spider): name = "contracts" def start_requests(self): urls = [ 'https://www …
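One pattern the question alludes to is a downloader middleware that drives Selenium and hands the spider an HtmlResponse, so callbacks see the rendered page. A minimal sketch under those assumptions; the headless-Chrome options are illustrative, and the class would still need to be enabled in DOWNLOADER_MIDDLEWARES under whatever module path it actually lives at (e.g. 'myproject.middlewares.SeleniumMiddleware': 543 is a placeholder path):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    """Downloader middleware that fetches pages with Selenium instead of Scrapy's downloader."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning an HtmlResponse short-circuits Scrapy's own download,
        # so the spider callback receives the Selenium-rendered page.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )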

How do I create rules for a CrawlSpider using Scrapy

前提是你 submitted on 2020-01-01 06:42:09
Question: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from manga.items import MangaItem class MangaHere(BaseSpider): name = "mangah" allowed_domains = ["mangahere.com"] start_urls = ["http://www.mangahere.com/seinen/"] def parse(self,response): hxs = HtmlXPathSelector(response) sites = hxs.select('//ul/li/div') items = [] for site in sites: rating = site.select("p/span/text()").extract() if rating > 4.5: item = MangaItem() item["title"] = site.select("div/a/text() …
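If the goal is rule-driven crawling, a CrawlSpider with a Rule/LinkExtractor pair is the usual shape. A sketch using modern imports in place of the deprecated BaseSpider/HtmlXPathSelector; the allow pattern is a guess at the site's listing URLs, the XPaths are taken from the question, and the original list-vs-float comparison is replaced with an explicit conversion (a CrawlSpider must not override parse(), hence the separate callback name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from manga.items import MangaItem

class MangaHere(CrawlSpider):
    name = "mangah"
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    # Placeholder pattern: follow listing pages and parse each one.
    rules = (
        Rule(LinkExtractor(allow=(r'/seinen/',)), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for site in response.xpath('//ul/li/div'):
            # extract() returned a list of strings, which was being compared
            # to 4.5; take the first match and convert it to a float instead.
            rating = site.xpath('p/span/text()').get()
            if rating and float(rating) > 4.5:
                item = MangaItem()
                item["title"] = site.xpath('div/a/text()').get()
                yield item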

scrapyd-client command not found

风流意气都作罢 submitted on 2020-01-01 05:31:25
Question: I'd just installed scrapyd-client (1.1.0) in a virtualenv and ran the 'scrapyd-deploy' command successfully, but when I run 'scrapyd-client' the terminal says: command not found: scrapyd-client. According to the readme file (https://github.com/scrapy/scrapyd-client), there should be a 'scrapyd-client' command. I checked the path '/lib/python2.7/site-packages/scrapyd-client' and found only 'scrapyd-deploy' in the folder. Has the 'scrapyd-client' command been removed? Answer 1: Create a fresh …

Scrapy: how to catch download error and try download it again

∥☆過路亽.° submitted on 2020-01-01 05:29:12
Question: During my crawl, some pages failed due to unexpected redirection and no response was returned. How can I catch this kind of error and re-schedule a request with the original URL, not the redirected one? Before asking here, I did a lot of searching on Google. It looks like there are two ways to fix this issue: one is to catch the exception in a downloader middleware, the other is to process the download exception in an errback on the spider's request. For these two methods, I have some questions. For method 1, I don't know …
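A sketch of the errback route mentioned above: attach an errback to each request and, when the download fails, re-yield the original URL with dont_filter=True so the duplicate filter doesn't drop the retry. The start URL and the retry cap are placeholders; the downloader-middleware route (process_exception()) that the question also mentions is not shown here:

import scrapy

class RetrySpider(scrapy.Spider):
    name = 'retry_example'
    start_urls = ['http://example.com/']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 errback=self.on_error,
                                 meta={'retry_times': 0})

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}

    def on_error(self, failure):
        request = failure.request
        retries = request.meta.get('retry_times', 0)
        if retries < 3:  # arbitrary cap on retries
            # Re-schedule the *original* URL; dont_filter keeps the
            # dupefilter from discarding the repeated request.
            yield request.replace(dont_filter=True,
                                  meta={'retry_times': retries + 1})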

Scrapy Spider: Restart spider when finishes

佐手、 submitted on 2020-01-01 05:22:07
Question: I'm trying to make my Scrapy spider launch again when the close reason is my internet connection dropping (at night the internet goes down for about 5 minutes). When the internet goes down, the spider closes after 5 retries. I'm trying to use this function inside my spider definition to restart the spider when it closes: def handle_spider_closed(spider, reason): relaunch = False for key in spider.crawler.stats._stats.keys(): if 'DNSLookupError' in key: relaunch = True break if relaunch: spider = …
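Because the Twisted reactor cannot be restarted once it stops, relaunching is usually easier to manage from a wrapper script than from inside the spider. A sketch of that approach using CrawlerRunner; MySpider, its import path, and the DNSLookupError check are assumptions standing in for the asker's actual spider and close-reason logic:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from myproject.spiders.myspider import MySpider  # placeholder import

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_until_clean():
    while True:
        crawler = runner.create_crawler(MySpider)
        yield runner.crawl(crawler)
        # Relaunch only if the previous run hit DNS lookup failures
        # (i.e. the connection dropped); otherwise stop for good.
        if not any('DNSLookupError' in key for key in crawler.stats.get_stats()):
            break
    reactor.stop()

crawl_until_clean()
reactor.run()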

Scrapy middleware order

回眸只為那壹抹淺笑 submitted on 2020-01-01 05:20:09
Question: The Scrapy documentation says: the first middleware is the one closer to the engine and the last is the one closer to the downloader. To decide which order to assign to your middleware, see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied. I'm not entirely clear from this …
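In practice the numbers in DOWNLOADER_MIDDLEWARES control that order: a lower value sits closer to the engine, so its process_request() runs earlier and its process_response() runs later, while a higher value sits closer to the downloader. A short settings.py illustration with hypothetical middleware classes (the built-in RetryMiddleware is at 550 in DOWNLOADER_MIDDLEWARES_BASE):

# settings.py (sketch; the custom middleware paths are placeholders)
DOWNLOADER_MIDDLEWARES = {
    # Lower value: closer to the engine. Its process_request() runs before
    # the others and its process_response() runs after them.
    'myproject.middlewares.AddCustomHeaderMiddleware': 100,
    # Higher value: closer to the downloader. Placed after the built-in
    # RetryMiddleware (550), so it sees requests that have already been
    # through retry handling.
    'myproject.middlewares.FinalTouchMiddleware': 600,
    # Setting a built-in middleware to None disables it.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}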