scrapy

Relative URL to absolute URL Scrapy

瘦欲@ submitted on 2020-01-01 08:38:35
Question: I need help converting relative URLs to absolute URLs in a Scrapy spider. I need to convert the links on my start pages to absolute URLs to get the images of the crawled items, which are on the start pages. I unsuccessfully tried different ways to achieve this and I'm stuck. Any suggestion? class ExampleSpider(scrapy.Spider): name = "example" allowed_domains = ["example.com"] start_urls = [ "http://www.example.com/billboard", "http://www.example.com/billboard?page=1" ] def parse(self, response): …
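One common way to resolve relative links is response.urljoin(), which joins a relative href against the page's own URL (response.follow() does the join implicitly when yielding new requests). A minimal sketch follows; the XPath and the yielded field name are illustrative placeholders, not the asker's actual markup:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/billboard",
        "http://www.example.com/billboard?page=1",
    ]

    def parse(self, response):
        # Placeholder selector: adjust it to the real page structure.
        for src in response.xpath('//img/@src').getall():
            # urljoin() resolves a relative URL against response.url.
            yield {"image_url": response.urljoin(src)}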

Is it possible for Scrapy to get plain text from raw HTML data?

房东的猫 submitted on 2020-01-01 07:52:49
Question: For example: scrapy shell http://scrapy.org/ content = hxs.select('//*[@id="content"]').extract()[0] print content Then, I get the following raw HTML code: <div id="content"> <h2>Welcome to Scrapy</h2> <h3>What is Scrapy?</h3> <p>Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.</p> <h3>Features</h3> <dl> …
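One way to get plain text rather than markup is to select the text nodes themselves with //text() and join them; w3lib.html.remove_tags() is another option. A short sketch, assuming the modern response.xpath API instead of the deprecated HtmlXPathSelector from the question, run inside scrapy shell http://scrapy.org/ (the #content id comes from the question and may not exist on the current site):

text_pieces = response.xpath('//*[@id="content"]//text()').getall()
plain_text = ' '.join(piece.strip() for piece in text_pieces if piece.strip())
print(plain_text)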

Is it possible to pass a variable from start_requests() to parse() for each individual request?

末鹿安然 submitted on 2020-01-01 07:35:06
Question: I'm using a loop to generate my requests inside start_requests() and I'd like to pass the index to parse() so it can store it in the item. However, when I use self.i the output has the max value of i (from the last loop iteration) for every item. I can use response.url.re('regex to extract the index'), but I wonder if there is a clean way to pass a variable from start_requests() to parse(). Answer 1: You can use the scrapy.Request meta attribute: import scrapy class MySpider(scrapy.Spider): name = 'myspider' def start …
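A sketch of the meta-based pattern the answer describes, completing the idea rather than the truncated snippet above; the URLs and the yielded fields are placeholders (newer Scrapy versions also offer cb_kwargs for the same purpose):

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/page1', 'http://example.com/page2']

    def start_requests(self):
        for index, url in enumerate(self.start_urls):
            # meta travels with the request and comes back on the response,
            # so each callback sees its own index instead of the last loop value.
            yield scrapy.Request(url, callback=self.parse, meta={'index': index})

    def parse(self, response):
        yield {'index': response.meta['index'], 'url': response.url}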

Passing selenium driver to scrapy

喜你入骨 submitted on 2020-01-01 07:22:25
Question: I've spent a long time trying to figure this out to no avail. I've read a lot about passing back an HtmlResponse and using Selenium middleware, but have struggled to understand how to structure the code and implement it in my solution. Here is my spider code: import scrapy from selenium import webdriver from selenium.webdriver.common.keys import Keys from time import sleep count = 0 class ContractSpider(scrapy.Spider): name = "contracts" def start_requests(self): urls = [ 'https://www …
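One pattern the question alludes to is a downloader middleware that drives Selenium and hands the spider an HtmlResponse, so callbacks see the rendered page. A minimal sketch under those assumptions; the headless-Chrome options are illustrative, and the class would still need to be enabled in DOWNLOADER_MIDDLEWARES under whatever module path it actually lives at (e.g. 'myproject.middlewares.SeleniumMiddleware': 543 is a placeholder path):

from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    """Downloader middleware that fetches pages with Selenium instead of Scrapy's downloader."""

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Returning an HtmlResponse short-circuits Scrapy's own download,
        # so the spider callback receives the Selenium-rendered page.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )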

How do I create rules for a CrawlSpider using Scrapy

前提是你 submitted on 2020-01-01 06:42:09
Question: from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from manga.items import MangaItem class MangaHere(BaseSpider): name = "mangah" allowed_domains = ["mangahere.com"] start_urls = ["http://www.mangahere.com/seinen/"] def parse(self,response): hxs = HtmlXPathSelector(response) sites = hxs.select('//ul/li/div') items = [] for site in sites: rating = site.select("p/span/text()").extract() if rating > 4.5: item = MangaItem() item["title"] = site.select("div/a/text() …
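If the goal is rule-driven crawling, a CrawlSpider with a Rule/LinkExtractor pair is the usual shape. A sketch using modern imports in place of the deprecated BaseSpider/HtmlXPathSelector; the allow pattern is a guess at the site's listing URLs, the XPaths are taken from the question, and the original list-vs-float comparison is replaced with an explicit conversion (a CrawlSpider must not override parse(), hence the separate callback name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from manga.items import MangaItem

class MangaHere(CrawlSpider):
    name = "mangah"
    allowed_domains = ["mangahere.com"]
    start_urls = ["http://www.mangahere.com/seinen/"]

    # Placeholder pattern: follow listing pages and parse each one.
    rules = (
        Rule(LinkExtractor(allow=(r'/seinen/',)), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        for site in response.xpath('//ul/li/div'):
            # extract() returned a list of strings, which was being compared
            # to 4.5; take the first match and convert it to a float instead.
            rating = site.xpath('p/span/text()').get()
            if rating and float(rating) > 4.5:
                item = MangaItem()
                item["title"] = site.xpath('div/a/text()').get()
                yield item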

scrapyd-client command not found

风流意气都作罢 submitted on 2020-01-01 05:31:25
Question: I'd just installed scrapyd-client (1.1.0) in a virtualenv and ran the 'scrapyd-deploy' command successfully, but when I run 'scrapyd-client' the terminal says: command not found: scrapyd-client. According to the readme file (https://github.com/scrapy/scrapyd-client), there should be a 'scrapyd-client' command. I checked the path '/lib/python2.7/site-packages/scrapyd-client' and found only 'scrapyd-deploy' in the folder. Has the 'scrapyd-client' command been removed? Answer 1: Create a fresh …

Scrapy: how to catch download error and try download it again

∥☆過路亽.° submitted on 2020-01-01 05:29:12
Question: During my crawl, some pages failed due to unexpected redirection and no response was returned. How can I catch this kind of error and re-schedule a request with the original URL, not the redirected one? Before asking here, I did a lot of searching on Google. It looks like there are two ways to fix this issue: one is to catch the exception in a downloader middleware, the other is to process the download exception in an errback on the spider's request. For these two methods, I have some questions. For method 1, I don't know …
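A sketch of the errback route mentioned above: attach an errback to each request and, when the download fails, re-yield the original URL with dont_filter=True so the duplicate filter doesn't drop the retry. The start URL and the retry cap are placeholders; the downloader-middleware route (process_exception()) that the question also mentions is not shown here:

import scrapy

class RetrySpider(scrapy.Spider):
    name = 'retry_example'
    start_urls = ['http://example.com/']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse,
                                 errback=self.on_error,
                                 meta={'retry_times': 0})

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}

    def on_error(self, failure):
        request = failure.request
        retries = request.meta.get('retry_times', 0)
        if retries < 3:  # arbitrary cap on retries
            # Re-schedule the *original* URL; dont_filter keeps the
            # dupefilter from discarding the repeated request.
            yield request.replace(dont_filter=True,
                                  meta={'retry_times': retries + 1})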

Scrapy Spider: Restart spider when finishes

佐手、 submitted on 2020-01-01 05:22:07
Question: I'm trying to make my Scrapy spider launch again when the close reason is my internet connection dropping (at night the internet goes down for about 5 minutes). When the internet goes down, the spider closes after 5 retries. I'm trying to use this function inside my spider definition to restart the spider when it closes: def handle_spider_closed(spider, reason): relaunch = False for key in spider.crawler.stats._stats.keys(): if 'DNSLookupError' in key: relaunch = True break if relaunch: spider = …
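Because the Twisted reactor cannot be restarted once it stops, relaunching is usually easier to manage from a wrapper script than from inside the spider. A sketch of that approach using CrawlerRunner; MySpider, its import path, and the DNSLookupError check are assumptions standing in for the asker's actual spider and close-reason logic:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

from myproject.spiders.myspider import MySpider  # placeholder import

configure_logging()
runner = CrawlerRunner(get_project_settings())

@defer.inlineCallbacks
def crawl_until_clean():
    while True:
        crawler = runner.create_crawler(MySpider)
        yield runner.crawl(crawler)
        # Relaunch only if the previous run hit DNS lookup failures
        # (i.e. the connection dropped); otherwise stop for good.
        if not any('DNSLookupError' in key for key in crawler.stats.get_stats()):
            break
    reactor.stop()

crawl_until_clean()
reactor.run()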

Scrapy middleware order

回眸只為那壹抹淺笑 submitted on 2020-01-01 05:20:09
Question: The Scrapy documentation says: the first middleware is the one closer to the engine and the last is the one closer to the downloader. To decide which order to assign to your middleware, see the DOWNLOADER_MIDDLEWARES_BASE setting and pick a value according to where you want to insert the middleware. The order does matter because each middleware performs a different action and your middleware could depend on some previous (or subsequent) middleware being applied. I'm not entirely clear from this …
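In practice the numbers in DOWNLOADER_MIDDLEWARES control that order: a lower value sits closer to the engine, so its process_request() runs earlier and its process_response() runs later, while a higher value sits closer to the downloader. A short settings.py illustration with hypothetical middleware classes (the built-in RetryMiddleware is at 550 in DOWNLOADER_MIDDLEWARES_BASE):

# settings.py (sketch; the custom middleware paths are placeholders)
DOWNLOADER_MIDDLEWARES = {
    # Lower value: closer to the engine. Its process_request() runs before
    # the others and its process_response() runs after them.
    'myproject.middlewares.AddCustomHeaderMiddleware': 100,
    # Higher value: closer to the downloader. Placed after the built-in
    # RetryMiddleware (550), so it sees requests that have already been
    # through retry handling.
    'myproject.middlewares.FinalTouchMiddleware': 600,
    # Setting a built-in middleware to None disables it.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}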