scrapy

Scrapy pipeline extracting in the wrong csv format

Submitted by 坚强是说给别人听的谎言 on 2019-12-25 03:43:08
Question: My Hacker News spider outputs all the results on one line, instead of one per line, as can be seen here. Here is my code:

    import scrapy
    import string
    import urlparse
    from scrapy.selector import Selector
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors import LinkExtractor

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        score = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = 'hackernews'
        allowed_domains = [
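The usual cause of this symptom is building a single item whose fields hold lists of every title, link, and score, so the feed exporter writes one CSV row in total. A minimal sketch of a parse callback that yields one item per result instead; the row and field selectors are illustrative assumptions, not taken from the original spider:

    def parse(self, response):
        # Yield a separate HnItem for each result row so the CSV
        # exporter writes one line per story rather than one line total.
        for row in response.xpath('//tr[@class="athing"]'):  # placeholder selector
            item = HnItem()
            item['title'] = row.xpath('.//td[@class="title"]/a/text()').extract_first()
            item['link'] = row.xpath('.//td[@class="title"]/a/@href').extract_first()
            item['score'] = row.xpath('.//span[@class="score"]/text()').extract_first()
            yield item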

Scrapy: URL error, Program adds unnecessary characters(URL-codes)

Submitted by 不羁岁月 on 2019-12-25 03:42:41
Question: I'm using Scrapy to crawl a German forum: http://www.musikerboard.de/forum. It follows all subforums and extracts information from threads. The problem: during crawling it gives me an error on multiple thread links:

    2015-09-26 14:01:59 [scrapy] DEBUG: Ignoring response <404 http://www.musiker-board.de/threads/spotify-premium-paket.621224/%0A%09%09>: HTTP status code is not handled or not allowed

The URL is fine except for this part, /%0A%09%09, which causes the 404 error. I don't know why the program
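%0A is a percent-encoded newline and %09 a tab, which suggests the href attributes in the page markup contain trailing whitespace that Scrapy encodes into the request URL. A hedged sketch of one way to strip it, using the link extractor's process_value hook (the allow pattern is an illustrative assumption):

    from scrapy.linkextractors import LinkExtractor

    # Strip the stray "\n\t\t" from each extracted href before Scrapy
    # turns it into a request URL, so no %0A%09%09 suffix is appended.
    link_extractor = LinkExtractor(
        allow=r'/threads/',                      # placeholder pattern
        process_value=lambda value: value.strip(),
    )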

Scraping site that uses AJAX

Submitted by 筅森魡賤 on 2019-12-25 03:34:37
Question: I've read some relevant posts here but couldn't figure out an answer. I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and a user should press "Show more" to get 10 more reviews (which also adds #add10 to the end of the site's address) every time he scrolls down to the end of the review list. Actually, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is
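One thing worth knowing here: a fragment such as #add1000 is handled by the page's JavaScript and is never sent to the server, so fetching the URL with the fragment through Scrapy returns the same initial HTML. The usual approach is to find the XHR request the "Show more" button fires (via the browser's network tab) and call that endpoint directly. A sketch under that assumption; the endpoint, parameters, and selectors below are made-up placeholders:

    import scrapy

    class ReviewsSpider(scrapy.Spider):
        name = 'reviews'
        start_urls = ['https://example.com/item/reviews']

        def parse(self, response):
            # Hypothetical AJAX endpoint discovered in the browser dev tools;
            # it returns the review HTML that "Show more" would inject.
            yield scrapy.Request(
                'https://example.com/item/reviews/ajax?offset=10&count=1000',
                callback=self.parse_more_reviews,
            )

        def parse_more_reviews(self, response):
            for review in response.css('div.review'):   # placeholder selector
                yield {'text': ' '.join(review.css('::text').getall())}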

Runtime Error on AWS Lambda with Scrapy - Reuse container issue

Submitted by 戏子无情 on 2019-12-25 02:55:33
Question: I had a problem with AWS Lambda containers and Scrapy. When I execute the code locally with SAM, it never fails, but when the code is executed in AWS Lambda containers twice within a short period of time, it produces this error:

    START RequestId: cbd8f1cf-a9a1-41eb-89e9-bedf5ba1a0f7 Version: $LATEST
    2019-01-24 12:02:01 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
    2019-01-24 12:02:01 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib
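Failures only on the second invocation typically point at warm-container reuse: Twisted's reactor, which CrawlerProcess starts and stops, cannot be restarted within the same Python process. A common workaround (a sketch, not confirmed as this poster's fix) is to run each crawl in a fresh child process so the reactor lives and dies with it; MySpider is a placeholder name:

    import multiprocessing
    from scrapy.crawler import CrawlerProcess

    def _run_crawl():
        # A brand-new process gets a brand-new reactor, so no restart is
        # ever needed, even when Lambda reuses the warm container.
        process = CrawlerProcess()
        process.crawl(MySpider)   # placeholder spider class
        process.start()

    def handler(event, context):
        p = multiprocessing.Process(target=_run_crawl)
        p.start()
        p.join()
        return {'statusCode': 200}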

Python Scrapy: passing properties into parser

Submitted by 血红的双手。 on 2019-12-25 02:52:15
Question: I'm new to Scrapy and web scraping in general, so this might be a stupid question, but it wouldn't be the first time, so here goes. I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs, e.g. URLs A, B, and C are Category 1 while URLs D and E are Category 2, and then be able to store the category on the resulting items when the parser processes the response for each URL. I guess I could have a separate spider
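A common pattern for this (offered as a sketch, not necessarily what the poster ended up with) is to override start_requests and attach the category to each request's meta dict, then copy it onto the item in the parse callback; the URL-to-category mapping below is invented for illustration:

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = 'categories'

        # Hypothetical mapping of URL -> category.
        url_categories = {
            'http://example.com/a': 1,
            'http://example.com/b': 1,
            'http://example.com/d': 2,
        }

        def start_requests(self):
            for url, category in self.url_categories.items():
                yield scrapy.Request(url, callback=self.parse,
                                     meta={'category': category})

        def parse(self, response):
            yield {
                'url': response.url,
                # Carried over from the request that produced this response.
                'category': response.meta['category'],
            }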

Setting sticky cookie in scrapy

Submitted by 痞子三分冷 on 2019-12-25 01:44:01
Question: The website I am scraping has JavaScript that sets a cookie and checks it in the backend to make sure JS is enabled. Extracting the cookie from the HTML code is simple enough, but then setting it seems to be a problem in Scrapy. So my code is:

    from scrapy.contrib.spiders.init import InitSpider

    class TestSpider(InitSpider):
        ...
        rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html', )),
                      callback='parse_page'),)

        def init_request(self):
            return Request(url = self.init_url, callback=self
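For reference, Scrapy accepts a cookies argument on Request, and the built-in cookies middleware then re-sends those cookies on every later request in the same cookiejar, which is usually all "sticky" requires. A sketch; the cookie name and the regex for pulling its value out of the page's script are illustrative assumptions:

    import scrapy

    class StickyCookieSpider(scrapy.Spider):
        name = 'sticky'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Pull the value the page's JS would set; placeholder pattern.
            token = response.css('script::text').re_first(
                r"setCookie\('js_check',\s*'(\w+)'\)")
            # Cookies set here are kept by the cookies middleware and sent
            # automatically on subsequent requests in this cookiejar.
            yield scrapy.Request(
                'http://example.com/products/index.html',
                cookies={'js_check': token},
                callback=self.parse_page,
            )

        def parse_page(self, response):
            self.logger.info('Crawled %s with the JS cookie set', response.url)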

Empty variable within instance of a class, despite specifically setting it

Submitted by 社会主义新天地 on 2019-12-25 01:39:15
Question: When I run the following code:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        search_url = ''

        def start_requests(self):
            print('self.search_url is currently: ' + self.search_url)
            yield scrapy.Request(url=self.search_url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)
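A likely explanation, though the excerpt cuts off before the poster's run code: CrawlerProcess instantiates the spider class itself, so an attribute assigned to some other instance never reaches the spider that actually runs. Keyword arguments passed to crawl() are forwarded to the spider's constructor and become instance attributes; the URL here is just an example value:

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    # search_url is forwarded to QuotesSpider.__init__ and set as an
    # attribute on the instance the crawler actually creates.
    process.crawl(QuotesSpider, search_url='http://quotes.toscrape.com/page/1/')
    process.start()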

Scrapy send stats to a URL passed as argument as a POST request every 5 minutes

Submitted by 三世轮回 on 2019-12-25 01:04:18
Question: I need to send the crawler stats to a URL which is passed in as a spider argument. I need to make a POST request at regular intervals of 5 minutes. How can I do that?

Answer 1: You will probably want to write an extension that simply makes a POST request every 5 minutes. You can make these requests either using Scrapy's own mechanisms (e.g. engine.download()) or with a different async HTTP client (e.g. treq). If you're not sure how to structure your extension, you can take a look at
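A minimal sketch of such an extension, assuming treq is installed and the target URL is passed as -a stats_url=... (the attribute name is an assumption); it would still need to be enabled via the EXTENSIONS setting:

    import json
    import treq
    from twisted.internet import task
    from scrapy import signals

    class PeriodicStatsPoster:
        """POSTs the crawler stats to spider.stats_url every 5 minutes."""

        def __init__(self, crawler):
            self.crawler = crawler
            self.loop = None
            crawler.signals.connect(self.spider_opened, signals.spider_opened)
            crawler.signals.connect(self.spider_closed, signals.spider_closed)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def spider_opened(self, spider):
            self.loop = task.LoopingCall(self.post_stats, spider)
            self.loop.start(300, now=False)   # fire every 300 s = 5 min

        def spider_closed(self, spider):
            if self.loop and self.loop.running:
                self.loop.stop()

        def post_stats(self, spider):
            url = getattr(spider, 'stats_url', None)   # hypothetical spider argument
            if not url:
                return
            body = json.dumps(self.crawler.stats.get_stats(),
                              default=str).encode('utf-8')
            # treq returns a Deferred; LoopingCall waits for it to fire
            # before scheduling the next call.
            return treq.post(url, data=body,
                             headers={b'Content-Type': [b'application/json']})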

Unable to rename downloaded images through pipelines without the usage of item.py

Submitted by 半城伤御伤魂 on 2019-12-25 00:20:03
Question: I've created a script using Python's scrapy module to download and rename movie images from multiple pages of a torrent site and store them in a desktop folder. When it comes to downloading those images and storing them in a desktop folder, my script works errorlessly. However, what I'm struggling to do now is rename those files on the fly. As I didn't make use of the items.py file, and I do not wish to either, I hardly understand how the logic of the pipelines.py file would be to handle the
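The renaming itself lives in the pipeline, not in items.py: subclassing ImagesPipeline and overriding file_path() controls the saved filename, and plain dict items work fine without an Item class. A sketch under those assumptions; the 'title' meta key and the extension fallback are illustrative choices:

    import os
    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class RenamingImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Plain dicts are fine here, so no items.py is required.
            for url in item.get('image_urls', []):
                yield scrapy.Request(url, meta={'title': item.get('title')})

        def file_path(self, request, response=None, info=None, *, item=None):
            # Save as "<movie title>.<ext>" instead of the default SHA1 hash.
            ext = os.path.splitext(request.url)[1] or '.jpg'
            return request.meta['title'] + ext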