scrapy

Scrapy pipeline extracting in the wrong csv format

Submitted by 坚强是说给别人听的谎言 on 2019-12-25 03:43:08
Question: My Hacker News spider outputs all the results on one line, instead of one per line, as can be seen here. Here is my code:

    import scrapy
    import string
    import urlparse
    from scrapy.selector import Selector
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors import LinkExtractor

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        score = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = 'hackernews'
        allowed_domains = [
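The usual cause of this symptom is building a single item whose fields hold lists of every title, link, and score, so the feed exporter writes one CSV row in total. A minimal sketch of a parse callback that yields one item per result instead; the row and field selectors are illustrative assumptions, not taken from the original spider:

    def parse(self, response):
        # Yield a separate HnItem for each result row so the CSV
        # exporter writes one line per story rather than one line total.
        for row in response.xpath('//tr[@class="athing"]'):  # placeholder selector
            item = HnItem()
            item['title'] = row.xpath('.//td[@class="title"]/a/text()').extract_first()
            item['link'] = row.xpath('.//td[@class="title"]/a/@href').extract_first()
            item['score'] = row.xpath('.//span[@class="score"]/text()').extract_first()
            yield item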

Scrapy: URL error, Program adds unnecessary characters(URL-codes)

Submitted by 不羁岁月 on 2019-12-25 03:42:41
Question: I'm using Scrapy to crawl a German forum: http://www.musikerboard.de/forum. It follows all subforums and extracts information from threads. The problem: during crawling it gives me an error on multiple thread links:

    2015-09-26 14:01:59 [scrapy] DEBUG: Ignoring response <404 http://www.musiker-board.de/threads/spotify-premium-paket.621224/%0A%09%09>: HTTP status code is not handled or not allowed

The URL is fine except for this part, /%0A%09%09, which causes the 404 error. I don't know why the program
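%0A is a percent-encoded newline and %09 a tab, which suggests the href attributes in the page markup contain trailing whitespace that Scrapy encodes into the request URL. A hedged sketch of one way to strip it, using the link extractor's process_value hook (the allow pattern is an illustrative assumption):

    from scrapy.linkextractors import LinkExtractor

    # Strip the stray "\n\t\t" from each extracted href before Scrapy
    # turns it into a request URL, so no %0A%09%09 suffix is appended.
    link_extractor = LinkExtractor(
        allow=r'/threads/',                      # placeholder pattern
        process_value=lambda value: value.strip(),
    )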

Scraping site that uses AJAX

Submitted by 筅森魡賤 on 2019-12-25 03:34:37
Question: I've read some relevant posts here but couldn't figure out an answer. I'm trying to crawl a web page with reviews. When the site is visited there are only 10 reviews at first, and a user should press "Show more" to get 10 more reviews (which also adds #add10 to the end of the site's address) every time he scrolls down to the end of the review list. Actually, a user can get the full review list by adding #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is
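One thing worth knowing here: a fragment such as #add1000 is handled by the page's JavaScript and is never sent to the server, so fetching the URL with the fragment through Scrapy returns the same initial HTML. The usual approach is to find the XHR request the "Show more" button fires (via the browser's network tab) and call that endpoint directly. A sketch under that assumption; the endpoint, parameters, and selectors below are made-up placeholders:

    import scrapy

    class ReviewsSpider(scrapy.Spider):
        name = 'reviews'
        start_urls = ['https://example.com/item/reviews']

        def parse(self, response):
            # Hypothetical AJAX endpoint discovered in the browser dev tools;
            # it returns the review HTML that "Show more" would inject.
            yield scrapy.Request(
                'https://example.com/item/reviews/ajax?offset=10&count=1000',
                callback=self.parse_more_reviews,
            )

        def parse_more_reviews(self, response):
            for review in response.css('div.review'):   # placeholder selector
                yield {'text': ' '.join(review.css('::text').getall())}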

Runtime Error on AWS Lambda with Scrapy - Reuse container issue

Submitted by 戏子无情 on 2019-12-25 02:55:33
Question: I had a problem with AWS Lambda containers and Scrapy. When I execute the code locally with SAM, it never fails, but when the code is executed in AWS Lambda containers twice within a short period of time, it produces this error:

    START RequestId: cbd8f1cf-a9a1-41eb-89e9-bedf5ba1a0f7 Version: $LATEST
    2019-01-24 12:02:01 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: scrapybot)
    2019-01-24 12:02:01 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.1, w3lib
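Failures only on the second invocation typically point at warm-container reuse: Twisted's reactor, which CrawlerProcess starts and stops, cannot be restarted within the same Python process. A common workaround (a sketch, not confirmed as this poster's fix) is to run each crawl in a fresh child process so the reactor lives and dies with it; MySpider is a placeholder name:

    import multiprocessing
    from scrapy.crawler import CrawlerProcess

    def _run_crawl():
        # A brand-new process gets a brand-new reactor, so no restart is
        # ever needed, even when Lambda reuses the warm container.
        process = CrawlerProcess()
        process.crawl(MySpider)   # placeholder spider class
        process.start()

    def handler(event, context):
        p = multiprocessing.Process(target=_run_crawl)
        p.start()
        p.join()
        return {'statusCode': 200}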

Python Scrapy: passing properties into parser

Submitted by 血红的双手。 on 2019-12-25 02:52:15
Question: I'm new to Scrapy and web scraping in general, so this might be a stupid question, but it wouldn't be the first time, so here goes. I have a simple Scrapy spider, based on the tutorial example, that processes various URLs (in start_urls). I would like to categorise the URLs, e.g. URLs A, B, and C are Category 1 while URLs D and E are Category 2, and then be able to store the category on the resulting items when the parser processes the response for each URL. I guess I could have a separate spider
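A common pattern for this (offered as a sketch, not necessarily what the poster ended up with) is to override start_requests and attach the category to each request's meta dict, then copy it onto the item in the parse callback; the URL-to-category mapping below is invented for illustration:

    import scrapy

    class CategorySpider(scrapy.Spider):
        name = 'categories'

        # Hypothetical mapping of URL -> category.
        url_categories = {
            'http://example.com/a': 1,
            'http://example.com/b': 1,
            'http://example.com/d': 2,
        }

        def start_requests(self):
            for url, category in self.url_categories.items():
                yield scrapy.Request(url, callback=self.parse,
                                     meta={'category': category})

        def parse(self, response):
            yield {
                'url': response.url,
                # Carried over from the request that produced this response.
                'category': response.meta['category'],
            }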

Setting sticky cookie in scrapy

Submitted by 痞子三分冷 on 2019-12-25 01:44:01
Question: The website I am scraping has JavaScript that sets a cookie and checks it in the backend to make sure JS is enabled. Extracting the cookie from the HTML code is simple enough, but then setting it seems to be a problem in Scrapy. So my code is:

    from scrapy.contrib.spiders.init import InitSpider

    class TestSpider(InitSpider):
        ...
        rules = (Rule(SgmlLinkExtractor(allow=('products/./index\.html', )),
                      callback='parse_page'),)

        def init_request(self):
            return Request(url = self.init_url, callback=self
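For reference, Scrapy accepts a cookies argument on Request, and the built-in cookies middleware then re-sends those cookies on every later request in the same cookiejar, which is usually all "sticky" requires. A sketch; the cookie name and the regex for pulling its value out of the page's script are illustrative assumptions:

    import scrapy

    class StickyCookieSpider(scrapy.Spider):
        name = 'sticky'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Pull the value the page's JS would set; placeholder pattern.
            token = response.css('script::text').re_first(
                r"setCookie\('js_check',\s*'(\w+)'\)")
            # Cookies set here are kept by the cookies middleware and sent
            # automatically on subsequent requests in this cookiejar.
            yield scrapy.Request(
                'http://example.com/products/index.html',
                cookies={'js_check': token},
                callback=self.parse_page,
            )

        def parse_page(self, response):
            self.logger.info('Crawled %s with the JS cookie set', response.url)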

Empty variable within instance of a class, despite specifically setting it

Submitted by 社会主义新天地 on 2019-12-25 01:39:15
Question: When I run the following code:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        search_url = ''

        def start_requests(self):
            print('self.search_url is currently: ' + self.search_url)
            yield scrapy.Request(url=self.search_url, callback=self.parse)

        def parse(self, response):
            page = response.url.split("/")[-2]
            filename = 'quotes-%s.html' % page
            with open(filename, 'wb') as f:
                f.write(response.body)
            self.log('Saved file %s' % filename)
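A likely explanation, though the excerpt cuts off before the poster's run code: CrawlerProcess instantiates the spider class itself, so an attribute assigned to some other instance never reaches the spider that actually runs. Keyword arguments passed to crawl() are forwarded to the spider's constructor and become instance attributes; the URL here is just an example value:

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    # search_url is forwarded to QuotesSpider.__init__ and set as an
    # attribute on the instance the crawler actually creates.
    process.crawl(QuotesSpider, search_url='http://quotes.toscrape.com/page/1/')
    process.start()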

Scrapy send stats to a URL passed as argument as a POST request every 5 minutes

Submitted by 三世轮回 on 2019-12-25 01:04:18
Question: I need to send the crawler stats to a URL which is passed in as a spider argument. I need to make a POST request at regular intervals of 5 minutes. How can I do that?

Answer 1: You will probably want to write an extension that simply makes a POST request every 5 minutes. You can make these requests either using Scrapy's own mechanisms (e.g. engine.download()) or with a different async HTTP client (e.g. treq). If you're not sure how to structure your extension, you can take a look at
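A minimal sketch of such an extension, assuming treq is installed and the target URL is passed as -a stats_url=... (the attribute name is an assumption); it would still need to be enabled via the EXTENSIONS setting:

    import json
    import treq
    from twisted.internet import task
    from scrapy import signals

    class PeriodicStatsPoster:
        """POSTs the crawler stats to spider.stats_url every 5 minutes."""

        def __init__(self, crawler):
            self.crawler = crawler
            self.loop = None
            crawler.signals.connect(self.spider_opened, signals.spider_opened)
            crawler.signals.connect(self.spider_closed, signals.spider_closed)

        @classmethod
        def from_crawler(cls, crawler):
            return cls(crawler)

        def spider_opened(self, spider):
            self.loop = task.LoopingCall(self.post_stats, spider)
            self.loop.start(300, now=False)   # fire every 300 s = 5 min

        def spider_closed(self, spider):
            if self.loop and self.loop.running:
                self.loop.stop()

        def post_stats(self, spider):
            url = getattr(spider, 'stats_url', None)   # hypothetical spider argument
            if not url:
                return
            body = json.dumps(self.crawler.stats.get_stats(),
                              default=str).encode('utf-8')
            # treq returns a Deferred; LoopingCall waits for it to fire
            # before scheduling the next call.
            return treq.post(url, data=body,
                             headers={b'Content-Type': [b'application/json']})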

Unable to rename downloaded images through pipelines without the usage of item.py

Submitted by 半城伤御伤魂 on 2019-12-25 00:20:03
Question: I've created a script using Python's scrapy module to download and rename movie images from multiple pages of a torrent site and store them in a desktop folder. When it comes to downloading those images and storing them in a desktop folder, my script works errorlessly. However, what I'm struggling to do now is rename those files on the fly. As I didn't make use of the items.py file, and I do not wish to either, I hardly understand how the logic of the pipelines.py file would be to handle the
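The renaming itself lives in the pipeline, not in items.py: subclassing ImagesPipeline and overriding file_path() controls the saved filename, and plain dict items work fine without an Item class. A sketch under those assumptions; the 'title' meta key and the extension fallback are illustrative choices:

    import os
    import scrapy
    from scrapy.pipelines.images import ImagesPipeline

    class RenamingImagesPipeline(ImagesPipeline):
        def get_media_requests(self, item, info):
            # Plain dicts are fine here, so no items.py is required.
            for url in item.get('image_urls', []):
                yield scrapy.Request(url, meta={'title': item.get('title')})

        def file_path(self, request, response=None, info=None, *, item=None):
            # Save as "<movie title>.<ext>" instead of the default SHA1 hash.
            ext = os.path.splitext(request.url)[1] or '.jpg'
            return request.meta['title'] + ext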