scrapy

DEBUG: Crawled (404)

这一生的挚爱 submitted on 2020-07-10 09:38:37
Question: This is my code:

# -*- coding: utf-8 -*-
import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)

I have also set a user agent in settings.py. Then I get this error:

2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance
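A likely culprit (an assumption, since the log is truncated) is that allowed_domains contains a path and the start URL has a doubled slash; Scrapy expects bare host names in allowed_domains. A minimal corrected sketch:

import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    # allowed_domains should list bare domains, with no path component
    allowed_domains = ['money.finance.sina.com.cn']
    # single trailing slash instead of the doubled one
    start_urls = ['http://money.finance.sina.com.cn/mkt/']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').getall()
        print(contents)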

Scrapy custom pipeline outputting files half the size expected

和自甴很熟 submitted on 2020-07-10 07:09:46
Question: I'm trying to create a custom pipeline for a Scrapy project that outputs the collected items to CSV files. In order to keep each file's size down, I want to set a maximum number of rows that each file can have. Once the line limit has been reached in the current file, a new file is created to continue outputting the items. Luckily, I found a question where someone was looking to do the same thing, and there's an answer to that question that shows an example implementation. I implemented the
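A minimal sketch of such a rotating CSV pipeline, assuming dict-like items and using illustrative names (RotatingCsvPipeline, MAX_ROWS, the field list); this is not the implementation from the linked answer:

import csv

class RotatingCsvPipeline:
    MAX_ROWS = 5000               # illustrative row limit per file
    FIELDS = ['title', 'url']     # illustrative column names

    def open_spider(self, spider):
        self.file_index = 0
        self.file = None
        self._open_new_file()

    def _open_new_file(self):
        if self.file:
            self.file.close()
        self.file_index += 1
        # newline='' prevents the extra blank rows csv emits on Windows
        self.file = open(f'items_{self.file_index}.csv', 'w', newline='')
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()
        self.row_count = 0

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        self.row_count += 1
        if self.row_count >= self.MAX_ROWS:
            self._open_new_file()
        return item

    def close_spider(self, spider):
        if self.file:
            self.file.close()

It would be enabled the usual way, e.g. ITEM_PIPELINES = {'myproject.pipelines.RotatingCsvPipeline': 300} (path assumed).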

How “download_slot” works within scrapy

混江龙づ霸主 submitted on 2020-07-06 10:40:46
Question: I've created a script in Scrapy to parse the author name of different posts from the landing page and then pass it to the parse_page method using the meta keyword, in order to print the post content along with the author name at the same time. I've used download_slot within the meta keyword, which allegedly makes the script run faster. Although it is not necessary to comply with the logic I tried to apply here, I would like to stick to it only to understand how download_slot works within any script
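A minimal sketch of that pattern, using an assumed example site and selectors; download_slot names the per-slot queue Scrapy uses for concurrency and delay accounting, so requests sharing a slot share one delay budget:

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'authors'
    start_urls = ['https://quotes.toscrape.com/']  # illustrative site

    def parse(self, response):
        for post in response.css('div.quote'):
            author = post.css('small.author::text').get()
            link = post.css('span a::attr(href)').get()
            yield response.follow(
                link,
                callback=self.parse_page,
                # requests with the same download_slot share one
                # concurrency/delay budget in the downloader
                meta={'author': author, 'download_slot': 'quotes.toscrape.com'},
            )

    def parse_page(self, response):
        yield {
            'author': response.meta['author'],
            'bio': response.css('div.author-description::text').get(),
        }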

How to send JavaScript and Cookies Enabled in Scrapy?

╄→尐↘猪︶ㄣ submitted on 2020-07-05 07:20:09
Question: I am scraping a website using Scrapy which requires cookies and JavaScript to be enabled. I don't think I will have to actually process JavaScript; all I need is to pretend as if JavaScript is enabled. Here is what I have tried:

1) Enable cookies through the following in settings:

COOKIES_ENABLED = True
COOKIES_DEBUG = True

2) Using the download middleware for cookies:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib
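A sketch of settings that often suffice when the site only checks for browser-like headers and cookies rather than truly executing JavaScript (the header values are assumptions); note the scrapy.contrib paths above are from old Scrapy releases and now live under scrapy.downloadermiddlewares:

# settings.py — illustrative values, not a guaranteed fix
COOKIES_ENABLED = True
COOKIES_DEBUG = True

# Present a browser-like identity; the exact strings are assumptions
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/91.0.4472.124 Safari/537.36')

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}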

How to handle a 429 Too Many Requests response in Scrapy?

人盡茶涼 submitted on 2020-07-04 08:03:17
Question: I'm trying to run a scraper whose output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {
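One common approach (a sketch, not necessarily the accepted answer) is to make the retry middleware handle 429 and slow the crawl down; these are real Scrapy settings, but the values are illustrative:

# settings.py — illustrative values
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429]      # retry 429s instead of ignoring them (replaces the default list)
RETRY_TIMES = 5

DOWNLOAD_DELAY = 2            # base delay between requests, in seconds
AUTOTHROTTLE_ENABLED = True   # back off automatically when the server pushes back
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60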

I'm getting JavaScript code instead of rendered HTML content with scrapy-splash

二次信任 submitted on 2020-07-03 17:30:08
Question: I'm trying to use scrapy-splash to load a JavaScript-based page to get the rendered HTML content of the page, but all I get is JavaScript code as a response. Why doesn't my spider execute the JavaScript code of the page? These are my Scrapy settings:

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy
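With those middlewares configured, a frequent cause of this symptom is still issuing plain scrapy.Request objects, which bypass Splash. A minimal sketch using SplashRequest instead (URL and wait time are assumptions):

import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = 'jspage'

    def start_requests(self):
        # SplashRequest routes the fetch through the Splash server, which
        # executes the page's JavaScript before returning rendered HTML
        yield SplashRequest(
            'https://example.com/js-page',   # illustrative URL
            callback=self.parse,
            args={'wait': 2},                # give scripts time to run
        )

    def parse(self, response):
        self.logger.info(response.css('title::text').get())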

Scrapy encounters DEBUG: Crawled (400)

假装没事ソ submitted on 2020-07-03 13:06:04
Question: I'm trying to scrape the page 'https://zhuanlan.zhihu.com/wangzhenotes' with Scrapy. I run this command:

scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes'

and got:

DEBUG: Crawled (400) <GET https://zhuanlan.zhihu.com/wangzhenotes> (referer: None)

I guess I'm encountering some kind of anti-scraping measure. How do I know what techniques the site is using? Here is the full log:

(base) $ scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes'
2020-07-01 09:46:03 [scrapy.utils.log] INFO: Scrapy 2.1
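A quick first check (a sketch assuming the 400 is triggered by Scrapy's default headers, which may not be the whole story for this site) is to re-fetch from within scrapy shell using browser-like headers:

# Inside scrapy shell; the header values are illustrative assumptions
from scrapy import Request

req = Request(
    'https://zhuanlan.zhihu.com/wangzhenotes',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    },
)
fetch(req)  # shell helper: re-runs the request and rebinds `response`

If the page then returns 200, the block is header-based; if not, the site may be checking cookies or requiring JavaScript execution.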

Using Scrapy LinkExtractor() to locate specific domain extensions

和自甴很熟 submitted on 2020-06-29 11:54:51
Question: I want to use Scrapy's LinkExtractor() to only follow links in the .th domain. I see there is a deny_extensions(list) parameter, but no allow_extensions() parameter. Given that, how do I restrict links to allow only domains in .th?

Answer 1: deny_extensions is there to filter out URLs ending with .gz, .exe, and so on. You are probably looking for allow_domains:

allow_domains (str or list) – a single value or a list of strings containing the domains which will be considered for extracting the links

deny
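Since allow_domains matches exact domain names rather than TLD patterns, another option (an assumption, not part of the quoted answer) is the allow parameter, a regex applied to the full URL:

from scrapy.linkextractors import LinkExtractor

# Illustrative regex: keep only URLs whose host ends in .th
th_links = LinkExtractor(allow=r'^https?://[^/]+\.th(/|$)')
# links = th_links.extract_links(response)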