scrapy

DEBUG: Crawled (404)

这一生的挚爱 submitted on 2020-07-10 09:38:37
Question: This is my code:

# -*- coding: utf-8 -*-
import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    allowed_domains = ['money.finance.sina.com.cn/mkt/']
    start_urls = ['http://money.finance.sina.com.cn/mkt//']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').extract()
        print(contents)

I have also set a user agent in settings.py. Then I get this error:

2020-04-27 10:54:50 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://money.finance
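A likely culprit (an assumption, since the log is truncated) is that allowed_domains contains a path and the start URL has a doubled slash; Scrapy expects bare host names in allowed_domains. A minimal corrected sketch:

import scrapy

class SinasharesSpider(scrapy.Spider):
    name = 'SinaShares'
    # allowed_domains should list bare domains, with no path component
    allowed_domains = ['money.finance.sina.com.cn']
    # single trailing slash instead of the doubled one
    start_urls = ['http://money.finance.sina.com.cn/mkt/']

    def parse(self, response):
        contents = response.xpath('//*[@id="list_amount_ctrl"]/a[2]/@class').getall()
        print(contents)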

Scrapy custom pipeline outputting files half the size expected

和自甴很熟 submitted on 2020-07-10 07:09:46
Question: I'm trying to create a custom pipeline for a Scrapy project that outputs the collected items to CSV files. In order to keep each file's size down, I want to set a maximum number of rows that each file can have. Once the line limit has been reached in the current file, a new file is created to continue outputting the items. Luckily, I found a question where someone was looking to do the same thing, and there's an answer to that question that shows an example implementation. I implemented the
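A minimal sketch of such a rotating CSV pipeline, assuming dict-like items and using illustrative names (RotatingCsvPipeline, MAX_ROWS, the field list); this is not the implementation from the linked answer:

import csv

class RotatingCsvPipeline:
    MAX_ROWS = 5000               # illustrative row limit per file
    FIELDS = ['title', 'url']     # illustrative column names

    def open_spider(self, spider):
        self.file_index = 0
        self.file = None
        self._open_new_file()

    def _open_new_file(self):
        if self.file:
            self.file.close()
        self.file_index += 1
        # newline='' prevents the extra blank rows csv emits on Windows
        self.file = open(f'items_{self.file_index}.csv', 'w', newline='')
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()
        self.row_count = 0

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        self.row_count += 1
        if self.row_count >= self.MAX_ROWS:
            self._open_new_file()
        return item

    def close_spider(self, spider):
        if self.file:
            self.file.close()

It would be enabled the usual way, e.g. ITEM_PIPELINES = {'myproject.pipelines.RotatingCsvPipeline': 300} (path assumed).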

How “download_slot” works within scrapy

混江龙づ霸主 submitted on 2020-07-06 10:40:46
Question: I've created a script in Scrapy to parse the author name of different posts from the landing page and then pass it to the parse_page method using the meta keyword, in order to print the post content along with the author name at the same time. I've used download_slot within the meta keyword, which allegedly makes the script run faster. Although it is not necessary to comply with the logic I tried to apply here, I would like to stick to it only to understand how download_slot works within any script
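A minimal sketch of that pattern, using an assumed example site and selectors; download_slot names the per-slot queue Scrapy uses for concurrency and delay accounting, so requests sharing a slot share one delay budget:

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'authors'
    start_urls = ['https://quotes.toscrape.com/']  # illustrative site

    def parse(self, response):
        for post in response.css('div.quote'):
            author = post.css('small.author::text').get()
            link = post.css('span a::attr(href)').get()
            yield response.follow(
                link,
                callback=self.parse_page,
                # requests with the same download_slot share one
                # concurrency/delay budget in the downloader
                meta={'author': author, 'download_slot': 'quotes.toscrape.com'},
            )

    def parse_page(self, response):
        yield {
            'author': response.meta['author'],
            'bio': response.css('div.author-description::text').get(),
        }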

How to send JavaScript and Cookies Enabled in Scrapy?

╄→尐↘猪︶ㄣ submitted on 2020-07-05 07:20:09
Question: I am scraping a website using Scrapy which requires cookies and JavaScript to be enabled. I don't think I will have to actually process JavaScript; all I need is to pretend as if JavaScript is enabled. Here is what I have tried:

1) Enable cookies through the following in settings:

COOKIES_ENABLED = True
COOKIES_DEBUG = True

2) Using the download middleware for cookies:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib
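A sketch of settings that often suffice when the site only checks for browser-like headers and cookies rather than truly executing JavaScript (the header values are assumptions); note the scrapy.contrib paths above are from old Scrapy releases and now live under scrapy.downloadermiddlewares:

# settings.py — illustrative values, not a guaranteed fix
COOKIES_ENABLED = True
COOKIES_DEBUG = True

# Present a browser-like identity; the exact strings are assumptions
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/91.0.4472.124 Safari/537.36')

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
}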

How to handle a 429 Too Many Requests response in Scrapy?

人盡茶涼 submitted on 2020-07-04 08:03:17
Question: I'm trying to run a scraper whose output log ends as follows:

2017-04-25 20:22:22 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.apkmirror.com/apk/instagram/instagram-instagram/instagram-instagram-9-0-0-34920-release/instagram-9-0-0-4-android-apk-download/>: HTTP status code is not handled or not allowed
2017-04-25 20:22:22 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-25 20:22:22 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {
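One common approach (a sketch, not necessarily the accepted answer) is to make the retry middleware handle 429 and slow the crawl down; these are real Scrapy settings, but the values are illustrative:

# settings.py — illustrative values
RETRY_ENABLED = True
RETRY_HTTP_CODES = [429]      # retry 429s instead of ignoring them (replaces the default list)
RETRY_TIMES = 5

DOWNLOAD_DELAY = 2            # base delay between requests, in seconds
AUTOTHROTTLE_ENABLED = True   # back off automatically when the server pushes back
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 60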

I'm getting JavaScript code instead of rendered HTML content with scrapy-splash

二次信任 submitted on 2020-07-03 17:30:08
Question: I'm trying to use scrapy-splash to load a JavaScript-based page to get the rendered HTML content of the page, but all I get is JavaScript code as a response. Why doesn't my spider execute the JavaScript code of the page? These are my Scrapy settings:

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    'scrapy
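With those middlewares configured, a frequent cause of this symptom is still issuing plain scrapy.Request objects, which bypass Splash. A minimal sketch using SplashRequest instead (URL and wait time are assumptions):

import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    name = 'jspage'

    def start_requests(self):
        # SplashRequest routes the fetch through the Splash server, which
        # executes the page's JavaScript before returning rendered HTML
        yield SplashRequest(
            'https://example.com/js-page',   # illustrative URL
            callback=self.parse,
            args={'wait': 2},                # give scripts time to run
        )

    def parse(self, response):
        self.logger.info(response.css('title::text').get())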

Scrapy encounters DEBUG: Crawled (400)

假装没事ソ submitted on 2020-07-03 13:06:04
Question: I'm trying to scrape the page 'https://zhuanlan.zhihu.com/wangzhenotes' with Scrapy. I run this command:

scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes'

and got:

DEBUG: Crawled (400) <GET https://zhuanlan.zhihu.com/wangzhenotes> (referer: None)

I guess I'm encountering some kind of anti-scraping measure. How do I know what techniques the site is using? Here is the full log:

(base) $ scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes'
2020-07-01 09:46:03 [scrapy.utils.log] INFO: Scrapy 2.1
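A quick first check (a sketch assuming the 400 is triggered by Scrapy's default headers, which may not be the whole story for this site) is to re-fetch from within scrapy shell using browser-like headers:

# Inside scrapy shell; the header values are illustrative assumptions
from scrapy import Request

req = Request(
    'https://zhuanlan.zhihu.com/wangzhenotes',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    },
)
fetch(req)  # shell helper: re-runs the request and rebinds `response`

If the page then returns 200, the block is header-based; if not, the site may be checking cookies or requiring JavaScript execution.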

Using Scrapy LinkExtractor() to locate specific domain extensions

和自甴很熟 submitted on 2020-06-29 11:54:51
Question: I want to use Scrapy's LinkExtractor() to only follow links in the .th domain. I see there is a deny_extensions(list) parameter, but no allow_extensions() parameter. Given that, how do I restrict links to allow only domains in .th?

Answer 1: deny_extensions is there to filter out URLs ending with .gz, .exe, and so on. You are probably looking for allow_domains:

allow_domains (str or list) – a single value or a list of strings containing the domains which will be considered for extracting the links

deny
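Since allow_domains matches exact domain names rather than TLD patterns, another option (an assumption, not part of the quoted answer) is the allow parameter, a regex applied to the full URL:

from scrapy.linkextractors import LinkExtractor

# Illustrative regex: keep only URLs whose host ends in .th
th_links = LinkExtractor(allow=r'^https?://[^/]+\.th(/|$)')
# links = th_links.extract_links(response)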