scrapy

Scrapy crawler is being blocked and gets 404

Submitted by 偶尔善良 on 2021-01-29 06:00:56
Question: I'm trying to scrape the page 'https://zhuanlan.zhihu.com/wangzhenotes' with Scrapy, using the configuration at the end of this post. The command scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes' gets me 2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://zhuanlan.zhihu.com/robots.txt> (referer: None) 2020-07-02 05:50:04 [protego] DEBUG: Rule at line 19 without any user agent to enforce it on. ... 2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled
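The 404 on /robots.txt itself is usually harmless (it just means the site has no robots file), but zhuanlan.zhihu.com is known to reject requests that don't look like a browser. A minimal settings.py sketch, assuming the default Scrapy User-Agent is the trigger (the UA string and headers below are illustrative, not taken from the post):

```python
# settings.py — send a browser-like User-Agent; Zhihu tends to answer
# 404/403 to the default "Scrapy/x.y" UA. The exact string is illustrative.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0 Safari/537.36"
)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en",
}
# Note: a 404 for /robots.txt is not an error — Scrapy then crawls as if
# everything were allowed.
```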

Long chain of exceptions in scrapy splash application

Submitted by 醉酒当歌 on 2021-01-29 05:36:12
Question: My Scrapy application is outputting this long chain of exceptions, and I am failing to see what the issue is; the last one has me especially confused. Before I explain why, here is the chain: 2020-11-04 17:38:58,394:ERROR:Error while obtaining start requests Traceback (most recent call last): File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen httplib_response = self._make_request( File "C:\Users\lguarro\Anaconda3\envs
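A traceback that starts inside urllib3's connectionpool in a scrapy-splash project very often means the Splash container is simply not reachable at SPLASH_URL. A small stdlib pre-flight check can confirm that before digging into the exception chain (the URL and port are the common Splash defaults, not taken from the post):

```python
# Pre-flight check: is anything accepting TCP connections at the Splash URL?
import socket
from urllib.parse import urlparse

def splash_reachable(url="http://localhost:8050", timeout=2.0):
    """Return True if a TCP connection to the Splash host/port succeeds."""
    parsed = urlparse(url)
    try:
        with socket.create_connection(
            (parsed.hostname, parsed.port or 80), timeout=timeout
        ):
            return True
    except OSError:
        return False
```

If this returns False, the fix is operational (start the Splash container, e.g. via Docker) rather than anything in the spider code.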

Scrapy exporting weird symbols into csv file

Submitted by 你说的曾经没有我的故事 on 2021-01-29 04:32:14
Question: OK, so here's the issue. I'm a beginner who has just started to delve into Scrapy/Python. I use the code below to scrape a website and save the results into a CSV. When I look in the command prompt, it turns words like Officiële into Offici\xele. In the CSV file, it changes it to officiële. I think this is because it's saving as escaped Unicode instead of UTF-8? However, I have no clue how to change my code, and I've been trying all morning so far. Could anyone help me out here? I'm specifically
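The two symptoms have two separate causes: the \x form in the console is just Python's escaped repr of a non-ASCII string, while the garbled CSV is mojibake from reading UTF-8 bytes with a different codec. A short sketch reproducing the second symptom (cp1252 is assumed here as the codec the CSV viewer used):

```python
# UTF-8 bytes re-read with a Windows codec turn "Officiële" into mojibake.
text = "Officiële"
mojibake = text.encode("utf-8").decode("cp1252")
print(mojibake)  # OfficiÃ«le

# The Scrapy-side fix is a setting, not a code change — in settings.py:
# FEED_EXPORT_ENCODING = "utf-8"
```

With FEED_EXPORT_ENCODING set, the exporter writes real UTF-8; the remaining step is opening the CSV with a UTF-8-aware tool (or importing it into Excel with UTF-8 selected).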

Scrapy shell works but actual script returns 404 error

Submitted by 淺唱寂寞╮ on 2021-01-29 04:17:54
Question: scrapy shell http://www.zara.com/us returns a correct 200 code: 2017-01-05 18:34:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: zara) 2017-01-05 18:34:20 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'zara.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['zara.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'zara', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'zara (+http://www.yourdomain.com)'} 2017-01-05 18
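Two things stand out in the overridden settings: HTTPCACHE_ENABLED is True, so a 404 received once can be replayed from the cache even after the site starts answering 200, and the User-Agent is the non-browser default 'zara (+http://www.yourdomain.com)'. A hedged settings.py sketch covering both suspects (the UA string is illustrative):

```python
# settings.py — usual suspects when `scrapy shell` gets 200 but the spider 404s:
HTTPCACHE_ENABLED = False  # or delete the project's .scrapy/httpcache folder
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0 Safari/537.36"  # illustrative browser UA
)
```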

Problems with passing global variables in a Python scrapy project

Submitted by 寵の児 on 2021-01-29 03:29:37
Question: In a Scrapy project I am working on, I am having difficulty sending a variable containing a list from one function to another. I need to do so because I have to combine the values from one page with those from another at the end of the script. The code is as follows: from scrapy.spider import Spider from scrapy.selector import Selector from scrapy.http import Request from scrapy.http.request import Request from dirbot.items import Website from scrapy.contrib.spiders import CrawlSpider,Rule from six
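The idiomatic alternative to a global in Scrapy is to attach the data to the next request, via Request.meta (or cb_kwargs in newer versions), so the second callback receives it as an argument. A plain-Python sketch of that hand-off; function and field names are illustrative, not from the post:

```python
# Simulates Scrapy's cb_kwargs hand-off between two callbacks: the first
# packages its list into the "request", the second receives it as an argument.
def parse(first_page_values):
    # instead of storing in a global, attach the list to the next request
    return {"callback": parse_combined,
            "cb_kwargs": {"first_page": first_page_values}}

def parse_combined(second_page_values, first_page):
    return first_page + second_page_values

request = parse(["a", "b"])
combined = request["callback"](["c"], **request["cb_kwargs"])
print(combined)  # ['a', 'b', 'c']
```

In a real spider the equivalent line would be `yield scrapy.Request(url, callback=self.parse_combined, cb_kwargs={"first_page": values})`.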

Scrapy: extract text with special characters

Submitted by 感情迁移 on 2021-01-28 19:33:48
Question: I'm using Scrapy to extract text from some Spanish websites. Naturally, the text is written in Spanish, and some words contain special characters like 'ñ' or 'í'. My problem is that when I run scrapy crawl econoticia -o prueba.json from the command line to get the file with the scraped data, some characters are not shown properly. For example: this is the original text "La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos" And this is the text scraped
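Scrapy's JSON exporter escapes non-ASCII characters by default, which is why a word like "fotógrafos" shows up as fot\u00f3grafos in prueba.json. The stdlib equivalent makes the behaviour and the switch easy to see:

```python
import json

# Default JSON serialization escapes non-ASCII; ensure_ascii=False keeps it.
escaped = json.dumps("fotógrafos")
readable = json.dumps("fotógrafos", ensure_ascii=False)
print(escaped)   # "fot\u00f3grafos"
print(readable)  # "fotógrafos"

# The corresponding switch in Scrapy is a setting in settings.py:
# FEED_EXPORT_ENCODING = "utf-8"
```

Note the escaped form is still valid JSON — any JSON parser reads it back as the correct Spanish text — so this only matters when humans read the file directly.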

How to Specify different Process settings for two different spiders in CrawlerProcess Scrapy?

Submitted by 房东的猫 on 2021-01-28 16:42:05
Question: I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Although two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider completes crawling before the second one, I get the desired
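A configuration sketch of the usual pattern: schedule both spiders on one CrawlerProcess before calling start(), and put each spider's output feed in its own custom_settings instead of looping. Spider names and feed paths are hypothetical; FEEDS is the Scrapy 2.1+ form of the older FEED_URI/FEED_FORMAT pair:

```python
# Schedule both crawls first, then start() once — the reactor then runs
# until *both* spiders finish. Names and paths are hypothetical.
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderOne(scrapy.Spider):
    name = "one"
    custom_settings = {"FEEDS": {"one.json": {"format": "json"}}}

class SpiderTwo(scrapy.Spider):
    name = "two"
    custom_settings = {"FEEDS": {"two.json": {"format": "json"}}}

process = CrawlerProcess()
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()  # blocks until both crawls are done
```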

Get All Spiders Class name in Scrapy

Submitted by 拟墨画扇 on 2021-01-28 12:45:49
Question: In an older version we could get the list of spiders (spider names) with the following code, but in the current version (1.4) I get [py.warnings] WARNING: run-all-spiders.py:17: ScrapyDeprecationWarning: CrawlerRunner.spiders attribute is renamed to CrawlerRunner.spider_loader. for spider_name in process.spiders.list(): # list all the available spiders in my project Use crawler.spiders.list() : >>> for spider_name in crawler.spiders.list(): ... print(spider_name) How can I get spiders
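The deprecation warning names the replacement itself: the attribute is now spider_loader, with the same list() method. A sketch assuming the process is built from the project's own settings:

```python
# List all spider names registered in the current Scrapy project.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for spider_name in process.spider_loader.list():  # was: process.spiders.list()
    print(spider_name)
```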

Scrapy - NameError: name 'items' is not defined

Submitted by 浪子不回头ぞ on 2021-01-28 12:21:00
Question: I'm trying to fill my Items with parsed data, and I'm getting the error: item = items() NameError: name 'items' is not defined when I run scrapy crawl usa_florida_scrapper. Here's my spider's code: import scrapy import re class UsaFloridaScrapperSpider(scrapy.Spider): name = 'usa_florida_scrapper' start_urls = ['https://www.txlottery.org/export/sites/lottery/Games/index.html'] def parse(self, response): item = items() print('++++++ Latest Results for Powerball ++++++++++') power_ball_html =
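The NameError arises because `items` is the name of the project module (items.py), not a class, and it is never imported in the spider. The fix is to import the Item subclass defined there and instantiate that. A sketch in which the project and class names are placeholders, since the post doesn't show items.py:

```python
# Import the Item subclass from the project's items.py and instantiate it.
# "myproject" and "LotteryItem" are placeholders for whatever the project defines.
import scrapy
from myproject.items import LotteryItem

class UsaFloridaScrapperSpider(scrapy.Spider):
    name = "usa_florida_scrapper"
    start_urls = ["https://www.txlottery.org/export/sites/lottery/Games/index.html"]

    def parse(self, response):
        item = LotteryItem()  # not items() — that name was never defined
        ...
```

Alternatively, Scrapy also accepts plain dicts: `yield {"results": ...}` works without any Item class at all.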

Passing a request to a different spider

Submitted by 强颜欢笑 on 2021-01-28 10:50:35
Question: I'm working on a web crawler (using Scrapy) that uses 2 different spiders: A very generic spider that can crawl (almost) any website, using a bunch of heuristics to extract data. A specialized spider capable of crawling a particular website A that can't be crawled with the generic spider because of the website's peculiar structure (that website has to be crawled). Everything works nicely so far, but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is
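Scrapy spiders can't hand requests to one another at runtime, so a common workaround is to merge both behaviours into one spider and route each response to the specialized or generic parse method by domain. A sketch of the routing helper; the domain set and method names are hypothetical:

```python
# Route URLs to the specialized or generic parser within a single spider.
from urllib.parse import urlparse

SPECIALIZED_DOMAINS = {"site-a.example"}  # hypothetical "website A"

def pick_callback(url):
    """Return the name of the parse method to use for this URL."""
    host = urlparse(url).hostname or ""
    return "parse_specialized" if host in SPECIALIZED_DOMAINS else "parse_generic"

print(pick_callback("https://site-a.example/page"))  # parse_specialized
print(pick_callback("https://example.org/page"))     # parse_generic
```

Inside the spider, parse() would look up the method with `getattr(self, pick_callback(response.url))` and delegate to it; the specialized callback can freely yield requests whose callback is the generic one.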