scrapy

Scrapy crawler is being blocked and gets 404

Submitted by 偶尔善良 on 2021-01-29 06:00:56
Question: I'm trying to scrape the page 'https://zhuanlan.zhihu.com/wangzhenotes' with Scrapy, using the configuration at the end of this post. The command scrapy shell 'https://zhuanlan.zhihu.com/wangzhenotes' gets me 2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://zhuanlan.zhihu.com/robots.txt> (referer: None) 2020-07-02 05:50:04 [protego] DEBUG: Rule at line 19 without any user agent to enforce it on. ... 2020-07-02 05:50:04 [scrapy.core.engine] DEBUG: Crawled
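The 404 on /robots.txt itself is usually harmless (it just means the site has no robots file), but zhuanlan.zhihu.com is known to reject requests that don't look like a browser. A minimal settings.py sketch, assuming the default Scrapy User-Agent is the trigger (the UA string and headers below are illustrative, not taken from the post):

```python
# settings.py — send a browser-like User-Agent; Zhihu tends to answer
# 404/403 to the default "Scrapy/x.y" UA. The exact string is illustrative.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0 Safari/537.36"
)
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en",
}
# Note: a 404 for /robots.txt is not an error — Scrapy then crawls as if
# everything were allowed.
```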

Long chain of exceptions in scrapy splash application

Submitted by 醉酒当歌 on 2021-01-29 05:36:12
Question: My Scrapy application is outputting this long chain of exceptions, and I am failing to see what the issue is; the last one has me especially confused. Before I explain why, here is the chain: 2020-11-04 17:38:58,394:ERROR:Error while obtaining start requests Traceback (most recent call last): File "C:\Users\lguarro\Anaconda3\envs\virtual_workspace\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen httplib_response = self._make_request( File "C:\Users\lguarro\Anaconda3\envs
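A traceback that starts inside urllib3's connectionpool in a scrapy-splash project very often means the Splash container is simply not reachable at SPLASH_URL. A small stdlib pre-flight check can confirm that before digging into the exception chain (the URL and port are the common Splash defaults, not taken from the post):

```python
# Pre-flight check: is anything accepting TCP connections at the Splash URL?
import socket
from urllib.parse import urlparse

def splash_reachable(url="http://localhost:8050", timeout=2.0):
    """Return True if a TCP connection to the Splash host/port succeeds."""
    parsed = urlparse(url)
    try:
        with socket.create_connection(
            (parsed.hostname, parsed.port or 80), timeout=timeout
        ):
            return True
    except OSError:
        return False
```

If this returns False, the fix is operational (start the Splash container, e.g. via Docker) rather than anything in the spider code.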

Scrapy exporting weird symbols into csv file

Submitted by 你说的曾经没有我的故事 on 2021-01-29 04:32:14
Question: OK, so here's the issue. I'm a beginner who has just started to delve into Scrapy/Python. I use the code below to scrape a website and save the results into a CSV. When I look in the command prompt, it turns words like Officiële into Offici\xele. In the CSV file, it changes it to officiële. I think this is because it's saving as escaped Unicode instead of UTF-8? However, I have no clue how to change my code, and I've been trying all morning so far. Could anyone help me out here? I'm specifically
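The two symptoms have two separate causes: the \x form in the console is just Python's escaped repr of a non-ASCII string, while the garbled CSV is mojibake from reading UTF-8 bytes with a different codec. A short sketch reproducing the second symptom (cp1252 is assumed here as the codec the CSV viewer used):

```python
# UTF-8 bytes re-read with a Windows codec turn "Officiële" into mojibake.
text = "Officiële"
mojibake = text.encode("utf-8").decode("cp1252")
print(mojibake)  # OfficiÃ«le

# The Scrapy-side fix is a setting, not a code change — in settings.py:
# FEED_EXPORT_ENCODING = "utf-8"
```

With FEED_EXPORT_ENCODING set, the exporter writes real UTF-8; the remaining step is opening the CSV with a UTF-8-aware tool (or importing it into Excel with UTF-8 selected).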

Scrapy shell works but actual script returns 404 error

Submitted by 淺唱寂寞╮ on 2021-01-29 04:17:54
Question: scrapy shell http://www.zara.com/us returns a correct 200 code: 2017-01-05 18:34:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: zara) 2017-01-05 18:34:20 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'zara.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['zara.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'zara', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'zara (+http://www.yourdomain.com)'} 2017-01-05 18
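Two things stand out in the overridden settings: HTTPCACHE_ENABLED is True, so a 404 received once can be replayed from the cache even after the site starts answering 200, and the User-Agent is the non-browser default 'zara (+http://www.yourdomain.com)'. A hedged settings.py sketch covering both suspects (the UA string is illustrative):

```python
# settings.py — usual suspects when `scrapy shell` gets 200 but the spider 404s:
HTTPCACHE_ENABLED = False  # or delete the project's .scrapy/httpcache folder
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0 Safari/537.36"  # illustrative browser UA
)
```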

Problems with passing global variables in a Python scrapy project

Submitted by 寵の児 on 2021-01-29 03:29:37
Question: In a Scrapy project I am working on, I am having difficulty sending a variable containing a list from one function to another. I need to do so because I have to combine the values from one page with those from another at the end of the script. The code is as follows: from scrapy.spider import Spider from scrapy.selector import Selector from scrapy.http import Request from scrapy.http.request import Request from dirbot.items import Website from scrapy.contrib.spiders import CrawlSpider,Rule from six
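The idiomatic alternative to a global in Scrapy is to attach the data to the next request, via Request.meta (or cb_kwargs in newer versions), so the second callback receives it as an argument. A plain-Python sketch of that hand-off; function and field names are illustrative, not from the post:

```python
# Simulates Scrapy's cb_kwargs hand-off between two callbacks: the first
# packages its list into the "request", the second receives it as an argument.
def parse(first_page_values):
    # instead of storing in a global, attach the list to the next request
    return {"callback": parse_combined,
            "cb_kwargs": {"first_page": first_page_values}}

def parse_combined(second_page_values, first_page):
    return first_page + second_page_values

request = parse(["a", "b"])
combined = request["callback"](["c"], **request["cb_kwargs"])
print(combined)  # ['a', 'b', 'c']
```

In a real spider the equivalent line would be `yield scrapy.Request(url, callback=self.parse_combined, cb_kwargs={"first_page": values})`.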

Scrapy: extract text with special characters

Submitted by 感情迁移 on 2021-01-28 19:33:48
Question: I'm using Scrapy to extract text from some Spanish websites. Naturally, the text is written in Spanish, and some words contain special characters like 'ñ' or 'í'. My problem is that when I run scrapy crawl econoticia -o prueba.json from the command line to get the file with the scraped data, some characters are not shown properly. For example: this is the original text "La exministra, procesada como partícipe a titulo lucrativo, intenta burlar a los fotógrafos" And this is the text scraped
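Scrapy's JSON exporter escapes non-ASCII characters by default, which is why a word like "fotógrafos" shows up as fot\u00f3grafos in prueba.json. The stdlib equivalent makes the behaviour and the switch easy to see:

```python
import json

# Default JSON serialization escapes non-ASCII; ensure_ascii=False keeps it.
escaped = json.dumps("fotógrafos")
readable = json.dumps("fotógrafos", ensure_ascii=False)
print(escaped)   # "fot\u00f3grafos"
print(readable)  # "fotógrafos"

# The corresponding switch in Scrapy is a setting in settings.py:
# FEED_EXPORT_ENCODING = "utf-8"
```

Note the escaped form is still valid JSON — any JSON parser reads it back as the correct Spanish text — so this only matters when humans read the file directly.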

How to Specify different Process settings for two different spiders in CrawlerProcess Scrapy?

Submitted by 房东的猫 on 2021-01-28 16:42:05
Question: I have two spiders that I want to execute in parallel. I used a CrawlerProcess instance and its crawl method to achieve this. However, I want to specify a different output file, i.e. FEED_URI, for each spider in the same process. I tried to loop over the spiders and run them as shown below. Although two different output files are generated, the process terminates as soon as the second spider completes execution. If the first spider completes crawling before the second one, I get the desired
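A configuration sketch of the usual pattern: schedule both spiders on one CrawlerProcess before calling start(), and put each spider's output feed in its own custom_settings instead of looping. Spider names and feed paths are hypothetical; FEEDS is the Scrapy 2.1+ form of the older FEED_URI/FEED_FORMAT pair:

```python
# Schedule both crawls first, then start() once — the reactor then runs
# until *both* spiders finish. Names and paths are hypothetical.
import scrapy
from scrapy.crawler import CrawlerProcess

class SpiderOne(scrapy.Spider):
    name = "one"
    custom_settings = {"FEEDS": {"one.json": {"format": "json"}}}

class SpiderTwo(scrapy.Spider):
    name = "two"
    custom_settings = {"FEEDS": {"two.json": {"format": "json"}}}

process = CrawlerProcess()
process.crawl(SpiderOne)
process.crawl(SpiderTwo)
process.start()  # blocks until both crawls are done
```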

Get All Spiders Class name in Scrapy

Submitted by 拟墨画扇 on 2021-01-28 12:45:49
Question: In an older version we could get the list of spiders (spider names) with the following code, but in the current version (1.4) I get [py.warnings] WARNING: run-all-spiders.py:17: ScrapyDeprecationWarning: CrawlerRunner.spiders attribute is renamed to CrawlerRunner.spider_loader. for spider_name in process.spiders.list(): # list all the available spiders in my project Use crawler.spiders.list() : >>> for spider_name in crawler.spiders.list(): ... print(spider_name) How can I get spiders
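The deprecation warning names the replacement itself: the attribute is now spider_loader, with the same list() method. A sketch assuming the process is built from the project's own settings:

```python
# List all spider names registered in the current Scrapy project.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for spider_name in process.spider_loader.list():  # was: process.spiders.list()
    print(spider_name)
```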

Scrapy - NameError: name 'items' is not defined

Submitted by 浪子不回头ぞ on 2021-01-28 12:21:00
Question: I'm trying to fill my Items with parsed data, and I'm getting the error: item = items() NameError: name 'items' is not defined when I run scrapy crawl usa_florida_scrapper. Here's my spider's code: import scrapy import re class UsaFloridaScrapperSpider(scrapy.Spider): name = 'usa_florida_scrapper' start_urls = ['https://www.txlottery.org/export/sites/lottery/Games/index.html'] def parse(self, response): item = items() print('++++++ Latest Results for Powerball ++++++++++') power_ball_html =
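The NameError arises because `items` is the name of the project module (items.py), not a class, and it is never imported in the spider. The fix is to import the Item subclass defined there and instantiate that. A sketch in which the project and class names are placeholders, since the post doesn't show items.py:

```python
# Import the Item subclass from the project's items.py and instantiate it.
# "myproject" and "LotteryItem" are placeholders for whatever the project defines.
import scrapy
from myproject.items import LotteryItem

class UsaFloridaScrapperSpider(scrapy.Spider):
    name = "usa_florida_scrapper"
    start_urls = ["https://www.txlottery.org/export/sites/lottery/Games/index.html"]

    def parse(self, response):
        item = LotteryItem()  # not items() — that name was never defined
        ...
```

Alternatively, Scrapy also accepts plain dicts: `yield {"results": ...}` works without any Item class at all.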

Passing a request to a different spider

Submitted by 强颜欢笑 on 2021-01-28 10:50:35
Question: I'm working on a web crawler (using Scrapy) that uses 2 different spiders: A very generic spider that can crawl (almost) any website, using a bunch of heuristics to extract data. A specialized spider capable of crawling a particular website A that can't be crawled with the generic spider because of the website's peculiar structure (that website has to be crawled). Everything works nicely so far, but website A contains links to other, "ordinary" websites that should be scraped too (using spider 1). Is
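Scrapy spiders can't hand requests to one another at runtime, so a common workaround is to merge both behaviours into one spider and route each response to the specialized or generic parse method by domain. A sketch of the routing helper; the domain set and method names are hypothetical:

```python
# Route URLs to the specialized or generic parser within a single spider.
from urllib.parse import urlparse

SPECIALIZED_DOMAINS = {"site-a.example"}  # hypothetical "website A"

def pick_callback(url):
    """Return the name of the parse method to use for this URL."""
    host = urlparse(url).hostname or ""
    return "parse_specialized" if host in SPECIALIZED_DOMAINS else "parse_generic"

print(pick_callback("https://site-a.example/page"))  # parse_specialized
print(pick_callback("https://example.org/page"))     # parse_generic
```

Inside the spider, parse() would look up the method with `getattr(self, pick_callback(response.url))` and delegate to it; the specialized callback can freely yield requests whose callback is the generic one.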