scrapy

Scrapy request not passing to callback when 301?

百般思念, submitted 2019-12-21 22:24:47
Question: I'm trying to update a database full of links to external websites. For some reason, it skips the callback when the request/website is moved (flagged 301):

```python
def start_requests(self):
    # ... database stuff
    for x in xrange(0, numrows):
        row = cur.fetchone()
        item = exampleItem()
        item['real_id'] = row[0]
        item['product_id'] = row[1]
        url = "http://www.example.com/a/-" + item['real_id'] + ".htm"
        log.msg("item %d request URL is %s" % (item['product_id'], url), log.INFO)  # shows the right URL
        request =
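A hedged sketch of two common fixes, assuming a spider like the one in the question (the helper name `build_url` and the example id are illustrative, not from the post). By default Scrapy's redirect middleware follows a 301 and only the final response reaches the callback; to receive the 301 response itself, list it in `handle_httpstatus_list` or set `dont_redirect`:

```python
# Illustrative helper (hypothetical): str() guards against a TypeError when
# real_id comes back from the database as an int rather than a string.
def build_url(real_id):
    return "http://www.example.com/a/-" + str(real_id) + ".htm"

# Inside the spider (shown as comments so this sketch runs without Scrapy):
#
# class ExampleSpider(scrapy.Spider):
#     handle_httpstatus_list = [301, 302]      # callback now sees 3xx responses
#
#     def start_requests(self):
#         yield scrapy.Request(build_url(123), callback=self.parse_item,
#                              meta={'dont_redirect': True})

print(build_url(123))
```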

How to clear cookies in scrapy?

落花浮王杯, submitted 2019-12-21 20:47:19
Question: By default, scrapy stores and passes cookies along with requests. But how do I access or clear the stored cookies at a certain point in the spider? Thanks!

Answer 1: To set cookies for a specific request, use the request's cookies field. For example, from the docs:

```python
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
```

To access request cookies: request.headers.getlist('Cookie'); response cookies: response.headers.getlist('Set-Cookie'). For more details see cookies
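The answer above covers setting cookies; for clearing or isolating them mid-crawl, here is a hedged sketch of the two request-level knobs Scrapy's default CookiesMiddleware honors, shown as plain dicts so the pattern is visible without running a crawler:

```python
# meta keys understood by Scrapy's default CookiesMiddleware:
fresh_request_meta = {"dont_merge_cookies": True}  # ignore the stored jar for this request
separate_jar_meta = {"cookiejar": "session-2"}     # route into a named (initially empty) jar

# In a spider these become, e.g.:
#   yield scrapy.Request(url, meta={"cookiejar": "session-2"}, callback=self.parse)
```

Using a fresh `cookiejar` value effectively gives you a clean cookie state without touching the jars already in use.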

Scrapy Extract number from page text with regex

↘锁芯ラ, submitted 2019-12-21 19:52:59
Question: I have been looking for a few hours for how to search all text on a page and, if it matches a regex, extract it. My spider is set up as follows:

```python
def parse(self, response):
    title = response.xpath('//title/text()').extract()
    units = response.xpath('//body/text()').re(r"Units: (\d)")
    print title, units
```

I would like to pull out the number after "Units: " on the pages. When I run scrapy on a page with Units: 351 in the body, I only get the title of the page with a bunch of escapes before and
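One likely culprit, sketched here with the stdlib `re` module on stand-in page text (the sample body is assumed, not from the post): the pattern `(\d)` captures exactly one digit, so `Units: 351` would yield only `"3"`. Use `\d+`, and consider `//body//text()` so text inside nested tags is searched too:

```python
import re

# Stand-in for the text nodes returned by //body//text() (assumed content).
body_text = "Some heading\nUnits: 351\nmore text"

# \d+ captures the whole number; \s* tolerates variable whitespace.
units = re.findall(r"Units:\s*(\d+)", body_text)

# In the spider this becomes:
#   units = response.xpath('//body//text()').re(r"Units:\s*(\d+)")
print(units)
```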

Getting gcc failed error while installing scrapy

社会主义新天地, submitted 2019-12-21 17:42:02
Question: When I install scrapy I get the error below (command 'gcc' failed with exit status 1). I am using CentOS, and yes, I have the latest version of gcc installed, but I am not sure why I am getting this error. I tried googling it but couldn't find a solution.

```
OpenSSL/crypto/crypto.c: In function ‘initcrypto’:
OpenSSL/crypto/crypto.c:817: warning: implicit declaration of function ‘ERR_load_crypto_strings’
OpenSSL/crypto/crypto.c:818: warning: implicit declaration of function
```
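In similar reports this gcc "failure" is usually caused by missing development headers for the pyOpenSSL build, not by a broken compiler. A hedged sketch of the usual CentOS fix (package names assume yum; adjust for your distro and Python version):

```shell
# Install the headers the OpenSSL/crypto C sources compile against,
# then retry the scrapy install.
sudo yum install -y gcc python-devel openssl-devel libffi-devel
pip install --upgrade pip setuptools
pip install scrapy
```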

Memory Leak in Scrapy

谁说胖子不能爱, submitted 2019-12-21 16:55:08
Question: I wrote the following code to scrape for email addresses (for testing purposes):

```python
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector
from crawler.items import EmailItem

class LinkExtractorSpider(CrawlSpider):
    name = 'emailextractor'
    start_urls = ['http://news.google.com']

    rules = (Rule(LinkExtractor(), callback='process_item', follow=True),)

    def process_item(self, response):
        refer =
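One pattern worth checking in long email crawls like this (a hedged, generic sketch; `mark_seen` is illustrative and not from the post): an unbounded set of full strings grows without limit, while storing fixed-size digests caps the memory used per entry:

```python
import hashlib

seen = set()

def mark_seen(email):
    """Return True the first time an address is seen, False afterwards.

    Stores a 20-byte SHA-1 digest instead of the full string, so memory per
    entry stays constant no matter how long the scraped addresses are.
    """
    digest = hashlib.sha1(email.encode("utf-8")).digest()
    if digest in seen:
        return False
    seen.add(digest)
    return True

print(mark_seen("a@example.com"), mark_seen("a@example.com"))
```

For diagnosing the leak itself, Scrapy's telnet console and `scrapy.utils.trackref` are the usual starting points.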

Scrapy: Wait for a specific url to be parsed before parsing others

别来无恙, submitted 2019-12-21 16:52:44
Question: Brief explanation: I have a Scrapy project that takes stock data from Yahoo! Finance. For my project to work, I need to ensure that a stock has been around for a desired amount of time. I do this by scraping CAT (Caterpillar Inc. (CAT) - NYSE) first, getting the number of closing prices there are for that time period, and then ensuring that all stocks scraped after that have the same number of closing prices as CAT, thus ensuring that a stock has been publicly traded for the desired
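A hedged sketch of the usual ordering trick (the URLs, symbols, and helper name are illustrative, not from the project): request the reference ticker alone, and only yield the remaining requests from its callback, so they cannot be scheduled before CAT has been parsed:

```python
# Illustrative URLs (not from the project).
cat_url = "https://finance.yahoo.com/quote/CAT/history"
other_urls = ["https://finance.yahoo.com/quote/%s/history" % sym
              for sym in ("AAPL", "DE")]

# In the spider (comments so the sketch runs without Scrapy):
#
# def start_requests(self):
#     yield scrapy.Request(cat_url, callback=self.parse_cat)
#
# def parse_cat(self, response):
#     self.expected_rows = count_closing_prices(response)  # hypothetical helper
#     for url in other_urls:          # released only after CAT is parsed
#         yield scrapy.Request(url, callback=self.parse_stock)
```

Because the other requests are not created until `parse_cat` runs, no concurrency setting can reorder them ahead of CAT.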

Web scraping: how to install scrapy under python3?

帅比萌擦擦*, submitted 2019-12-21 14:04:59
Running pip3 install scrapy directly reports many errors, so try the following steps instead.
(1) From the third-party Python library index at https://www.lfd.uci.edu/~gohlke/pythonlibs/ download three packages: lxml, twisted, and scrapy. [Download the builds matching your machine and Python version.]
(2) cd into the folder holding the three packages and pip3 install each of the three .whl files in turn. For example:

```
pip3 install Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip3 install Scrapy-1.4.0-py2.py3-none-any.whl
```

(3) Once everything is installed, type scrapy at the command line and check whether the usage prompt appears. If it does, the installation succeeded; if not, go to step (4).
(4) If it did not succeed, download the matching pywin32 installer from https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/ and install it. [Again, match your machine and Python version.]
Source: https://www.cnblogs.com/xubing-613/p/8108425.html

Installing scrapy on win10

蹲街弑〆低调, submitted 2019-12-21 14:04:51
Installing scrapy on win10 does not work in one shot by following the official manual (http://doc.scrapy.org/en/1.0/intro/install.html). Based on the problems I ran into during my own installation, I have organized the process:
1. Download and install python2.7.11 from https://www.python.org/
2. After installation, add the install path and the Scripts path to PATH, e.g.: C:\Python27\;C:\Python27\Scripts\;
3. Install pywin32; download the latest version from http://sourceforge.net/projects/pywin32/files/pywin32/
While installing pywin32 I hit the error: python version 2.7 required, which was not found in the registry. This happens because the registry does not recognize python2.7. The fix is to create a register.py file and run the following code:

```python
#
# script to register Python 2.0 or later for use with win32all
# and other extensions that require Python registry settings
#
# written by

Scrapy 'module' object has no attribute 'Spider' error

拥有回忆, submitted 2019-12-21 14:04:35
1. Installation: pip3 install scrapy
   The install failed because of "Failed building wheel for Twisted", so I installed Twisted separately by downloading Twisted-16.6.0-cp35-cp35m-win_amd64.whl and running:
   pip3 install 目录\Twisted-16.6.0-cp35-cp35m-win_amd64.whl
   That succeeded, and running pip3 install scrapy again worked.
2. Create a new project:
   cd to the project path
   scrapy startproject project-name
   This produced the warning: Module `scrapy.conf` is deprecated, use `crawler.settings` attribute instead. So the old `from scrapy.conf import settings` has to go; I changed it to: from scrapy.settings import Settings
   It also produced: Module `scrapy.log` has been deprecated, Scrapy now relies on the builtin Python library for logging. Read the updated logging entry in the
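The replacements the two deprecation warnings point at can be sketched as follows (hedged: the spider code is shown as comments, and only the stdlib logging part runs on its own):

```python
import logging

# scrapy.log replacement: plain stdlib logging at module level...
logger = logging.getLogger(__name__)

# ...and inside a spider, Scrapy >= 1.0 exposes both directly:
#
# class MySpider(scrapy.Spider):
#     def parse(self, response):
#         delay = self.settings.getfloat("DOWNLOAD_DELAY")  # replaces scrapy.conf settings
#         self.logger.info("download delay: %s", delay)     # replaces scrapy.log
```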

Scrapy error: Microsoft Visual C++ 10.0 is required.

。_饼干妹妹, submitted 2019-12-21 14:04:16
http://blog.csdn.net/cs123951/article/details/52618873
On win10 64-bit with python3.4, installing scrapy via pip install scrapy fails with:
error: Microsoft Visual C++ 10.0 is required. Get it with "Microsoft Windows SDK 7.1": www.microsoft.com/download/details.aspx?id=8279
Downloading SDK 7.1 did not help, so I downloaded scrapy's two dependencies, lxml and twisted, from http://www.lfd.uci.edu/~gohlke/pythonlibs/ and installed each separately:

```
pip install Twisted-16.4.1-cp34-cp34m-win_amd64.whl
pip install lxml-3.6.4-cp34-cp34m-win_amd64.whl
```

Finally, install scrapy itself: pip install scrapy — success!
For a first scrapy project, see:
English: http://doc.scrapy.org/en/latest/intro/tutorial.html#creating-a-project
Chinese: http://scrapy-chs.readthedocs.io