scrapy

Initializing pipeline object with crawler in scrapy

社会主义新天地 submitted on 2019-12-20 04:11:10
Question: Based on Scrapy: Program organization when interacting with secondary website, I have:

    class MyPipeline(object):
        def __init__(self, crawler):
            self.crawler = crawler

I'm trying to get a better understanding of this code, especially the lines at the beginning listed above. Why would you initialize the pipeline object with a crawler? I have a lot of pipelines where I don't include this, or any __init__ method at all. What is the purpose of initializing the pipeline with a crawler?

Source: https://stackoverflow
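The usual reason is Scrapy's from_crawler hook: when a pipeline defines that classmethod, Scrapy calls it with the running Crawler object, which gives the pipeline access to settings, stats, and signals. A minimal sketch of the pattern; the setting name here is purely hypothetical:

    class MyPipeline(object):
        def __init__(self, crawler):
            # Keep a reference to reach crawler.settings, crawler.stats, crawler.signals.
            self.crawler = crawler
            # Hypothetical setting, shown only to illustrate why the reference is useful.
            self.batch_size = crawler.settings.getint('MYPIPELINE_BATCH_SIZE', 100)

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this (when defined) to construct the pipeline,
            # which is how a crawler ends up in __init__.
            return cls(crawler)

        def process_item(self, item, spider):
            return item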

Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback

南楼画角 submitted on 2019-12-20 03:53:16
Question: Basically, the code below scrapes the first 5 items of a table. One of the fields is another href, and following that href provides more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse. Running the code below only returns the info collected in parse. If I change the return items to return request, I get a completed item with all 3
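The question is cut off above, but the standard fix for this situation is to hand the half-filled item to the follow-up request via meta and yield the finished item only from the second callback. A minimal sketch for a recent Scrapy; the URLs, selectors, and field names are hypothetical:

    import scrapy

    class TableSpider(scrapy.Spider):
        name = 'table_demo'  # hypothetical
        start_urls = ['http://example.com/table']  # hypothetical

        def parse(self, response):
            for row in response.css('table tr')[:5]:  # hypothetical selector
                item = {'title': row.css('td::text').get()}
                detail_url = response.urljoin(row.css('a::attr(href)').get())
                # Carry the partial item along to the detail-page callback.
                yield scrapy.Request(detail_url,
                                     callback=self.parse_next_page,
                                     meta={'item': item})

        def parse_next_page(self, response):
            item = response.meta['item']
            item['detail'] = response.css('h1::text').get()  # hypothetical
            # Yield the completed item here, not from parse().
            yield item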

Scrapy installed, but isn't recognized on the command line

元气小坏坏 submitted on 2019-12-20 03:46:09
Question: I installed Scrapy in my Python 2.7 environment on Windows 7, but when I try to start a new Scrapy project using scrapy startproject newProject, the command prompt shows this message:

    'scrapy' is not recognized as an internal or external command, operable program or batch file.

Note: I also have Python 3.5, but that one does not have Scrapy. This question is not a duplicate of this one.

Answer 1: See the official documentation: set the environment variable and install pywin32.

Answer 2: Scrapy should be in your environment
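Concretely, setting the environment variable means putting the interpreter and the Scripts folder of the Python that has Scrapy on PATH, so cmd.exe can find the scrapy executable. A sketch for the command prompt, assuming Python 2.7 lives at C:\Python27 (adjust to your install path):

    rem Add the interpreter and its Scripts folder to PATH for this session
    set PATH=%PATH%;C:\Python27;C:\Python27\Scripts
    scrapy startproject newProject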

Scrapy and Django import error

孤人 submitted on 2019-12-20 03:23:22
Question: When I call the spider through a Python script, it gives me an ImportError:

    ImportError: No module named app.models

My items.py is like this:

    from scrapy.item import Item, Field
    from scrapy.contrib.djangoitem import DjangoItem
    from app.models import Person

    class aqaqItem(DjangoItem):
        django_model = Person

My settings.py is like this:

    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    # http://doc.scrapy
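This error usually means the Django project is not importable from the process running Scrapy. A common fix is to put the Django project on sys.path and point DJANGO_SETTINGS_MODULE at its settings module before app.models is imported; a sketch for the top of the Scrapy settings.py, with the path and module name both hypothetical:

    import os
    import sys

    # Make the Django project importable (path is hypothetical; use yours).
    sys.path.append('/path/to/django_project')

    # Tell Django which settings module to load (name is hypothetical).
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'django_project.settings')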

Avoiding redirection

瘦欲@ submitted on 2019-12-20 03:02:20
Question: I'm trying to parse a site (written in ASP), and the crawler gets redirected to the main site. But what I'd like to do is parse the given URL, not the redirected one. Is there a way to do this? I tried adding "REDIRECT=False" to the settings.py file without success. Here's some output from the crawler:

    2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=500&id=500>
    2011-09-24 20:01:11
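There is no REDIRECT=False setting; Scrapy's RedirectMiddleware is instead bypassed per request with the dont_redirect meta key, combined with handle_httpstatus_list so the raw 302 response reaches the callback instead of being discarded. A minimal sketch:

    import scrapy

    class CotoSpider(scrapy.Spider):
        name = 'coto'

        def start_requests(self):
            # Ask the downloader not to follow the 302 and to hand the raw
            # 302 response to the callback.
            yield scrapy.Request(
                'http://www.cotodigital.com.ar/l.asp?cat=500&id=500',
                meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                callback=self.parse)

        def parse(self, response):
            self.logger.info('Got %s from %s', response.status, response.url)

(Later Scrapy versions also offer REDIRECT_ENABLED = False in settings.py to disable redirects globally.)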

Scrapy: Unable to create a project

坚强是说给别人听的谎言 submitted on 2019-12-20 01:59:45
Question: I had issues installing Scrapy with respect to lxml, but then I found some information on Stack Overflow. Based on that information I did a sudo easy_install lxml; despite some errors, I think Scrapy got installed. The reason I came to that judgement is that in the Python REPL I could do the following:

    Python 2.7.5 (default, Jul 28 2013, 07:27:04)
    [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from scrapy import *
    >>>

But when I try
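Since the transcript is cut off, one quick sanity check from the same interpreter is to print the version string, which also confirms which install was imported (assuming this Scrapy release exposes __version__, as current ones do):

    >>> import scrapy
    >>> print(scrapy.__version__)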

Scrapy: Simulating Login

两盒软妹~` submitted on 2019-12-20 01:23:54
1. Override the spider's start_requests method and carry cookies directly to log in. Note that in Scrapy, cookies cannot be placed in headers; they must be passed as a separate parameter, because Scrapy's configuration defines a dedicated cookies option and cookie handling reads cookies directly from that parameter.

    import scrapy

    class RenrenSpider(scrapy.Spider):
        name = 'renren'
        # allowed_domains = ['renren.com']
        start_urls = ['http://www.renren.com/467372239/profile']

        # Override start_requests and carry cookies to log in
        def start_requests(self):
            # Carry the cookies of an already-logged-in session to simulate
            # login programmatically; this cookie value was taken manually
            # from the user page after logging in.
            cookies = "anonymid=jt79zqv32wojoo; _r01_=1; ln_uact=1970664163@qq.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn521/20120626/2140/h_main_0eaI
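The snippet is cut off mid cookie string. A minimal sketch of how this pattern is usually completed: convert the raw "k=v; k2=v2" string into the dict Scrapy expects, then pass it as the separate cookies argument (continuation inside start_requests; the cookie contents themselves are whatever was copied from the browser):

    # Convert the copied cookie string into a dict, splitting each pair on
    # the first '=' only, since values may themselves contain '='.
    cookie_dict = {pair.split('=', 1)[0].strip(): pair.split('=', 1)[1]
                   for pair in cookies.split(';')}
    yield scrapy.Request(
        self.start_urls[0],
        cookies=cookie_dict,   # cookies go in their own parameter, not headers
        callback=self.parse)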

12. Web Crawler Tutorial 2: Scrapy Framework Crawlers — Simulating Browser Login with Scrapy — Getting Cookies from the Scrapy Framework

痞子三分冷 submitted on 2019-12-20 01:22:51
Simulating browser login

The start_requests() method can return an initial request for the spider; this returned request plays the role of start_urls, and the requests returned by start_requests() replace the ones built from start_urls. Request() issues a GET request and lets you set the url, cookies, and a callback function. FormRequest.from_response() submits a form via POST; its first required parameter is the response object of the previous request (which carries that response's cookies), and other parameters include cookies, url, and the form contents. yield Request() hands a new request back to the engine for execution.

Cookie handling when sending requests (a sketch tying the three steps together follows below): meta={'cookiejar': 1} turns on cookie recording and is set on the first Request(); meta={'cookiejar': response.meta['cookiejar']} reuses the cookies of the previous response and is set on the FormRequest.from_response() that posts the credentials; meta={'cookiejar': True} uses the authorized cookies to visit pages that require login.

Getting cookies from the Scrapy framework

Request cookies:

    Cookie = response.request.headers.getlist('Cookie')
    print(Cookie)

Response cookies:

    Cookie2 =
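A minimal sketch of the three cookiejar steps in one spider; the URL and form field names are hypothetical:

    import scrapy
    from scrapy.http import FormRequest

    class LoginSpider(scrapy.Spider):
        name = 'login_demo'  # hypothetical
        start_urls = ['http://example.com/login']  # hypothetical

        def start_requests(self):
            # Step 1: first request, turn on cookie recording.
            yield scrapy.Request(self.start_urls[0],
                                 meta={'cookiejar': 1},
                                 callback=self.login)

        def login(self, response):
            # Step 2: post the form, reusing the previous response's cookiejar.
            return FormRequest.from_response(
                response,
                formdata={'username': 'user', 'password': 'secret'},  # hypothetical fields
                meta={'cookiejar': response.meta['cookiejar']},
                callback=self.after_login)

        def after_login(self, response):
            # Step 3: the recorded cookies now authorize login-only pages.
            Cookie = response.request.headers.getlist('Cookie')
            self.logger.info('Request cookies: %s', Cookie)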

Using Cookies in Scrapy to Skip Login Verification and to Simulate Login

好久不见. submitted on 2019-12-20 01:22:17
Original: https://blog.csdn.net/qq_34162294/article/details/72353397

Introduction: In Python crawling, I find the two hardest problems are IP proxies and simulated login, and it's even more annoying when login is followed by a CAPTCHA. Still, since anti-crawling measures exist, so do counter-measures. Here I first introduce cookie-based simulated login; a later article covers simulating a browser login with Selenium + PhantomJS. Friends who don't yet know what a cookie is can click here.

How to extract a cookie: open Chrome or Firefox; in Chrome, press F12 to bring up the developer console, click Network, then refresh the page to start capturing. Open any captured request and you can see the cookie. The cookie shown there is not in the format Python needs, so it has to be converted; the conversion code is below:

    # -*- coding: utf-8 -*-
    class transCookie:
        def __init__(self, cookie):
            self.cookie = cookie

        def stringToDict(self):
            '''
            Convert the cookie string copied from the browser into a dict Scrapy can use
            :return:
            '''
            itemDict = {}
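The method body is cut off after itemDict = {}. A minimal sketch of how the conversion typically continues, splitting the string on ';' and each pair on the first '=' only, since cookie values may themselves contain '=':

    # Presumed continuation of stringToDict(): build the dict pair by pair.
    for pair in self.cookie.split(';'):
        key, _, value = pair.strip().partition('=')
        itemDict[key] = value
    return itemDict

Usage would then look like transCookie(raw_string).stringToDict(), whose result can be passed straight to scrapy.Request(..., cookies=...).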