scrapy

Initializing pipeline object with crawler in scrapy

社会主义新天地 submitted on 2019-12-20 04:11:10
Question: Based on Scrapy: Program organization when interacting with secondary website, I have:

    class MyPipeline(object):
        def __init__(self, crawler):
            self.crawler = crawler

I'm trying to get a better understanding of this code, especially the lines at the beginning listed above. Why would you initialize the pipeline object with a crawler? I have a lot of pipelines where I don't include this, or any __init__ method at all. What is the purpose of initializing the pipeline with a crawler?

Source: https://stackoverflow
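The usual reason is Scrapy's from_crawler hook: when a pipeline defines that classmethod, Scrapy calls it with the running Crawler object, which gives the pipeline access to settings, stats, and signals. A minimal sketch of the pattern; the setting name here is purely hypothetical:

    class MyPipeline(object):
        def __init__(self, crawler):
            # Keep a reference to reach crawler.settings, crawler.stats, crawler.signals.
            self.crawler = crawler
            # Hypothetical setting, shown only to illustrate why the reference is useful.
            self.batch_size = crawler.settings.getint('MYPIPELINE_BATCH_SIZE', 100)

        @classmethod
        def from_crawler(cls, crawler):
            # Scrapy calls this (when defined) to construct the pipeline,
            # which is how a crawler ends up in __init__.
            return cls(crawler)

        def process_item(self, item, spider):
            return item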

Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback

南楼画角 submitted on 2019-12-20 03:53:16
Question: Basically, the code below scrapes the first 5 items of a table. One of the fields is another href, and following that href provides more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse. Running the code below only returns the info collected in parse. If I change the return items to return request, I get a completed item with all 3
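The question is cut off above, but the standard fix for this situation is to hand the half-filled item to the follow-up request via meta and yield the finished item only from the second callback. A minimal sketch for a recent Scrapy; the URLs, selectors, and field names are hypothetical:

    import scrapy

    class TableSpider(scrapy.Spider):
        name = 'table_demo'  # hypothetical
        start_urls = ['http://example.com/table']  # hypothetical

        def parse(self, response):
            for row in response.css('table tr')[:5]:  # hypothetical selector
                item = {'title': row.css('td::text').get()}
                detail_url = response.urljoin(row.css('a::attr(href)').get())
                # Carry the partial item along to the detail-page callback.
                yield scrapy.Request(detail_url,
                                     callback=self.parse_next_page,
                                     meta={'item': item})

        def parse_next_page(self, response):
            item = response.meta['item']
            item['detail'] = response.css('h1::text').get()  # hypothetical
            # Yield the completed item here, not from parse().
            yield item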

Scrapy installed, but isn't recognized on the command line

元气小坏坏 submitted on 2019-12-20 03:46:09
Question: I installed Scrapy in my Python 2.7 environment on Windows 7, but when I try to start a new Scrapy project using scrapy startproject newProject, the command prompt shows this message:

    'scrapy' is not recognized as an internal or external command, operable program or batch file.

Note: I also have Python 3.5, but that one does not have Scrapy. This question is not a duplicate of this one.

Answer 1: See the official documentation: set the environment variable and install pywin32.

Answer 2: Scrapy should be in your environment
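Concretely, setting the environment variable means putting the interpreter and the Scripts folder of the Python that has Scrapy on PATH, so cmd.exe can find the scrapy executable. A sketch for the command prompt, assuming Python 2.7 lives at C:\Python27 (adjust to your install path):

    rem Add the interpreter and its Scripts folder to PATH for this session
    set PATH=%PATH%;C:\Python27;C:\Python27\Scripts
    scrapy startproject newProject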

Scrapy and Django import error

孤人 submitted on 2019-12-20 03:23:22
Question: When I call the spider through a Python script, it gives me an ImportError:

    ImportError: No module named app.models

My items.py is like this:

    from scrapy.item import Item, Field
    from scrapy.contrib.djangoitem import DjangoItem
    from app.models import Person

    class aqaqItem(DjangoItem):
        django_model = Person

My settings.py is like this:

    # For simplicity, this file contains only the most important settings by
    # default. All the other settings are documented here:
    #
    # http://doc.scrapy
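This error usually means the Django project is not importable from the process running Scrapy. A common fix is to put the Django project on sys.path and point DJANGO_SETTINGS_MODULE at its settings module before app.models is imported; a sketch for the top of the Scrapy settings.py, with the path and module name both hypothetical:

    import os
    import sys

    # Make the Django project importable (path is hypothetical; use yours).
    sys.path.append('/path/to/django_project')

    # Tell Django which settings module to load (name is hypothetical).
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'django_project.settings')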

Avoiding redirection

瘦欲@ submitted on 2019-12-20 03:02:20
Question: I'm trying to parse a site (written in ASP), and the crawler gets redirected to the main site. But what I'd like to do is parse the given URL, not the redirected one. Is there a way to do this? I tried adding "REDIRECT=False" to the settings.py file without success. Here's some output from the crawler:

    2011-09-24 20:01:11-0300 [coto] DEBUG: Redirecting (302) to <GET http://www.cotodigital.com.ar/default.asp> from <GET http://www.cotodigital.com.ar/l.asp?cat=500&id=500>
    2011-09-24 20:01:11
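There is no REDIRECT=False setting; Scrapy's RedirectMiddleware is instead bypassed per request with the dont_redirect meta key, combined with handle_httpstatus_list so the raw 302 response reaches the callback instead of being discarded. A minimal sketch:

    import scrapy

    class CotoSpider(scrapy.Spider):
        name = 'coto'

        def start_requests(self):
            # Ask the downloader not to follow the 302 and to hand the raw
            # 302 response to the callback.
            yield scrapy.Request(
                'http://www.cotodigital.com.ar/l.asp?cat=500&id=500',
                meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                callback=self.parse)

        def parse(self, response):
            self.logger.info('Got %s from %s', response.status, response.url)

(Later Scrapy versions also offer REDIRECT_ENABLED = False in settings.py to disable redirects globally.)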

Scrapy: Unable to create a project

坚强是说给别人听的谎言 submitted on 2019-12-20 01:59:45
Question: I had issues installing Scrapy with respect to lxml, but then I found some information on Stack Overflow. Based on that information I did a sudo easy_install lxml; despite some errors, I think Scrapy got installed. The reason I came to that judgement is that in the Python REPL I could do the following:

    Python 2.7.5 (default, Jul 28 2013, 07:27:04)
    [GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from scrapy import *
    >>>

But when I try
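Since the transcript is cut off, one quick sanity check from the same interpreter is to print the version string, which also confirms which install was imported (assuming this Scrapy release exposes __version__, as current ones do):

    >>> import scrapy
    >>> print(scrapy.__version__)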

Scrapy: Simulating Login

两盒软妹~` submitted on 2019-12-20 01:23:54
1. Override the spider's start_requests method and carry cookies directly to log in. Note that in Scrapy, cookies cannot be placed in headers; they must be passed as a separate parameter, because Scrapy's configuration defines a dedicated cookies option and cookie handling reads cookies directly from that parameter.

    import scrapy

    class RenrenSpider(scrapy.Spider):
        name = 'renren'
        # allowed_domains = ['renren.com']
        start_urls = ['http://www.renren.com/467372239/profile']

        # Override start_requests and carry cookies to log in
        def start_requests(self):
            # Carry the cookies of an already-logged-in session to simulate
            # login programmatically; this cookie value was taken manually
            # from the user page after logging in.
            cookies = "anonymid=jt79zqv32wojoo; _r01_=1; ln_uact=1970664163@qq.com; ln_hurl=http://hdn.xnimg.cn/photos/hdn521/20120626/2140/h_main_0eaI
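The snippet is cut off mid cookie string. A minimal sketch of how this pattern is usually completed: convert the raw "k=v; k2=v2" string into the dict Scrapy expects, then pass it as the separate cookies argument (continuation inside start_requests; the cookie contents themselves are whatever was copied from the browser):

    # Convert the copied cookie string into a dict, splitting each pair on
    # the first '=' only, since values may themselves contain '='.
    cookie_dict = {pair.split('=', 1)[0].strip(): pair.split('=', 1)[1]
                   for pair in cookies.split(';')}
    yield scrapy.Request(
        self.start_urls[0],
        cookies=cookie_dict,   # cookies go in their own parameter, not headers
        callback=self.parse)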

12. Web Crawler Tutorial 2: Scrapy Framework Crawlers — Simulating Browser Login with Scrapy — Getting Cookies from the Scrapy Framework

痞子三分冷 submitted on 2019-12-20 01:22:51
Simulating browser login

The start_requests() method can return an initial request for the spider; this returned request plays the role of start_urls, and the requests returned by start_requests() replace the ones built from start_urls. Request() issues a GET request and lets you set the url, cookies, and a callback function. FormRequest.from_response() submits a form via POST; its first required parameter is the response object of the previous request (which carries that response's cookies), and other parameters include cookies, url, and the form contents. yield Request() hands a new request back to the engine for execution.

Cookie handling when sending requests (a sketch tying the three steps together follows below): meta={'cookiejar': 1} turns on cookie recording and is set on the first Request(); meta={'cookiejar': response.meta['cookiejar']} reuses the cookies of the previous response and is set on the FormRequest.from_response() that posts the credentials; meta={'cookiejar': True} uses the authorized cookies to visit pages that require login.

Getting cookies from the Scrapy framework

Request cookies:

    Cookie = response.request.headers.getlist('Cookie')
    print(Cookie)

Response cookies:

    Cookie2 =
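A minimal sketch of the three cookiejar steps in one spider; the URL and form field names are hypothetical:

    import scrapy
    from scrapy.http import FormRequest

    class LoginSpider(scrapy.Spider):
        name = 'login_demo'  # hypothetical
        start_urls = ['http://example.com/login']  # hypothetical

        def start_requests(self):
            # Step 1: first request, turn on cookie recording.
            yield scrapy.Request(self.start_urls[0],
                                 meta={'cookiejar': 1},
                                 callback=self.login)

        def login(self, response):
            # Step 2: post the form, reusing the previous response's cookiejar.
            return FormRequest.from_response(
                response,
                formdata={'username': 'user', 'password': 'secret'},  # hypothetical fields
                meta={'cookiejar': response.meta['cookiejar']},
                callback=self.after_login)

        def after_login(self, response):
            # Step 3: the recorded cookies now authorize login-only pages.
            Cookie = response.request.headers.getlist('Cookie')
            self.logger.info('Request cookies: %s', Cookie)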

Using Cookies in Scrapy to Skip Login Verification and to Simulate Login

好久不见. submitted on 2019-12-20 01:22:17
Original: https://blog.csdn.net/qq_34162294/article/details/72353397

Introduction: In Python crawling, I find the two hardest problems are IP proxies and simulated login, and it's even more annoying when login is followed by a CAPTCHA. Still, since anti-crawling measures exist, so do counter-measures. Here I first introduce cookie-based simulated login; a later article covers simulating a browser login with Selenium + PhantomJS. Friends who don't yet know what a cookie is can click here.

How to extract a cookie: open Chrome or Firefox; in Chrome, press F12 to bring up the developer console, click Network, then refresh the page to start capturing. Open any captured request and you can see the cookie. The cookie shown there is not in the format Python needs, so it has to be converted; the conversion code is below:

    # -*- coding: utf-8 -*-
    class transCookie:
        def __init__(self, cookie):
            self.cookie = cookie

        def stringToDict(self):
            '''
            Convert the cookie string copied from the browser into a dict Scrapy can use
            :return:
            '''
            itemDict = {}
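The method body is cut off after itemDict = {}. A minimal sketch of how the conversion typically continues, splitting the string on ';' and each pair on the first '=' only, since cookie values may themselves contain '=':

    # Presumed continuation of stringToDict(): build the dict pair by pair.
    for pair in self.cookie.split(';'):
        key, _, value = pair.strip().partition('=')
        itemDict[key] = value
    return itemDict

Usage would then look like transCookie(raw_string).stringToDict(), whose result can be passed straight to scrapy.Request(..., cookies=...).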