scrapy

Scrapy FormRequest login not working

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-08 07:51:41
Question: I'm trying to log in with Scrapy but keep receiving lots of "Redirecting (302)" messages. This happens when I use my real login and also with fake login info. I also tried it with another site and still had no luck. import scrapy from scrapy.http import FormRequest, Request class LoginSpider(scrapy.Spider): name = 'SOlogin' allowed_domains = ['stackoverflow.com'] login_url = 'https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f' test_url = 'http:/
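
Whatever the rest of that spider looks like, here is a minimal sketch of the pattern that usually fixes this kind of login (field names are assumptions, not Stack Overflow's actual form fields): FormRequest.from_response pre-fills the form's hidden inputs, such as CSRF tokens, which a hand-built FormRequest leaves out, and a missing token is a common cause of being 302-redirected straight back to the login page.

```python
import scrapy
from scrapy.http import FormRequest

class LoginSketchSpider(scrapy.Spider):
    name = 'login_sketch'
    start_urls = ['https://example.com/login']  # placeholder login page

    def parse(self, response):
        # from_response copies the form's hidden inputs (e.g. a CSRF token)
        # into the POST; omitting them often yields a 302 back to the login page.
        yield FormRequest.from_response(
            response,
            formdata={'email': 'me@example.com', 'password': 'secret'},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        # A single 302 after a successful POST is normal (redirect to the
        # returnurl); judge success from the page you land on instead.
        if b'logout' in response.body.lower():
            self.logger.info('logged in')
```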

Extract text from 200k domains with scrapy

Submitted by 喜欢而已 on 2021-02-08 07:51:28
Question: My problem is: I want to extract all valuable text from some domains, for example www.example.com. So I go to the website, visit all links up to a maximal depth of 2, and write the result to a CSV file. I wrote a scrapy module which solves this problem using one process and yielding multiple crawlers, but it is inefficient - I am able to crawl ~1k domains / ~5k websites per hour, and as far as I can see my bottleneck is the CPU (because of the GIL?). After leaving my PC for some time I found that my network connection
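
If the bottleneck really is one CPU-bound Python process, a common workaround is to shard the domain list across several OS processes, each with its own CrawlerProcess. A rough sketch, assuming a DomainSpider (hypothetical name and module) that accepts its start domains as a keyword argument:

```python
from multiprocessing import Pool

from scrapy.crawler import CrawlerProcess

from myproject.spiders import DomainSpider  # hypothetical spider module


def crawl_shard(domains):
    # Each worker is a separate process with its own reactor, so a single
    # process's GIL no longer caps total throughput.
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(DomainSpider, domains=domains)  # assumes the spider takes this kwarg
    process.start()  # blocks until this shard is finished


if __name__ == '__main__':
    all_domains = open('domains.txt').read().split()
    n = 8  # roughly one shard per core
    shards = [all_domains[i::n] for i in range(n)]
    # maxtasksperchild=1 gives every shard a fresh process, avoiding
    # Twisted's ReactorNotRestartable if a worker were reused.
    with Pool(processes=n, maxtasksperchild=1) as pool:
        pool.map(crawl_shard, shards)
```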

how to deal with escaped_fragment using scrapy

Submitted by 不想你离开。 on 2021-02-08 07:35:23
Question: Recently I used scrapy to scrape zoominfo, and I tested the URL below: http://subscriber.zoominfo.com/zoominfo/#!search/profile/person?personId=521850874&targetid=profile but somehow in the terminal it changed into this: [scrapy] DEBUG: Crawled (200) <GET http://subscriber.zoominfo.com/zoominfo/?_escaped_fragment_=search%2Fprofile%2Fperson%3FpersonId%3D521850874%26targetid%3Dprofile> I have added AJAXCRAWL_ENABLED = True in settings.py but the url still has escaped_fragment . I suspect that I haven't
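
For reference, that rewrite follows Google's now-deprecated AJAX-crawling convention (the scheme Scrapy's AjaxCrawlMiddleware is built around): everything after '#!' is moved into an '_escaped_fragment_' query parameter, percent-encoded. A small sketch of the mapping, which reproduces the URL in the debug line above:

```python
from urllib.parse import quote

def hashbang_to_escaped(url):
    # '#!fragment' -> '?_escaped_fragment_=fragment' (percent-encoded)
    base, _, fragment = url.partition('#!')
    sep = '&' if '?' in base else '?'
    return base + sep + '_escaped_fragment_=' + quote(fragment, safe='')

url = ('http://subscriber.zoominfo.com/zoominfo/#!search/profile/person'
       '?personId=521850874&targetid=profile')
print(hashbang_to_escaped(url))
# http://subscriber.zoominfo.com/zoominfo/?_escaped_fragment_=search%2Fprofile%2Fperson%3FpersonId%3D521850874%26targetid%3Dprofile
```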

Scrapy simulate XHR request - returning 400

Submitted by 扶醉桌前 on 2021-02-08 06:59:51
Question: I'm trying to get data from a site using Ajax. The page loads and then Javascript requests the content. See this page for details: https://www.tele2.no/mobiltelefon.aspx The problem is that when I try to simulate this process by calling this url: https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters I get a 400 response telling me that the request is not allowed. This is my code: # -*- coding: utf-8 -*- import scrapy import json class Tele2Spider(scrapy.Spider): name =
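
A 400 from this kind of Ajax endpoint usually means it expects a POST with a JSON body and matching Content-Type, not a bare GET. A sketch of that shape (the payload keys are guesses, not the service's real contract):

```python
import json

import scrapy

class Tele2XhrSpider(scrapy.Spider):
    name = 'tele2_xhr'
    start_urls = ['https://www.tele2.no/mobiltelefon.aspx']

    def parse(self, response):
        # Replay the XHR the page's Javascript makes: a JSON POST with the
        # headers a browser would send.
        yield scrapy.Request(
            'https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters',
            method='POST',
            headers={
                'Content-Type': 'application/json; charset=UTF-8',
                'X-Requested-With': 'XMLHttpRequest',
                'Referer': response.url,
            },
            body=json.dumps({'filters': []}),  # assumed payload shape
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        self.logger.info('got %d top-level keys of JSON', len(data))
```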

Web crawlers

Submitted by 一个人想着一个人 on 2021-02-08 06:17:12
Preliminaries: an introduction to os.environ. os.environ exposes the environment variables of the current process - note, the current process. If one program sets an environment variable, another program cannot see that variable. The environment is held as a dictionary, so values can be read or set with the usual dict methods. Commonly used os.environ keys:

Windows:
1. os.environ['HOMEPATH']: the current user's home directory.
2. os.environ['TEMP']: the temporary directory path.
3. os.environ['PATHEXT']: executable file extensions.
4. os.environ['SYSTEMROOT']: the system root directory.
5. os.environ['LOGONSERVER']: the machine name.
6. os.environ['PROMPT']: the prompt setting.

Linux:
1. os.environ['USER']: the current user.
2. os.environ['LC_COLLATE']: the collation order used when sorting path-expansion results.
3. os.environ['SHELL']: the shell in use.
4. os.environ['LANG']: the language in use.
5. os.environ['SSH_AUTH_SOCK']: the path of the ssh-agent socket.

The built-in approach: how it works
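
A short usage sketch of the dictionary behaviour described above; os.environ is a plain mapping, so the usual dict idioms apply:

```python
import os

# Read with a fallback if the variable is unset.
home = os.environ.get('HOME', '/tmp')

# Setting a value affects only this process and its child processes.
os.environ['MY_FLAG'] = '1'

print(home, os.environ['MY_FLAG'])
```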

Scrapy-redis distributed crawler

Submitted by 落爺英雄遲暮 on 2021-02-08 05:31:44
### Scrapy-redis distributed crawler

Scrapy is a general-purpose crawling framework, but it has no built-in support for distributed crawling. Scrapy-redis provides a set of redis-based components to make distributed Scrapy crawls easier. Its approach is to swap the Scrapy queue for a redis database (that is, a redis queue): with the requests to crawl held on a single redis-server, multiple spiders can read from the same database.

### The Scrapy-Redis distribution strategy

Master (core server): runs a Redis database; it does no crawling itself and is only responsible for deduplicating url fingerprints, allocating Requests, and storing the data.

Slaver (crawler node): runs the spider program, submitting new Requests to the Master as it crawls.

A Slaver first takes a task (a Request/url) from the Master and scrapes the data; while scraping, it hands the Requests for any new tasks it generates back to the Master. The Master holds the only Redis database; it deduplicates not-yet-processed Requests and allocates tasks, pushes processed Requests onto the to-crawl queue, and stores the scraped data.

#### Master: installing Redis

1. Download and install

wget http://download.redis.io/releases/redis-3.2.6.tar.gz
tar xzf
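
Beyond installing Redis, for concreteness here is a minimal settings.py sketch of how scrapy-redis wires a Slaver to the Master's Redis (the host name is a placeholder):

```python
# Queue requests in the Master's redis instead of Scrapy's in-memory queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate by request fingerprint in redis, shared by all Slavers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue and the dupe set between runs.
SCHEDULER_PERSIST = True
# Every Slaver points at the single Master redis-server.
REDIS_URL = 'redis://master-host:6379'
# Optionally ship scraped items back into redis on the Master.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
```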

Scrapy: How to stop requesting in case of 302?

Submitted by 流过昼夜 on 2021-02-08 02:58:29
Question: I am using Scrapy 2.4 to crawl specific pages from a start_urls list. Each of those URLs presumably has 6 result pages, so I request them all. In some cases, however, there is only 1 result page and all other paginated pages return a 302 to pn=1. In this case I do not want to follow that 302, nor do I want to continue looking for pages 3, 4, 5, 6, but rather continue to the next URL in the list. How do I exit (continue) this for loop in case of a 302/301, and how do I not follow that 302? def start
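
One way to get that behaviour (a sketch of the start_requests pattern, with the pagination parameter assumed from the question): set dont_redirect in the request meta so a 301/302 is delivered to the callback instead of being followed, and list those statuses in handle_httpstatus_list so Scrapy does not filter the response out.

```python
import scrapy

class PagedSpider(scrapy.Spider):
    name = 'paged'
    start_urls = ['https://example.com/results']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            for pn in range(1, 7):  # the 6 presumed result pages
                yield scrapy.Request(
                    f'{url}?pn={pn}',  # assumed pagination parameter
                    meta={'dont_redirect': True,
                          'handle_httpstatus_list': [301, 302]},
                    callback=self.parse,
                )

    def parse(self, response):
        if response.status in (301, 302):
            return  # this page redirected to pn=1: skip it; other requests proceed
        ...
```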

A Python learning roadmap for complete beginners: Python learning not to be missed

Submitted by 不羁岁月 on 2021-02-07 21:35:46
Lately a lot of people have been asking me about Python training. At first I was puzzled, but after consulting Baidu it turns out the wind in the internet industry really has shifted: Python's popularity has soared in recent years, and naturally more and more people are learning it. Beginners often hope for a Python learning roadmap when they start out, so I have compiled one from several sources.

Python learning path 1: Python basics
Required topics: [Linux basics] [Basic Python syntax] [Python strings] [File operations] [Exception handling] [Object-oriented Python] [Hands-on project]
Path notes: this path progresses step by step in a sensible order, helping learners build a correct programming mindset and basic programming ability.

Python learning path 2: Advanced Python
Required topics: [Migrating Python to Linux] [Common third-party Python libraries] [Advanced Python syntax] [Python regular expressions] [Network programming] [Systems programming] [Data structures and algorithms] [Hands-on project]
Path notes: this path emphasizes data structures and algorithms, focusing on the learner's core programming ability, so that learners master advanced Python usage and networking knowledge and can independently take on Python network-related development.

Python learning path 3: Web front-end development
Required topics: [HTML] [CSS] [UI basics] [JavaScript] [DOM] [Events] [jQuery] [Hybrid development] [Hands-on project]
Path notes