scrapy

Scrapy FormRequest login not working

Submitted by 僤鯓⒐⒋嵵緔 on 2021-02-08 07:51:41
Question: I'm trying to log in with Scrapy but keep receiving lots of "Redirecting (302)" messages. This happens when I use my real login and also with fake login info. I also tried it with another site and still had no luck. import scrapy from scrapy.http import FormRequest, Request class LoginSpider(scrapy.Spider): name = 'SOlogin' allowed_domains = ['stackoverflow.com'] login_url = 'https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f' test_url = 'http:/
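
Whatever the rest of that spider looks like, here is a minimal sketch of the pattern that usually fixes this kind of login (field names are assumptions, not Stack Overflow's actual form fields): FormRequest.from_response pre-fills the form's hidden inputs, such as CSRF tokens, which a hand-built FormRequest leaves out, and a missing token is a common cause of being 302-redirected straight back to the login page.

```python
import scrapy
from scrapy.http import FormRequest

class LoginSketchSpider(scrapy.Spider):
    name = 'login_sketch'
    start_urls = ['https://example.com/login']  # placeholder login page

    def parse(self, response):
        # from_response copies the form's hidden inputs (e.g. a CSRF token)
        # into the POST; omitting them often yields a 302 back to the login page.
        yield FormRequest.from_response(
            response,
            formdata={'email': 'me@example.com', 'password': 'secret'},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        # A single 302 after a successful POST is normal (redirect to the
        # returnurl); judge success from the page you land on instead.
        if b'logout' in response.body.lower():
            self.logger.info('logged in')
```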

Extract text from 200k domains with scrapy

Submitted by 喜欢而已 on 2021-02-08 07:51:28
Question: My problem is: I want to extract all valuable text from some domains, for example www.example.com. So I go to the website, visit all links up to a maximal depth of 2, and write the result to a CSV file. I wrote a scrapy module which solves this problem using one process and yielding multiple crawlers, but it is inefficient - I am able to crawl ~1k domains / ~5k websites per hour, and as far as I can see my bottleneck is the CPU (because of the GIL?). After leaving my PC for some time I found that my network connection
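
If the bottleneck really is one CPU-bound Python process, a common workaround is to shard the domain list across several OS processes, each with its own CrawlerProcess. A rough sketch, assuming a DomainSpider (hypothetical name and module) that accepts its start domains as a keyword argument:

```python
from multiprocessing import Pool

from scrapy.crawler import CrawlerProcess

from myproject.spiders import DomainSpider  # hypothetical spider module


def crawl_shard(domains):
    # Each worker is a separate process with its own reactor, so a single
    # process's GIL no longer caps total throughput.
    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(DomainSpider, domains=domains)  # assumes the spider takes this kwarg
    process.start()  # blocks until this shard is finished


if __name__ == '__main__':
    all_domains = open('domains.txt').read().split()
    n = 8  # roughly one shard per core
    shards = [all_domains[i::n] for i in range(n)]
    # maxtasksperchild=1 gives every shard a fresh process, avoiding
    # Twisted's ReactorNotRestartable if a worker were reused.
    with Pool(processes=n, maxtasksperchild=1) as pool:
        pool.map(crawl_shard, shards)
```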

how to deal with escaped_fragment using scrapy

Submitted by 不想你离开。 on 2021-02-08 07:35:23
Question: Recently I used scrapy to scrape zoominfo, and I tested the URL below: http://subscriber.zoominfo.com/zoominfo/#!search/profile/person?personId=521850874&targetid=profile but somehow in the terminal it changed into this: [scrapy] DEBUG: Crawled (200) <GET http://subscriber.zoominfo.com/zoominfo/?_escaped_fragment_=search%2Fprofile%2Fperson%3FpersonId%3D521850874%26targetid%3Dprofile> I have added AJAXCRAWL_ENABLED = True in settings.py but the url still has escaped_fragment . I suspect that I haven't
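
For reference, that rewrite follows Google's now-deprecated AJAX-crawling convention (the scheme Scrapy's AjaxCrawlMiddleware is built around): everything after '#!' is moved into an '_escaped_fragment_' query parameter, percent-encoded. A small sketch of the mapping, which reproduces the URL in the debug line above:

```python
from urllib.parse import quote

def hashbang_to_escaped(url):
    # '#!fragment' -> '?_escaped_fragment_=fragment' (percent-encoded)
    base, _, fragment = url.partition('#!')
    sep = '&' if '?' in base else '?'
    return base + sep + '_escaped_fragment_=' + quote(fragment, safe='')

url = ('http://subscriber.zoominfo.com/zoominfo/#!search/profile/person'
       '?personId=521850874&targetid=profile')
print(hashbang_to_escaped(url))
# http://subscriber.zoominfo.com/zoominfo/?_escaped_fragment_=search%2Fprofile%2Fperson%3FpersonId%3D521850874%26targetid%3Dprofile
```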

Scrapy simulate XHR request - returning 400

Submitted by 扶醉桌前 on 2021-02-08 06:59:51
Question: I'm trying to get data from a site using Ajax. The page loads and then Javascript requests the content. See this page for details: https://www.tele2.no/mobiltelefon.aspx The problem is that when I try to simulate this process by calling this url: https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters I get a 400 response telling me that the request is not allowed. This is my code: # -*- coding: utf-8 -*- import scrapy import json class Tele2Spider(scrapy.Spider): name =
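
A 400 from this kind of Ajax endpoint usually means it expects a POST with a JSON body and matching Content-Type, not a bare GET. A sketch of that shape (the payload keys are guesses, not the service's real contract):

```python
import json

import scrapy

class Tele2XhrSpider(scrapy.Spider):
    name = 'tele2_xhr'
    start_urls = ['https://www.tele2.no/mobiltelefon.aspx']

    def parse(self, response):
        # Replay the XHR the page's Javascript makes: a JSON POST with the
        # headers a browser would send.
        yield scrapy.Request(
            'https://www.tele2.no/Services/Webshop/FilterService.svc/ApplyPhoneFilters',
            method='POST',
            headers={
                'Content-Type': 'application/json; charset=UTF-8',
                'X-Requested-With': 'XMLHttpRequest',
                'Referer': response.url,
            },
            body=json.dumps({'filters': []}),  # assumed payload shape
            callback=self.parse_api,
        )

    def parse_api(self, response):
        data = json.loads(response.text)
        self.logger.info('got %d top-level keys of JSON', len(data))
```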

Web crawlers

Submitted by 一个人想着一个人 on 2021-02-08 06:17:12
Preliminaries: an introduction to os.environ. os.environ exposes the environment variables of the current process - note, the current process. If one program sets an environment variable, another program cannot see that variable. The environment is held as a dictionary, so values can be read or set with the usual dict methods. Commonly used os.environ keys:

Windows:
1. os.environ['HOMEPATH']: the current user's home directory.
2. os.environ['TEMP']: the temporary directory path.
3. os.environ['PATHEXT']: executable file extensions.
4. os.environ['SYSTEMROOT']: the system root directory.
5. os.environ['LOGONSERVER']: the machine name.
6. os.environ['PROMPT']: the prompt setting.

Linux:
1. os.environ['USER']: the current user.
2. os.environ['LC_COLLATE']: the collation order used when sorting path-expansion results.
3. os.environ['SHELL']: the shell in use.
4. os.environ['LANG']: the language in use.
5. os.environ['SSH_AUTH_SOCK']: the path of the ssh-agent socket.

The built-in approach: how it works
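
A short usage sketch of the dictionary behaviour described above; os.environ is a plain mapping, so the usual dict idioms apply:

```python
import os

# Read with a fallback if the variable is unset.
home = os.environ.get('HOME', '/tmp')

# Setting a value affects only this process and its child processes.
os.environ['MY_FLAG'] = '1'

print(home, os.environ['MY_FLAG'])
```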

Scrapy-redis distributed crawler

Submitted by 落爺英雄遲暮 on 2021-02-08 05:31:44
### Scrapy-redis distributed crawler

Scrapy is a general-purpose crawling framework, but it has no built-in support for distributed crawling. Scrapy-redis provides a set of redis-based components to make distributed Scrapy crawls easier. Its approach is to swap the Scrapy queue for a redis database (that is, a redis queue): with the requests to crawl held on a single redis-server, multiple spiders can read from the same database.

### The Scrapy-Redis distribution strategy

Master (core server): runs a Redis database; it does no crawling itself and is only responsible for deduplicating url fingerprints, allocating Requests, and storing the data.

Slaver (crawler node): runs the spider program, submitting new Requests to the Master as it crawls.

A Slaver first takes a task (a Request/url) from the Master and scrapes the data; while scraping, it hands the Requests for any new tasks it generates back to the Master. The Master holds the only Redis database; it deduplicates not-yet-processed Requests and allocates tasks, pushes processed Requests onto the to-crawl queue, and stores the scraped data.

#### Master: installing Redis

1. Download and install

wget http://download.redis.io/releases/redis-3.2.6.tar.gz
tar xzf
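
Beyond installing Redis, for concreteness here is a minimal settings.py sketch of how scrapy-redis wires a Slaver to the Master's Redis (the host name is a placeholder):

```python
# Queue requests in the Master's redis instead of Scrapy's in-memory queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Deduplicate by request fingerprint in redis, shared by all Slavers.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue and the dupe set between runs.
SCHEDULER_PERSIST = True
# Every Slaver points at the single Master redis-server.
REDIS_URL = 'redis://master-host:6379'
# Optionally ship scraped items back into redis on the Master.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
```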

Scrapy: How to stop requesting in case of 302?

Submitted by 流过昼夜 on 2021-02-08 02:58:29
Question: I am using Scrapy 2.4 to crawl specific pages from a start_urls list. Each of those URLs presumably has 6 result pages, so I request them all. In some cases, however, there is only 1 result page and all other paginated pages return a 302 to pn=1. In this case I do not want to follow that 302, nor do I want to continue looking for pages 3, 4, 5, 6, but rather continue to the next URL in the list. How do I exit (continue) this for loop in case of a 302/301, and how do I not follow that 302? def start
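
One way to get that behaviour (a sketch of the start_requests pattern, with the pagination parameter assumed from the question): set dont_redirect in the request meta so a 301/302 is delivered to the callback instead of being followed, and list those statuses in handle_httpstatus_list so Scrapy does not filter the response out.

```python
import scrapy

class PagedSpider(scrapy.Spider):
    name = 'paged'
    start_urls = ['https://example.com/results']  # placeholder

    def start_requests(self):
        for url in self.start_urls:
            for pn in range(1, 7):  # the 6 presumed result pages
                yield scrapy.Request(
                    f'{url}?pn={pn}',  # assumed pagination parameter
                    meta={'dont_redirect': True,
                          'handle_httpstatus_list': [301, 302]},
                    callback=self.parse,
                )

    def parse(self, response):
        if response.status in (301, 302):
            return  # this page redirected to pn=1: skip it; other requests proceed
        ...
```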

A Python learning roadmap for complete beginners: Python learning not to be missed

Submitted by 不羁岁月 on 2021-02-07 21:35:46
Lately a lot of people have been asking me about Python training. At first I was puzzled, but after consulting Baidu it turns out the wind in the internet industry really has shifted: Python's popularity has soared in recent years, and naturally more and more people are learning it. Beginners often hope for a Python learning roadmap when they start out, so I have compiled one from several sources.

Python learning path 1: Python basics
Required topics: [Linux basics] [Basic Python syntax] [Python strings] [File operations] [Exception handling] [Object-oriented Python] [Hands-on project]
Path notes: this path progresses step by step in a sensible order, helping learners build a correct programming mindset and basic programming ability.

Python learning path 2: Advanced Python
Required topics: [Migrating Python to Linux] [Common third-party Python libraries] [Advanced Python syntax] [Python regular expressions] [Network programming] [Systems programming] [Data structures and algorithms] [Hands-on project]
Path notes: this path emphasizes data structures and algorithms, focusing on the learner's core programming ability, so that learners master advanced Python usage and networking knowledge and can independently take on Python network-related development.

Python learning path 3: Web front-end development
Required topics: [HTML] [CSS] [UI basics] [JavaScript] [DOM] [Events] [jQuery] [Hybrid development] [Hands-on project]
Path notes