scrapy

Is it OK for Scrapy's request_fingerprint method to return None?

与世无争的帅哥 submitted on 2019-12-25 16:59:29
Question: I'd like to override Scrapy's default RFPDupeFilter class as follows: from scrapy.dupefilters import RFPDupeFilter class URLDupefilter(RFPDupeFilter): def request_fingerprint(self, request): if not request.url.endswith('.xml'): return request.url The rationale is that I would like to make requests.seen 'human-readable' by using the scraped URLs (which are sufficiently unique) rather than a hash. However, I would like to omit URLs ending with .xml (which correspond to sitemap pages). Like
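
The excerpt is cut off before any answer, but one way to keep the fingerprint human-readable without ever returning None is to fall back to the parent class's hashed fingerprint for the .xml sitemap URLs. A minimal sketch under that assumption (not the accepted answer from the thread):

```python
from scrapy.dupefilters import RFPDupeFilter


class URLDupefilter(RFPDupeFilter):
    """Use the raw URL as the dedup fingerprint, except for sitemap pages."""

    def request_fingerprint(self, request):
        if request.url.endswith('.xml'):
            # Sitemap pages keep the default hashed fingerprint instead of
            # this method returning None.
            return super().request_fingerprint(request)
        # Ordinary pages: the URL itself is unique enough and stays readable
        # in requests.seen.
        return request.url
```

It would be enabled by pointing the DUPEFILTER_CLASS setting at this class in settings.py.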

Crawling relevant Tieba posts from an input keyword with the Scrapy framework

耗尽温柔 submitted on 2019-12-25 16:33:52
Crawling relevant Tieba posts from an input keyword with the Scrapy framework. Notes shared along the way while learning the Scrapy framework; pointers on my shortcomings are welcome.

Site analysis: First open any Tieba forum. To crawl a specific forum from an input keyword, we have to go through its search engine. Clicking in, there are four search modes; trying each one and watching how the URL changes, we learn:

Forum search: http://tieba.baidu.com/f/search/fm?ie=UTF-8&qw=dfd
Post search: http://tieba.baidu.com/f/search/res?ie=utf-8&qw=dfd

The parameter qw is the search keyword, so we can build the forum-search URL from it. The result page then gives us the forum URLs we need, so we can easily reach the forums related to our search.

Next, analyze the forum's main page. Open the forum and press F12: the #thread_list element is clearly the list of posts, and the data-field attribute on each li tag carries the information we need. We only need each post's URL for further extraction; data-tid is the post ID and uniquely identifies the post. For example, data-tid="6410699527" gives the post URL tieba.baidu.com/p/6410699527. The detailed exploration is not spelled out step by step here...

Analyzing a post: Skipping the long hunt through the page source, we found a piece of JavaScript code in the source
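
As a rough sketch of the flow described above — build the search URL from qw, follow forum links from the result page, then read data-tid out of #thread_list — something like the following could work. The CSS selectors and the search-result link class are assumptions, not taken from the original post or verified against current Tieba markup.

```python
import scrapy
from urllib.parse import quote


class TiebaSpider(scrapy.Spider):
    name = "tieba_keyword"

    def __init__(self, keyword="scrapy", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Forum-search URL built from the qw keyword parameter.
        self.start_urls = [
            "http://tieba.baidu.com/f/search/fm?ie=UTF-8&qw=" + quote(keyword)
        ]

    def parse(self, response):
        # Follow the forum links on the search-result page
        # (the selector here is a guess).
        for href in response.css("a.search_main::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

    def parse_forum(self, response):
        # #thread_list holds one entry per post; data-tid identifies the post,
        # and tieba.baidu.com/p/<data-tid> is the post URL.
        for tid in response.css("#thread_list li::attr(data-tid)").getall():
            yield response.follow("http://tieba.baidu.com/p/" + tid,
                                  callback=self.parse_post)

    def parse_post(self, response):
        # Further extraction of the post body continues in the article.
        pass
```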

Learning Scrapy《精通Python爬虫框架Scrapy》 03: Scrapy's Workflow

佐手、 submitted on 2019-12-25 12:49:04
Personally, I think this book expects readers not to dig too deep and just pick up some usage. Unfortunately, that's not me; I always want to understand why things are the way they are. Take Scrapy's workflow: why do we add an item the moment we start, and then go straight to parsing data? Without understanding the mechanism, my head was mush. So: Read the f*cking document ( https://docs.scrapy.org/en/latest/topics/architecture.html )

Scrapy's components:
Scrapy Engine: the core component; it controls the data flow between all the other components and triggers events.
Scheduler: receives requests from the engine, pushes them onto a queue, and returns them when the engine asks again.
Downloader: sends requests to URLs, receives the server's responses, and hands them back to the spider.
Spiders: parse the response data and extract the required data as items.
Item Pipeline: processes the items the spider extracts from pages; its main jobs are persisting items, validating them, and stripping unwanted information. Once a page is parsed by the spider, its items are sent to the pipeline and processed through a fixed sequence of steps.
Downloader middlewares: hooks that sit between the engine and the downloader and mainly process the requests and responses passing between them.
Spider middlewares: hooks that sit between the engine and the spider
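
To make that data flow concrete, here is a minimal end-to-end sketch (the site, selectors, and class names are illustrative, not from the book): the engine schedules the request, the downloader fetches the response, the spider turns it into items, and the item pipeline cleans them.

```python
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # The downloader has already fetched the page; the spider only parses.
        for text in response.css("div.quote span.text::text").getall():
            yield QuoteItem(text=text)


class StripQuotesPipeline:
    # Item pipelines run after the spider yields an item: validate, clean, persist.
    def process_item(self, item, spider):
        item["text"] = item["text"].strip("\u201c\u201d ")
        return item
```

The pipeline would be activated through the ITEM_PIPELINES setting; everything else is wired up by the engine automatically.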

Python - how to add response from scrapy.request from yield into an array

时光怂恿深爱的人放手 submitted on 2019-12-25 09:48:46
Question: I am trying to collect the populations of different sovereign states from the wiki list of sovereigns and add them to an array on each response. In the code below, allList should end up as a list of dicts with the name of the country under ['nation'] and the population under ['demographics']. Many thanks. # -*- coding: utf-8 -*- import scrapy import logging import csv import pprint class CrawlerSpider(scrapy.Spider): name = 'test2Crawler' allowed_domains = ['web'] start_urls = ['https://en.wikipedia.org/wiki/List_of
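
The excerpt stops mid-code, but the usual pattern for this is to append to an attribute on the spider in each callback and only read the list once the crawl is finished (for example in the spider's closed() hook), since the callbacks run asynchronously. A sketch under assumed URLs and selectors (the Wikipedia link and the population extraction are placeholders, not the ones from the question):

```python
import scrapy


class CrawlerSpider(scrapy.Spider):
    name = "test2Crawler"
    # Placeholder start URL; the one in the question is truncated above.
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.allList = []

    def parse(self, response):
        for href in response.css("table a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_country)

    def parse_country(self, response):
        # One dict per response; the list is only complete when the crawl ends.
        self.allList.append({
            "nation": response.css("h1::text").get(),
            "demographics": response.xpath(
                "//table[contains(@class, 'infobox')]//text()"
            ).re_first(r"\d[\d,]{5,}"),
        })

    def closed(self, reason):
        self.logger.info("collected %d entries", len(self.allList))
```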

How can I avoid JSON percent-encoding and \u-escaping?

时间秒杀一切 submitted on 2019-12-25 09:30:13
Question: When I parse the file <html> <head><meta charset="UTF-8"></head> <body><a href="Düsseldorf.html">Düsseldorf</a></body> </html> using item = SimpleItem() item['name'] = response.xpath('//a/text()')[0].extract() item["url"] = response.xpath('//a/@href')[0].extract() return item I end up with either \u escapes [{ "name": "D\u00fcsseldorf", "url": "D\u00fcsseldorf.html" }] or with percent-encoded strings D%C3%BCsseldorf The item exporter described here # -*- coding: utf-8 -*- import json from
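
The excerpt cuts off at the exporter, but the two symptoms come from different layers: json.dumps escapes non-ASCII by default (ensure_ascii=True produces \u00fc), and the href can end up percent-encoded when it is handled as a URL. A sketch of one way to undo both in a pipeline (file name and item fields are assumed from the snippet; on recent Scrapy versions, setting FEED_EXPORT_ENCODING = 'utf-8' already makes the built-in JSON feed exports readable):

```python
import json
from urllib.parse import unquote


class Utf8JsonLinesPipeline:
    """Write items as UTF-8 JSON lines instead of ASCII-escaped JSON."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # D%C3%BCsseldorf.html -> Düsseldorf.html
        item["url"] = unquote(item["url"])
        # ensure_ascii=False keeps "Düsseldorf" instead of "D\u00fcsseldorf".
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```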

How to append items from scrapy spider to list?

爷，独闯天下 submitted on 2019-12-25 08:49:20
Question: I'm using a basic spider that gets particular information from links on a website. My code looks like this: import sys from scrapy import Request import urllib.parse as urlparse from properties import PropertiesItem, ItemLoader from scrapy.crawler import CrawlerProcess class BasicSpider(scrapy.Spider): name = "basic" allowed_domains = ["web"] start_urls = ['www.example.com'] objectList = [] def parse(self, response): # Get item URLs and yield Requests item_selector = response.xpath('//*[
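
The excerpt ends mid-spider, but since it already imports CrawlerProcess, one common way to get the scraped items into a plain Python list is to run the crawl programmatically and collect items from the item_scraped signal. A sketch under that assumption (the spider body is a placeholder, not the code from the question):

```python
from scrapy import Spider, signals
from scrapy.crawler import CrawlerProcess

collected = []  # filled as the spider yields items


class BasicSpider(Spider):
    name = "basic"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


def on_item_scraped(item, response, spider):
    # Called once for every item the spider yields.
    collected.append(item)


process = CrawlerProcess()
crawler = process.create_crawler(BasicSpider)
crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished
print(len(collected), "items collected")
```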

scrapy-splash usage for rendering javascript

回眸只為那壹抹淺笑 submitted on 2019-12-25 08:24:45
Question: This is a follow-up to my previous question. I installed splash and scrapy-splash, and also followed the instructions for scrapy-splash. I edited my code as follows: import scrapy from scrapy_splash import SplashRequest class CityDataSpider(scrapy.Spider): name = "citydata" def start_requests(self): urls = [ 'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p
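
For reference, a trimmed version of that setup, assuming a Splash instance is running and SPLASH_URL plus the scrapy-splash middlewares are configured in settings.py as the scrapy-splash README describes (the URL below is shortened; the full query string is truncated in the excerpt):

```python
import scrapy
from scrapy_splash import SplashRequest


class CityDataSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = ["http://www.city-data.com/advanced/search.php"]  # shortened
        for url in urls:
            # Let Splash render the JavaScript before the response comes back.
            yield SplashRequest(url, callback=self.parse, args={"wait": 2})

    def parse(self, response):
        # response.text now holds the rendered HTML rather than the bare page.
        self.logger.info("rendered page length: %d", len(response.text))
```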

Scrapy - offsite request to be processed based on a regex

我只是一个虾纸丫 submitted on 2019-12-25 08:16:22
Question: I have to crawl 5-6 domains. I want to write the crawler so that offsite requests whose URL contains one of a set of substrings, e.g. [aaa, bbb, ccc], are processed rather than filtered out. Should I write a custom middleware, or can I just use a regular expression in the allowed domains? Answer 1: The offsite middleware already uses a regex by default, but it's not exposed. It compiles the domains you provide into a regex, but the
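
Building on that answer, one option is a custom spider middleware that subclasses the stock OffsiteMiddleware and whitelists URLs containing one of the substrings before deferring to the normal domain check. A sketch only: the substrings are the placeholders from the question, and should_follow is the internal check the stock middleware uses, not a documented extension point.

```python
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class SubstringOffsiteMiddleware(OffsiteMiddleware):
    """Allow offsite requests whose URL contains a whitelisted substring."""

    ALLOWED_SUBSTRINGS = ("aaa", "bbb", "ccc")  # placeholders from the question

    def should_follow(self, request, spider):
        if any(s in request.url for s in self.ALLOWED_SUBSTRINGS):
            return True
        return super().should_follow(request, spider)
```

It would replace the built-in offsite middleware entry in the SPIDER_MIDDLEWARES setting.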

Scrapy Script called via shell_exec doesn't perform

末鹿安然 submitted on 2019-12-25 07:58:27
Question: I have a Scrapy spider at this path: define("SPIDER_PATH", "C:\\Users\\[USERNAME]\\test1\\test1\\spiders\\test.py"); Now I try to launch the script via PHP: if (is_numeric(filter_input(INPUT_POST, "reload"))) { $additional = " -a check=" . filter_input(INPUT_POST, "reload"); } echo shell_exec("scrapy runspider " . SPIDER_PATH . $additional); But nothing happens and there is nothing echoed from shell_exec. I've tested it on a local machine using WAMP. Can anyone help me? The environment