scrapy

Is it OK for Scrapy's request_fingerprint method to return None?

与世无争的帅哥 submitted on 2019-12-25 16:59:29
Question: I'd like to override Scrapy's default RFPDupeFilter class as follows: from scrapy.dupefilters import RFPDupeFilter class URLDupefilter(RFPDupeFilter): def request_fingerprint(self, request): if not request.url.endswith('.xml'): return request.url The rationale is that I would like to make requests.seen 'human-readable' by using the scraped URLs (which are sufficiently unique) rather than a hash. However, I would like to omit URLs ending with .xml (which correspond to sitemap pages). Like
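
The excerpt is cut off before any answer, but one way to keep the fingerprint human-readable without ever returning None is to fall back to the parent class's hashed fingerprint for the .xml sitemap URLs. A minimal sketch under that assumption (not the accepted answer from the thread):

```python
from scrapy.dupefilters import RFPDupeFilter


class URLDupefilter(RFPDupeFilter):
    """Use the raw URL as the dedup fingerprint, except for sitemap pages."""

    def request_fingerprint(self, request):
        if request.url.endswith('.xml'):
            # Sitemap pages keep the default hashed fingerprint instead of
            # this method returning None.
            return super().request_fingerprint(request)
        # Ordinary pages: the URL itself is unique enough and stays readable
        # in requests.seen.
        return request.url
```

It would be enabled by pointing the DUPEFILTER_CLASS setting at this class in settings.py.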

Crawling relevant Tieba posts from an input keyword with the Scrapy framework

耗尽温柔 submitted on 2019-12-25 16:33:52
Crawling relevant Tieba posts from an input keyword with the Scrapy framework. Notes shared along the way while learning the Scrapy framework; pointers on my shortcomings are welcome.

Site analysis: First open any Tieba forum. To crawl a specific forum from an input keyword, we have to go through its search engine. Clicking in, there are four search modes; trying each one and watching how the URL changes, we learn:

Forum search: http://tieba.baidu.com/f/search/fm?ie=UTF-8&qw=dfd
Post search: http://tieba.baidu.com/f/search/res?ie=utf-8&qw=dfd

The parameter qw is the search keyword, so we can build the forum-search URL from it. The result page then gives us the forum URLs we need, so we can easily reach the forums related to our search.

Next, analyze the forum's main page. Open the forum and press F12: the #thread_list element is clearly the list of posts, and the data-field attribute on each li tag carries the information we need. We only need each post's URL for further extraction; data-tid is the post ID and uniquely identifies the post. For example, data-tid="6410699527" gives the post URL tieba.baidu.com/p/6410699527. The detailed exploration is not spelled out step by step here...

Analyzing a post: Skipping the long hunt through the page source, we found a piece of JavaScript code in the source
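
As a rough sketch of the flow described above — build the search URL from qw, follow forum links from the result page, then read data-tid out of #thread_list — something like the following could work. The CSS selectors and the search-result link class are assumptions, not taken from the original post or verified against current Tieba markup.

```python
import scrapy
from urllib.parse import quote


class TiebaSpider(scrapy.Spider):
    name = "tieba_keyword"

    def __init__(self, keyword="scrapy", *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Forum-search URL built from the qw keyword parameter.
        self.start_urls = [
            "http://tieba.baidu.com/f/search/fm?ie=UTF-8&qw=" + quote(keyword)
        ]

    def parse(self, response):
        # Follow the forum links on the search-result page
        # (the selector here is a guess).
        for href in response.css("a.search_main::attr(href)").getall():
            yield response.follow(href, callback=self.parse_forum)

    def parse_forum(self, response):
        # #thread_list holds one entry per post; data-tid identifies the post,
        # and tieba.baidu.com/p/<data-tid> is the post URL.
        for tid in response.css("#thread_list li::attr(data-tid)").getall():
            yield response.follow("http://tieba.baidu.com/p/" + tid,
                                  callback=self.parse_post)

    def parse_post(self, response):
        # Further extraction of the post body continues in the article.
        pass
```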

Learning Scrapy《精通Python爬虫框架Scrapy》 03: Scrapy's Workflow

佐手、 submitted on 2019-12-25 12:49:04
Personally, I think this book expects readers not to dig too deep and just pick up some usage. Unfortunately, that's not me; I always want to understand why things are the way they are. Take Scrapy's workflow: why do we add an item the moment we start, and then go straight to parsing data? Without understanding the mechanism, my head was mush. So: Read the f*cking document ( https://docs.scrapy.org/en/latest/topics/architecture.html )

Scrapy's components:
Scrapy Engine: the core component; it controls the data flow between all the other components and triggers events.
Scheduler: receives requests from the engine, pushes them onto a queue, and returns them when the engine asks again.
Downloader: sends requests to URLs, receives the server's responses, and hands them back to the spider.
Spiders: parse the response data and extract the required data as items.
Item Pipeline: processes the items the spider extracts from pages; its main jobs are persisting items, validating them, and stripping unwanted information. Once a page is parsed by the spider, its items are sent to the pipeline and processed through a fixed sequence of steps.
Downloader middlewares: hooks that sit between the engine and the downloader and mainly process the requests and responses passing between them.
Spider middlewares: hooks that sit between the engine and the spider
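
To make that data flow concrete, here is a minimal end-to-end sketch (the site, selectors, and class names are illustrative, not from the book): the engine schedules the request, the downloader fetches the response, the spider turns it into items, and the item pipeline cleans them.

```python
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # The downloader has already fetched the page; the spider only parses.
        for text in response.css("div.quote span.text::text").getall():
            yield QuoteItem(text=text)


class StripQuotesPipeline:
    # Item pipelines run after the spider yields an item: validate, clean, persist.
    def process_item(self, item, spider):
        item["text"] = item["text"].strip("\u201c\u201d ")
        return item
```

The pipeline would be activated through the ITEM_PIPELINES setting; everything else is wired up by the engine automatically.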

Python - how to add response from scrapy.request from yield into an array

时光怂恿深爱的人放手 submitted on 2019-12-25 09:48:46
Question: I am trying to collect the populations of different sovereign states from the wiki list of sovereigns and add them to an array on each response. In the code below, allList should end up as a list of dicts with the name of the country under ['nation'] and the population under ['demographics']. Many thanks. # -*- coding: utf-8 -*- import scrapy import logging import csv import pprint class CrawlerSpider(scrapy.Spider): name = 'test2Crawler' allowed_domains = ['web'] start_urls = ['https://en.wikipedia.org/wiki/List_of
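
The excerpt stops mid-code, but the usual pattern for this is to append to an attribute on the spider in each callback and only read the list once the crawl is finished (for example in the spider's closed() hook), since the callbacks run asynchronously. A sketch under assumed URLs and selectors (the Wikipedia link and the population extraction are placeholders, not the ones from the question):

```python
import scrapy


class CrawlerSpider(scrapy.Spider):
    name = "test2Crawler"
    # Placeholder start URL; the one in the question is truncated above.
    start_urls = ["https://en.wikipedia.org/wiki/List_of_sovereign_states"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.allList = []

    def parse(self, response):
        for href in response.css("table a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_country)

    def parse_country(self, response):
        # One dict per response; the list is only complete when the crawl ends.
        self.allList.append({
            "nation": response.css("h1::text").get(),
            "demographics": response.xpath(
                "//table[contains(@class, 'infobox')]//text()"
            ).re_first(r"\d[\d,]{5,}"),
        })

    def closed(self, reason):
        self.logger.info("collected %d entries", len(self.allList))
```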

How can I avoid JSON percent-encoding and \u-escaping?

时间秒杀一切 submitted on 2019-12-25 09:30:13
Question: When I parse the file <html> <head><meta charset="UTF-8"></head> <body><a href="Düsseldorf.html">Düsseldorf</a></body> </html> using item = SimpleItem() item['name'] = response.xpath('//a/text()')[0].extract() item["url"] = response.xpath('//a/@href')[0].extract() return item I end up with either \u escapes [{ "name": "D\u00fcsseldorf", "url": "D\u00fcsseldorf.html" }] or with percent-encoded strings D%C3%BCsseldorf The item exporter described here # -*- coding: utf-8 -*- import json from
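
The excerpt cuts off at the exporter, but the two symptoms come from different layers: json.dumps escapes non-ASCII by default (ensure_ascii=True produces \u00fc), and the href can end up percent-encoded when it is handled as a URL. A sketch of one way to undo both in a pipeline (file name and item fields are assumed from the snippet; on recent Scrapy versions, setting FEED_EXPORT_ENCODING = 'utf-8' already makes the built-in JSON feed exports readable):

```python
import json
from urllib.parse import unquote


class Utf8JsonLinesPipeline:
    """Write items as UTF-8 JSON lines instead of ASCII-escaped JSON."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # D%C3%BCsseldorf.html -> Düsseldorf.html
        item["url"] = unquote(item["url"])
        # ensure_ascii=False keeps "Düsseldorf" instead of "D\u00fcsseldorf".
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```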

How to append items from scrapy spider to list?

爷，独闯天下 submitted on 2019-12-25 08:49:20
Question: I'm using a basic spider that gets particular information from links on a website. My code looks like this: import sys from scrapy import Request import urllib.parse as urlparse from properties import PropertiesItem, ItemLoader from scrapy.crawler import CrawlerProcess class BasicSpider(scrapy.Spider): name = "basic" allowed_domains = ["web"] start_urls = ['www.example.com'] objectList = [] def parse(self, response): # Get item URLs and yield Requests item_selector = response.xpath('//*[
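
The excerpt ends mid-spider, but since it already imports CrawlerProcess, one common way to get the scraped items into a plain Python list is to run the crawl programmatically and collect items from the item_scraped signal. A sketch under that assumption (the spider body is a placeholder, not the code from the question):

```python
from scrapy import Spider, signals
from scrapy.crawler import CrawlerProcess

collected = []  # filled as the spider yields items


class BasicSpider(Spider):
    name = "basic"
    start_urls = ["http://www.example.com"]

    def parse(self, response):
        yield {"title": response.css("title::text").get()}


def on_item_scraped(item, response, spider):
    # Called once for every item the spider yields.
    collected.append(item)


process = CrawlerProcess()
crawler = process.create_crawler(BasicSpider)
crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished
print(len(collected), "items collected")
```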

scrapy-splash usage for rendering javascript

回眸只為那壹抹淺笑 submitted on 2019-12-25 08:24:45
Question: This is a follow-up to my previous question. I installed splash and scrapy-splash, and also followed the instructions for scrapy-splash. I edited my code as follows: import scrapy from scrapy_splash import SplashRequest class CityDataSpider(scrapy.Spider): name = "citydata" def start_requests(self): urls = [ 'http://www.city-data.com/advanced/search.php#body?fips=0&csize=a&sc=2&sd=0&states=ALL&near=&nam_crit1=6914&b6914=MIN&e6914=MAX&i6914=1&nam_crit2=6819&b6819=15500&e6819=MAX&i6819=1&ps=20&p
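
For reference, a trimmed version of that setup, assuming a Splash instance is running and SPLASH_URL plus the scrapy-splash middlewares are configured in settings.py as the scrapy-splash README describes (the URL below is shortened; the full query string is truncated in the excerpt):

```python
import scrapy
from scrapy_splash import SplashRequest


class CityDataSpider(scrapy.Spider):
    name = "citydata"

    def start_requests(self):
        urls = ["http://www.city-data.com/advanced/search.php"]  # shortened
        for url in urls:
            # Let Splash render the JavaScript before the response comes back.
            yield SplashRequest(url, callback=self.parse, args={"wait": 2})

    def parse(self, response):
        # response.text now holds the rendered HTML rather than the bare page.
        self.logger.info("rendered page length: %d", len(response.text))
```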

Scrapy - offsite request to be processed based on a regex

我只是一个虾纸丫 submitted on 2019-12-25 08:16:22
Question: I have to crawl 5-6 domains. I want to write the crawler so that offsite requests whose URL contains one of a set of substrings, e.g. [aaa, bbb, ccc], are processed rather than filtered out. Should I write a custom middleware, or can I just use a regular expression in the allowed domains? Answer 1: The offsite middleware already uses a regex by default, but it's not exposed. It compiles the domains you provide into a regex, but the
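
Building on that answer, one option is a custom spider middleware that subclasses the stock OffsiteMiddleware and whitelists URLs containing one of the substrings before deferring to the normal domain check. A sketch only: the substrings are the placeholders from the question, and should_follow is the internal check the stock middleware uses, not a documented extension point.

```python
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class SubstringOffsiteMiddleware(OffsiteMiddleware):
    """Allow offsite requests whose URL contains a whitelisted substring."""

    ALLOWED_SUBSTRINGS = ("aaa", "bbb", "ccc")  # placeholders from the question

    def should_follow(self, request, spider):
        if any(s in request.url for s in self.ALLOWED_SUBSTRINGS):
            return True
        return super().should_follow(request, spider)
```

It would replace the built-in offsite middleware entry in the SPIDER_MIDDLEWARES setting.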

Scrapy Script called via shell_exec doesn't perform

末鹿安然 submitted on 2019-12-25 07:58:27
Question: I have a Scrapy spider at this path: define("SPIDER_PATH", "C:\\Users\\[USERNAME]\\test1\\test1\\spiders\\test.py"); Now I try to launch the script via PHP: if (is_numeric(filter_input(INPUT_POST, "reload"))) { $additional = " -a check=" . filter_input(INPUT_POST, "reload"); } echo shell_exec("scrapy runspider " . SPIDER_PATH . $additional); But nothing happens and there is nothing echoed from shell_exec. I've tested it on a local machine using WAMP. Can anyone help me? The environment