scrapy

How to pass parameters to a Spider in Scrapy

萝らか妹 submitted on 2020-01-13 22:01:27
Contents: Method 1 | Method 2 | settings.py | run.py | pipelines.py | launch example

When crawling data with Scrapy, you sometimes need to decide which URLs or which pages to crawl based on parameters passed to the Spider. For example, the address of the 放置奇兵 (Idle Heroes) forum on Baidu Tieba is shown below, where the kw parameter specifies the forum name and the pn parameter pages through the posts.

https://tieba.baidu.com/f?kw=放置奇兵&ie=utf-8&pn=250

If we want to pass the forum name, the page number, and similar values to the Spider as arguments, so that they control which forum and which pages get crawled, there are the following two ways to pass parameters to the Spider.

Method 1: pass parameters to the spider with the -a option of the scrapy crawl command.

```python
# -*- coding: utf-8 -*-
import scrapy

class TiebaSpider(scrapy.Spider):
    name = 'tieba'                         # Tieba spider
    allowed_domains = ['tieba.baidu.com']  # domains the spider may crawl
    start_urls = []                        # spider start URLs

    # Command format: scrapy crawl tieba -a tiebaName=放置奇兵 -a pn=250
    def __init__(self, tiebaName=None, pn=None,
```
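
The excerpt cuts off mid-signature. A minimal sketch of how such an __init__ typically continues, assuming the tiebaName and pn arguments are only used to build the start URL (the body below is an assumption, not the original article's code):

```python
import scrapy


class TiebaSpider(scrapy.Spider):
    name = 'tieba'
    allowed_domains = ['tieba.baidu.com']

    # scrapy crawl tieba -a tiebaName=放置奇兵 -a pn=250
    def __init__(self, tiebaName=None, pn=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the start URL from the command-line arguments passed with -a.
        self.start_urls = [
            'https://tieba.baidu.com/f?kw=%s&ie=utf-8&pn=%s' % (tiebaName, pn)
        ]
```

Arguments passed with -a always arrive as strings, so pn would need an int() conversion before any arithmetic on page numbers.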

Guidance for building a Google chrome extension in Python

旧时模样 submitted on 2020-01-13 19:25:08
Question: I would like to build a Chrome extension in Python (because JavaScript would be completely new to me!). There are multiple things that I want to do, and I'm having a hard time wrapping my head around whether it is possible and what tools would be needed. The extension would display the results of web scraping a page (using Scrapy). Scraping the page and updating the results in the extension need to happen every 5 minutes. I do know that Pyjs is one of the options, but it isn't clear to me

How to recursively scrape every link from a site using Scrapy?

岁酱吖の submitted on 2020-01-13 19:13:09
Question: I'm trying to obtain every single link (and no other data) from a website using Scrapy. I want to do this by starting at the homepage, scraping all the links there, then for each link found, following it and scraping all (unique) links from that page, and doing this for every link found until there are no more to follow. I also have to enter a username and password to get into each page on the site, so I've included a basic-authentication component in my start_requests. So far I have a
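
A possible sketch for this kind of crawl, assuming Scrapy's built-in HttpAuthMiddleware handles the basic auth and a CrawlSpider rule follows every link (domain and credentials are placeholders, not taken from the post):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class AllLinksSpider(CrawlSpider):
    name = 'all_links'
    allowed_domains = ['example.com']      # placeholder domain
    start_urls = ['https://example.com/']

    # HttpAuthMiddleware (enabled by default) adds HTTP basic auth to every
    # request when these attributes are set on the spider.
    http_user = 'user'                     # placeholder credentials
    http_pass = 'password'

    # Follow every link found and hand each response to parse_item.
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Yield only the URL; Scrapy's duplicate filter keeps the crawl from looping.
        yield {'url': response.url}
```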

Different parse function for different start_urls in Scrapy

房东的猫 submitted on 2020-01-13 18:24:09
Question: Can Scrapy set a different parse function for each entry in start_urls? This is the pseudo-code:

```python
start_urls = [
    "http://111sssssssss.com",
    "http://222sssssssssssss.com",
    "http://333sssssssssss.com",
    "http://444sssssssss.com",
]

def parse_1():
    '''some code; this function will crawl http://111sssssssss.com'''

def parse_2():
    '''some code; this function will crawl http://222sssssssssssss.com'''
```

Is there any way to do that? Answer 1: You can override / implement the parse_start_url function and there
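
One way to get per-URL callbacks, sketched under the assumption that building the requests in start_requests is acceptable (the URLs reuse the pseudo-code above; the mapping itself is hypothetical):

```python
import scrapy


class MultiParseSpider(scrapy.Spider):
    name = 'multi_parse'

    # Hypothetical mapping of start URL -> callback method name.
    url_callbacks = {
        'http://111sssssssss.com': 'parse_1',
        'http://222sssssssssssss.com': 'parse_2',
    }

    def start_requests(self):
        # Skip start_urls entirely and attach the right callback to each request.
        for url, callback_name in self.url_callbacks.items():
            yield scrapy.Request(url, callback=getattr(self, callback_name))

    def parse_1(self, response):
        self.logger.info('parse_1 handled %s', response.url)

    def parse_2(self, response):
        self.logger.info('parse_2 handled %s', response.url)
```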

Why is my scrapy spider not following the Request callback in my item parse function?

守給你的承諾、 submitted on 2020-01-13 09:22:06
Question: I'm scraping a site to check the in-stock status of various products. Unfortunately this requires actually clicking "Add to Cart" on the product page and checking the next page's message to determine whether stock is available (i.e. it requires parsing two responses). I followed the excellent documentation for this scenario and wrote my parse function to return a Request object with a callback to my secondary parse function. However, this function rarely gets called. Most products result in only
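
Since the post is truncated, the cause can't be confirmed here, but a frequent culprit in this two-response pattern is the duplicate request filter silently dropping the second request. A hedged sketch with dont_filter=True (URLs and selectors are placeholders):

```python
import scrapy


class StockSpider(scrapy.Spider):
    name = 'stock'
    start_urls = ['https://example.com/products']  # placeholder

    def parse(self, response):
        # Submit the "Add to Cart" form on the product page (selector is hypothetical).
        yield scrapy.FormRequest.from_response(
            response,
            formcss='form.add-to-cart',
            callback=self.parse_cart,
            dont_filter=True,  # keep near-identical cart URLs from being deduplicated
        )

    def parse_cart(self, response):
        # Decide stock status from the confirmation page's text.
        in_stock = 'out of stock' not in response.text.lower()
        yield {'url': response.url, 'in_stock': in_stock}
```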

Scrapy. How to change spider settings after start crawling?

妖精的绣舞 submitted on 2020-01-13 09:15:50
Question: I can't change spider settings in the parse method, but there must be a way. For example:

```python
class SomeSpider(BaseSpider):
    name = 'mySpider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']
    settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.FirstPipeline']
    print settings['ITEM_PIPELINES'][0]
    # printed 'myproject.pipelines.FirstPipeline'

    def parse(self, response):
        # ...some code
        settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.SecondPipeline']
        print
```
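
Scrapy settings are effectively frozen once the crawl starts, so reassigning ITEM_PIPELINES from parse() won't take effect. A common workaround, sketched here as an assumption rather than the accepted answer, is to keep both pipelines enabled and gate each one on a flag the spider flips at runtime:

```python
# pipelines.py (sketch): both pipelines stay registered in ITEM_PIPELINES; each
# one checks a runtime flag on the spider and decides whether to act.
class FirstPipeline:
    def process_item(self, item, spider):
        if getattr(spider, 'phase', 'first') == 'first':
            pass  # ...handle items produced during the first phase...
        return item


class SecondPipeline:
    def process_item(self, item, spider):
        if getattr(spider, 'phase', 'first') == 'second':
            pass  # ...handle items produced during the second phase...
        return item
```

The spider can then set self.phase = 'second' inside parse() when it wants later items routed to the other pipeline.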

How to prevent Scrapy from URL encoding request URLs

让人想犯罪 __ submitted on 2020-01-13 08:44:11
Question: I would like Scrapy not to URL-encode my Requests. I see that scrapy.http.Request imports scrapy.utils.url, which imports w3lib.url, which contains the variable _ALWAYS_SAFE_BYTES. I just need to add a set of characters to _ALWAYS_SAFE_BYTES, but I am not sure how to do that from within my spider class. Relevant line in scrapy.http.Request: fp.update(canonicalize_url(request.url)). canonicalize_url is from scrapy.utils.url; relevant line in scrapy.utils.url: path = safe_url_string(_unquotepath
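
Hedged sketch only: the question says _ALWAYS_SAFE_BYTES lives at module level in w3lib.url, so in principle it can be extended before any requests are built (for example from the spider's __init__). This pokes at a private constant whose exact type and location can differ between w3lib versions, so treat it as a direction rather than working code:

```python
import w3lib.url

# Example characters to keep un-encoded; assumes _ALWAYS_SAFE_BYTES is a bytes value.
EXTRA_SAFE = b'[]'

if isinstance(w3lib.url._ALWAYS_SAFE_BYTES, bytes):
    w3lib.url._ALWAYS_SAFE_BYTES += EXTRA_SAFE
```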

[Environment setup] Solution for the error: Microsoft Visual C++ 14.0 is required

你。 submitted on 2020-01-13 07:04:47
Take installing scrapy as an example. Running pip install scrapy produces: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
Solution:
1. Install Microsoft Visual C++ 14.0: https://964279924.ctfile.com/fs/1445568-239446865 or, alternatively: https://pan.baidu.com/s/189wnAzjPjedFYjtb6le4Pw (extraction code: dcm7)
2. If an error reports that the .NET Framework version is too low (below the minimum required 4.5) [skip this step if this error does not appear]: install a newer .NET Framework from https://support.microsoft.com/en-us/help/3151800/the-net-framework-4-6-2-offline-installer-for-windows and then install Microsoft Visual C++ 14.0 again.
3. Restart the computer, then install scrapy: pip install scrapy
Source:

Submit form that renders dynamically with Scrapy?

血红的双手。 submitted on 2020-01-13 06:43:08
Question: I'm trying to submit a dynamically generated user login form using Scrapy and then parse the HTML on the page that corresponds to a successful login. I was wondering how I could do that with Scrapy, or with a combination of Scrapy and Selenium. Selenium makes it possible to find the element in the DOM, but I was wondering if it would be possible to "give control back" to Scrapy after getting the full HTML, in order to allow it to carry out the form submission and save the necessary cookies, session
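
A commonly suggested hand-off, sketched under assumptions (URLs, selectors, and credentials are placeholders, not the poster's): let Selenium render and submit the login form, then copy its cookies into a Scrapy Request so Scrapy continues the crawl with the authenticated session:

```python
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By


class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']    # placeholder login page

    def parse(self, response):
        driver = webdriver.Chrome()
        driver.get(response.url)

        # Fill and submit the dynamically rendered form (selectors are hypothetical).
        driver.find_element(By.NAME, 'username').send_keys('user')
        driver.find_element(By.NAME, 'password').send_keys('pass')
        driver.find_element(By.CSS_SELECTOR, 'button[type=submit]').click()

        # Hand the authenticated session back to Scrapy as plain cookies.
        cookies = {c['name']: c['value'] for c in driver.get_cookies()}
        driver.quit()

        yield scrapy.Request(
            'https://example.com/account',        # placeholder post-login page
            cookies=cookies,
            callback=self.parse_logged_in,
        )

    def parse_logged_in(self, response):
        # Anything parsed here sees the logged-in version of the site.
        yield {'title': response.css('title::text').get()}
```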