scrapy

Scrapy throws error ReactorNotRestartable when running on AWS Lambda

青春壹個敷衍的年華 submitted on 2019-12-18 04:25:15
Question: I have deployed a scrapy project that crawls whenever a Lambda API request comes in. It runs perfectly for the first API call, but later calls fail with a ReactorNotRestartable error. As far as I can tell, the AWS Lambda ecosystem does not kill the process, so the reactor is still present in memory. The Lambda error log is as follows:

    Traceback (most recent call last):
      File "/var/task/aws-lambda.py", line 42, in run_company_details_scrapy
        process.start()
      File "./lib/scrapy
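
One common workaround (not part of the excerpt above, just a hedged sketch) is to run each crawl in a child process so the Twisted reactor dies with it and never needs to be restarted inside a warm Lambda container. The handler signature and the spider name "company_details" are hypothetical, and the sketch assumes the Lambda runtime allows forking via multiprocessing.Process (Pool/Queue would not work there because /dev/shm is unavailable).

    import multiprocessing

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    def _crawl(spider_name):
        # Fresh process, fresh reactor: start() blocks until the crawl ends.
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name)  # spider name assumed to be registered in the project
        process.start()


    def lambda_handler(event, context):
        # Run the crawl in a child process so the parent (and the warm
        # container) never holds a stopped reactor.
        p = multiprocessing.Process(target=_crawl, args=("company_details",))
        p.start()
        p.join()
        return {"status": "done"}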

httplib.BadStatusLine: ''

我与影子孤独终老i submitted on 2019-12-18 03:56:15
Question: As always, I frequently have issues, and I have thoroughly searched for an answer to the current one but find myself at a loss. Here are some of the places I have searched:

- How to fix httplib.BadStatusLine exception?
- Python httplib2 Handling Exceptions
- python http status code

My issue is the following: I have created a spider and want to crawl different URLs. When I crawl each URL independently everything works fine. However, when I try to crawl both I get the following error: httplib
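
The excerpt stops before the spider code or any answer. As a hedged, tangential sketch only: one way to keep a single malformed response from aborting a multi-URL crawl is to attach an errback to each request so the failure is logged per URL. The spider name and URLs below are hypothetical.

    import scrapy


    class TwoSitesSpider(scrapy.Spider):
        name = "two_sites"
        start_urls = ["http://example.com/a", "http://example.org/b"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, errback=self.on_error)

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}

        def on_error(self, failure):
            # Log the failure (e.g. an empty status line) without stopping the crawl.
            self.logger.error("Download failed: %r", failure)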

Python selenium screen capture not getting whole page

风格不统一 submitted on 2019-12-18 03:45:38
Question: I am trying to create a generic webcrawler that will go to a site and take a screenshot. I am using Python, Selenium, and PhantomJS. The problem is that the screenshot does not capture all the images on the page. For example, if I go to YouTube, it doesn't capture the images below the main page image. (I don't have high enough rep to post a screenshot.) I think this may have something to do with dynamic content, but I have tried the wait functions such as implicitly_wait and set_page_load_timeout
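
A hedged sketch of one approach: scroll through the page in steps so lazily loaded images are actually requested before the screenshot is taken. The URL, window size, and sleep interval are placeholders, and webdriver.PhantomJS is only available in older Selenium releases.

    import time

    from selenium import webdriver

    driver = webdriver.PhantomJS()  # deprecated in newer Selenium; a headless browser also works
    driver.set_window_size(1366, 768)
    driver.get("http://example.com")

    last_height = 0
    while True:
        # Scroll to the bottom and give dynamic content a moment to load.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    driver.save_screenshot("page.png")
    driver.quit()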

Why does scrapy throw an error for me when trying to spider and parse a site?

送分小仙女□ submitted on 2019-12-18 03:40:43
Question: The following code

    class SiteSpider(BaseSpider):
        name = "some_site.com"
        allowed_domains = ["some_site.com"]
        start_urls = [
            "some_site.com/something/another/PRODUCT-CATEGORY1_10652_-1__85667",
        ]
        rules = (
            Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-CATEGORY_(.*)', ))),
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-DETAIL(.*)', )), callback="parse_item"),
        )
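
The excerpt is cut off before the error message or any answer. One detail that stands out, offered only as a guess: the rules attribute is only honoured by CrawlSpider, while the class above subclasses BaseSpider. A hedged sketch of that change, reusing the patterns from the excerpt (the URL scheme is added here, and parse_item is a placeholder):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class SiteSpider(CrawlSpider):
        name = "some_site.com"
        allowed_domains = ["some_site.com"]
        # Scheme added here; the excerpt omits it.
        start_urls = ["http://some_site.com/something/another/PRODUCT-CATEGORY1_10652_-1__85667"]
        rules = (
            Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-CATEGORY_(.*)',))),
            Rule(SgmlLinkExtractor(allow=('some_site.com/something/another/PRODUCT-DETAIL(.*)',)),
                 callback="parse_item"),
        )

        def parse_item(self, response):
            # Placeholder callback for matched detail pages.
            self.log("parse_item visited %s" % response.url)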

Downloading pictures with scrapy

*爱你&永不变心* submitted on 2019-12-18 03:39:39
Question: I'm starting with scrapy, and I have my first real problem: downloading pictures. This is my spider:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from example.items import ProductItem
    from scrapy.utils.response import get_base_url
    import re

    class ProductSpider(CrawlSpider):
        name = "product"
        allowed_domains = ["domain.com"]
        start_urls = [
            "http://www.domain.com/category
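
The excerpt ends before the item and pipeline setup. As a hedged sketch, the usual approach in that era of Scrapy was the built-in images pipeline: the spider fills an image_urls field and the pipeline downloads the files into IMAGES_STORE. Paths and field names below are illustrative.

    # settings.py
    ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
    # Newer Scrapy uses a dict and the scrapy.pipelines.images.ImagesPipeline path instead.
    IMAGES_STORE = '/path/to/images'   # directory where downloaded files are written

    # items.py
    from scrapy.item import Item, Field

    class ProductItem(Item):
        name = Field()
        image_urls = Field()   # filled in by the spider
        images = Field()       # filled in by the ImagesPipeline after download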

Scrapy basics: the difference between Scrapy and scrapy-redis

旧街凉风 submitted on 2019-12-18 02:04:24
The difference between Scrapy and scrapy-redis: Scrapy is a general-purpose crawling framework, but it does not support distributed crawling. Scrapy-redis provides a set of Redis-based components (components only) to make distributed crawling with Scrapy easier. Install it with pip install scrapy-redis. Scrapy-redis provides the following four components, which means these four modules all need corresponding modifications: Scheduler, Duplication Filter, Item Pipeline, Base Spider. The scrapy-redis architecture: Scheduler: Scrapy adapted Python's built-in collections.deque (a double-ended queue) into its own Scrapy queue (https://github.com/scrapy/queuelib/blob/master/queuelib/queue.py), but multiple Scrapy spiders cannot share this queue of pending requests; that is, Scrapy itself does not support distributed crawling. The scrapy-redis solution is to replace the Scrapy queue with a Redis database (i.e. a Redis queue), so the requests to be crawled are stored on a single redis-server and multiple spiders can read them from the same database. In Scrapy, the component directly tied to the "pending-request queue" is the scheduler
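
As a hedged illustration of how these components are swapped in, the scrapy-redis project documents a handful of settings.py entries along these lines; the Redis address is a placeholder.

    # Replace Scrapy's scheduler and duplicate filter with the Redis-backed ones.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Keep the request queue and dupefilter in Redis between runs.
    SCHEDULER_PERSIST = True

    # Optionally push scraped items into Redis as well.
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }

    # Location of the shared redis-server.
    REDIS_HOST = "localhost"
    REDIS_PORT = 6379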

Python crawler with Scrapy: the Scrapy crawler framework and common commands

我的梦境 submitted on 2019-12-17 23:12:18
I. The Scrapy crawler framework: the overall architecture and its two "bridges" (middleware layers). II. Common commands. Global commands: startproject, syntax: scrapy startproject <project_name>. One of the most frequently used Scrapy commands; it creates a project named <project_name> in the current directory. settings, syntax: scrapy settings [options]. Prints Scrapy's default settings; when run inside a project, it prints the project's settings instead. runspider, syntax: scrapy runspider <spider_file.py>. Runs a spider written in a Python file without creating a project. shell, syntax: scrapy shell [url]. Starts the Scrapy shell with the given URL, or empty if no URL is given. For example, scrapy shell http://www.baidu.com opens the Baidu URL and starts an interactive shell that is handy for quick testing. fetch, syntax: scrapy fetch <url>. Downloads the given URL with the Scrapy downloader
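
These commands are normally typed in a terminal, but as a hedged aside they can also be triggered from Python through scrapy.cmdline; the URL below is just an example target.

    from scrapy import cmdline

    # Equivalent to typing "scrapy fetch http://www.baidu.com" in a terminal.
    # Note that execute() exits the process when the command finishes.
    cmdline.execute("scrapy fetch http://www.baidu.com".split())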

Running Scrapy from a script - Hangs

会有一股神秘感。 submitted on 2019-12-17 21:57:27
Question: I'm trying to run scrapy from a script as discussed here. It suggested using this snippet, but when I do it hangs indefinitely. This was written back in version .10; is it still compatible with the current stable?

Answer 1:

    from scrapy import signals, log
    from scrapy.xlib.pydispatch import dispatcher
    from scrapy.crawler import CrawlerProcess
    from scrapy.conf import settings
    from scrapy.http import Request

    def handleSpiderIdle(spider):
        '''Handle spider idle event.'''
        # http://doc.scrapy.org/topics
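
The answer excerpt is cut off above. For comparison, on recent Scrapy versions the usual script-driven recipe is CrawlerProcess; this is a hedged sketch and the spider import path is hypothetical.

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from myproject.spiders.my_spider import MySpider  # hypothetical import path

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)
    process.start()  # blocks here until the crawl is finished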

How to generate the start_urls dynamically in crawling?

ε祈祈猫儿з submitted on 2019-12-17 17:33:17
Question: I am crawling a site which may contain a lot of start_urls, like: http://www.a.com/list_1_2_3.htm I want to populate start_urls like [list_\d+_\d+_\d+\.htm] and extract items from URLs like [node_\d+\.htm] during crawling. Can I use CrawlSpider to do this? And how can I generate the start_urls dynamically while crawling?

Answer 1: The best way to generate URLs dynamically is to override the start_requests method of the spider:

    from scrapy.http.request import Request

    def start
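
The answer is truncated right after "def start". A hedged completion along the same lines, with illustrative ranges for the numeric parts of the URL and old-style imports to match the question's era:

    from scrapy.http.request import Request
    from scrapy.spider import BaseSpider  # old-style import; newer Scrapy uses scrapy.Spider


    class ListSpider(BaseSpider):
        name = "list_spider"
        allowed_domains = ["www.a.com"]

        def start_requests(self):
            # Generate list_X_Y_Z.htm URLs on the fly instead of hard-coding start_urls.
            for a in range(1, 4):
                for b in range(1, 4):
                    for c in range(1, 4):
                        url = "http://www.a.com/list_%d_%d_%d.htm" % (a, b, c)
                        yield Request(url, callback=self.parse)

        def parse(self, response):
            # Follow node_XXXX.htm links and extract items here.
            pass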

Scrapy Shell and Scrapy Splash

拈花ヽ惹草 submitted on 2019-12-17 17:31:16
Question: We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container. If we want to use Splash in the spider, we configure several required project settings and yield a Request specifying specific meta arguments:

    yield Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,
                # 'url' is prefilled from request url
            },
            # optional parameters
            'endpoint': 'render.json',
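
The excerpt stops partway through the meta dictionary. For context, scrapy-splash also ships a SplashRequest helper that builds the same 'splash' meta; this is a hedged sketch with a placeholder URL and the same endpoint/args as above.

    import scrapy
    from scrapy_splash import SplashRequest


    class SplashDemoSpider(scrapy.Spider):
        name = "splash_demo"

        def start_requests(self):
            yield SplashRequest(
                "http://example.com",
                self.parse_result,
                endpoint="render.json",      # same endpoint as in the meta example
                args={"html": 1, "png": 1},  # rendering arguments
            )

        def parse_result(self, response):
            # response.data holds the decoded render.json result (e.g. 'html', 'png' keys).
            self.logger.info("Rendered keys: %s", list(response.data))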