scrapy | 易学教程

Can't run Scrapy program

Question: I have been learning how to work with Scrapy from the following link: http://doc.scrapy.org/en/master/intro/tutorial.html When I try to run the code from the Crawling section (scrapy crawl dmoz), I get the following error: AttributeError: 'module' object has no attribute 'Spider' However, when I changed "Spider" to "spider", I only got a new error: TypeError: Error when calling the metaclass bases module.__init__() takes at most 2 arguments (3 given) I'm so confused, what is the
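The second error in the question is what Python raises whenever a module object, rather than a class, is used as a base class. A minimal sketch that reproduces it with the standard library alone is shown here; `fake_spider_module` is a hypothetical stand-in for a module such as `scrapy.spider`, and the exact wording of the message varies between Python versions.

```python
import types

# Hypothetical stand-in for a module object (for example, what the name
# `scrapy.spider` referred to in the question). It is a module, not a class.
fake_spider_module = types.ModuleType("spider")

error_message = None
try:
    # Using a module as a base class makes Python call the module type with
    # (name, bases, namespace), which it does not accept.
    class DmozSpider(fake_spider_module):
        name = "dmoz"
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Under this reading, the fix is to inherit from a spider class defined inside the module, not from the module itself.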

Scraping Lagou job listings with Scrapy (7): going distributed

The previous post implemented data storage, covering MongoDB, MySQL, and local files; this post covers distributed crawling. What we have so far is a single-machine crawler, meaning it runs on one machine only. Imagine several machines running this crawler at the same time and all writing to the same database: crawling speed would improve considerably. Going distributed only requires the appropriate configuration in settings.py. Documentation time; the official docs say: Use the following settings in your project: # Enables scheduling storing requests queue in redis. SCHEDULER = "scrapy_redis.scheduler.Scheduler" # Ensure all spiders share same duplicates filter through redis. DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # Default requests serializer is pickle, but it can be changed to any module # with loads and dumps functions. Note that pickle is not
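A minimal sketch of how the quoted settings might sit in settings.py. The first two values come from the excerpt above; the Redis URL, the persistence flag, and the pipeline priority are assumptions chosen for illustration and need adjusting to the actual deployment.

```python
# settings.py (fragment), assuming the scrapy-redis package is installed.

# Store the request queue in Redis so every machine pulls from one shared queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Share a single duplicates filter across all spiders through Redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Assumption: keep the queue and dedup set in Redis between runs.
SCHEDULER_PERSIST = True

# Assumption: a Redis server on the local machine; every crawler machine must
# point at the same server for the crawl to be shared.
REDIS_URL = "redis://localhost:6379"

# Assumption: also push scraped items into Redis; 300 is an arbitrary priority.
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 300,
}
```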

Scraping Lagou job listings with Scrapy (4): extracting the fields

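The previous post analyzed the URL pattern of the detail pages and filled in items.py with the fields we want to extract; this post extracts those item fields. This mostly involves selectors; if you are not familiar with them, see: Using Scrapy selectors. The code still goes in lagou_c.py. First import the LagouItem class; because of the two __init__.py files, the containing folders can be used as Python packages: from lagou.items import LagouItem Then write the parse_item() function (again heavily commented for the sake of explanation): def parse_item(self, response): item = LagouItem() # create an item object item['url'] = response.url # this response is the detail page's response, since in this project only detail pages use the callback item['name'] = response.css('.name::text').extract_first() # select the job title with a CSS selector; the result is a list, so extract_first() takes the first element item['salary'] = response.css('.salary::text').extract_first() # select the salary with a CSS selector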

Scraping Lagou job listings with Scrapy (5): code optimization

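In the previous post we got the code running and every field printing to the console, but writing the item dict-style is verbose, and some fields come out in inconsistent formats, such as the release date. The data will later go into a database, and since nearly all fields are currently strings, they would take more space and make queries slower, so this post first optimizes the existing code. Goal of this post: use item loaders and processors to tidy the code and clean the field data. 1. Modify the fields in items.py. We split the salary and work-experience fields so they suit database storage better: import scrapy class LagouItem(scrapy.Item): url = scrapy.Field() name = scrapy.Field() ssalary = scrapy.Field() # minimum salary esalary = scrapy.Field() # maximum salary location = scrapy.Field() syear = scrapy.Field() # minimum years of experience eyear = scrapy.Field() # maximum years of experience edu_background = scrapy.Field() type = scrapy.Field() tags = scrapy.Field() release_time = scrapy.Field() advantage = scrapy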
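The split into a lower and an upper bound can be sketched as a plain cleaning function of the kind an item-loader processor might call. This is not the author's code: the helper name `split_range` and the raw formats "10k-20k" and "经验3-5年" are assumptions about what the page text looks like.

```python
import re


def split_range(text):
    """Return (low, high) as integers from a range string such as '10k-20k'.

    Hypothetical helper: returns (None, None) when the text has no digits,
    and repeats a lone number for both bounds.
    """
    numbers = [int(n) for n in re.findall(r"\d+", text or "")]
    if not numbers:
        return None, None
    if len(numbers) == 1:
        return numbers[0], numbers[0]
    return numbers[0], numbers[1]


# Assumed raw values; the real page text may differ.
ssalary, esalary = split_range("10k-20k")    # bounds in thousands
syear, eyear = split_range("经验3-5年")       # bounds in years
print(ssalary, esalary, syear, eyear)
```

Storing the bounds as integers rather than one string is what makes range queries in the database straightforward.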

Why this inconsistent behaviour using scrapy shell printing results?

Question: Load the scrapy shell: scrapy shell "http://www.worldfootball.net/all_matches/eng-premier-league-2015-2016/" Try a selector: response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]') Note: it prints results. But now use that selector in a for statement: for row in response.xpath('(//table[@class="standard_tabelle"])[1]/tr[not(th)]'): row.xpath(".//a[contains(@href, 'report')]/@href").extract_first() Hit return twice, nothing is printed. To print results inside the for loop, you

what's the meaning of request.headers.setdefault() in scrapy

Question: I want to set a custom UserAgentMiddleware with Scrapy, but I don't know what request.headers.setdefault('User-Agent', ua) does, and I couldn't find the method in either the Scrapy or the requests documentation. Where can I find an explanation of it? Answer 1: headers is a normal dictionary, so setdefault is a way to set a value in that dictionary if that value isn't present there already. The explanation would be that the Middleware sets the User-Agent by default only if you didn't set
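The behaviour described in the answer can be seen with a plain dict, used here as a stand-in for the dict-like request.headers; the agent strings are made up for illustration.

```python
# Case 1: the header is missing, so setdefault stores the fallback value.
headers = {}
headers.setdefault("User-Agent", "middleware-default-agent")

# Case 2: the header was already set, so setdefault leaves it untouched
# and simply returns the existing value.
preset = {"User-Agent": "my-custom-agent"}
returned = preset.setdefault("User-Agent", "middleware-default-agent")

print(headers["User-Agent"])
print(preset["User-Agent"], returned)
```

setdefault is a standard method of Python's built-in dict, which is why it does not show up as a Scrapy-specific or requests-specific API.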

Xpath Error - Spider error processing

Question: So I am building this spider and it crawls fine, because I can log into the shell, go through the HTML page, and test my XPath queries. Not sure what I am doing wrong. Any help would be appreciated. I have reinstalled Twisted, but nothing. My spider looks like this - from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector from spider_scrap.items import spiderItem class spider(BaseSpider): name="spider1" #allowed_domains = ["example.com"] start_urls = [ "http:/

[Scrapy] Getting started with the crawler framework

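Framework structure. Engine: the central module that coordinates everything. Spiders: the units that generate the URLs to request and process the responses directly. Scheduler: builds the URL queue (including deduplication). Downloader: the unit that talks to the internet directly. Pipeline: the unit for persistent storage. Installation. pip is usually recommended, but in my case installing with pip simply failed, so I recommend Anaconda: install with conda install scrapy. Scrapy is driven from the command line, and since I am on Windows, scrapy also has to be added to the PATH; the downloaded scrapy sits in Anaconda's package directory, so locate it and add it. Getting started. Creating a project. Before you start crawling, you must create a new Scrapy project. Go to the directory where you want to store the code and run the following command: scrapy startproject tutorial This command creates a tutorial directory with the following contents, in the current working directory of cmd: tutorial/ scrapy.cfg tutorial/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ... These files are: scrapy.cfg : the project's configuration file. tutorial/ : the project's Python module; you will add your code here later. tutorial/items.py : the project's items file. tutorial/pipelines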

Python-Scrapy: creating your first project

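Creating a project. Before you start crawling, you must create a new Scrapy project. Go to the directory where you want to store the code and run the following command: scrapy startproject tutorial This command creates a tutorial directory with the following contents: tutorial/ scrapy.cfg tutorial/ __init__.py items.py pipelines.py settings.py spiders/ __init__.py ... These files are: scrapy.cfg: the project's configuration file. tutorial: the project's Python module; you will add your code here later. tutorial/items.py: the project's items file. tutorial/pipelines.py: the project's pipelines file. tutorial/spiders/: the directory that holds the spider code. Defining an Item. An Item is a container for the scraped data: it is used much like a Python dict and adds a protection mechanism against undefined-field errors caused by typos. As in an ORM, you define an Item by creating a scrapy.Item class and declaring class attributes of type scrapy.Field. First, model the item on the data to be fetched from dmoz.org. We need the name, URL, and description of the sites listed on dmoz, so define the corresponding fields in the item. Edit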
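To make the "protection against typos" concrete, here is a standard-library stand-in that mimics the idea. It is explicitly not scrapy.Item or Scrapy's implementation: `StrictItem`, the `fields` tuple, and the field names `name`, `url`, and `desc` are all hypothetical and exist only for this sketch.

```python
class StrictItem(dict):
    """Dict whose item[key] = value assignments only accept declared keys.

    Hypothetical stand-in used to illustrate the idea; not scrapy.Item.
    """

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class DmozItem(StrictItem):
    # Assumed field names for the name, URL, and description of a site.
    fields = ("name", "url", "desc")


item = DmozItem()
item["name"] = "Example site"      # declared field: accepted

try:
    item["nmae"] = "Example site"  # misspelled field: rejected
    typo_rejected = False
except KeyError:
    typo_rejected = True

print(dict(item), typo_rejected)
```

A plain dict would silently accept the misspelled key, and the mistake would surface only later when the data is read back.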

The Scrapy framework

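Getting-started example. Learning goals: create a Scrapy project; define the structured data to extract (Item); write a Spider that crawls the site and extracts the structured data (Item); write Item Pipelines to store the extracted Items (the structured data). 1. Create a project (scrapy startproject) Before crawling, you must create a new Scrapy project. Go into your chosen project directory and run the following command: scrapy startproject mySpider Here mySpider is the project name. A mySpider folder is created, with a directory structure roughly as follows: Here is a brief description of each main file: scrapy.cfg : the project's configuration file. mySpider/ : the project's Python module, from which code is imported. mySpider/items.py : the project's item definitions file. mySpider/pipelines.py : the project's pipelines file. mySpider/settings.py : the project's settings file. mySpider/spiders/ : the directory that holds the spider code. 2. Define the target (mySpider/items.py) We plan to scrape the names, titles, and personal information of all the instructors listed at http://www.itcast.cn/channel/teacher.shtml. Open items.py in the mySpider directory. An Item defines structured data fields and holds the scraped data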
