scrapy

Scrapy components: item

Submitted by ◇◆丶佛笑我妖孽 on 2020-01-22 20:53:54
Scrapy is a popular web-crawling framework. Starting now I will record the whole process of learning Scrapy under Python 3.6, to make later additions and review easier. "Python web crawling with scrapy (1)" already covered installing Scrapy, creating a project, and testing the basic commands; this post explains item definition, extraction, and usage in detail.

Item definition

An item is a container for the scraped data. It is used much like a dictionary, but provides an extra protection mechanism so that typos become undefined-field errors instead of silently creating new keys. An item is defined by declaring class attributes of type scrapy.Field; edit items.py to define the items you need:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

# Container that holds the data we scrape
import scrapy

class ExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()          # each attribute is a Field object
    population = scrapy.Field()
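To illustrate the dict-like behaviour described above, here is a minimal, hypothetical spider snippet; the project module, URL, and CSS selectors are assumptions for the sketch, not taken from the original post:

# spider sketch using the ExampleItem defined above
import scrapy
from example.items import ExampleItem   # hypothetical project module

class CountrySpider(scrapy.Spider):
    name = 'country'
    start_urls = ['http://example.com/countries']   # placeholder URL

    def parse(self, response):
        item = ExampleItem()
        item['name'] = response.css('h1::text').get()            # assumed selector
        item['population'] = response.css('.population::text').get()
        # item['capital'] = '...'  # would raise KeyError: 'capital' is not a declared field
        yield item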

How to import django models in scrapy pipelines.py file

Submitted by 天涯浪子 on 2020-01-22 16:17:26
Question: I'm trying to import the models of one Django application in my pipelines.py so I can save data with the Django ORM. I created a Scrapy project, scrapy_project, inside the first Django application involved, "app1" (is that a good choice, by the way?). I added these lines to my Scrapy settings file:

def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ
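For reference, a rough sketch of the more recent way to bootstrap Django from a Scrapy project, after setup_environ was removed from Django; the paths and module names below are placeholders, not taken from the question. Once this runs in the Scrapy settings module, pipelines.py can import the models directly:

# settings.py of the Scrapy project
import os
import sys
import django

sys.path.append('/path/to/django/project')                           # assumed Django project location
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')   # hypothetical settings module
django.setup()                                                       # stands in for the old setup_environ()

# pipelines.py
from app1.models import MyModel           # hypothetical model from the "app1" application

class DjangoWriterPipeline:
    def process_item(self, item, spider):
        MyModel.objects.create(name=item.get('name'))   # persist via the Django ORM
        return item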

Crawling through pages with PostBack data javascript Python Scrapy

Submitted by 房东的猫 on 2020-01-22 08:30:28
Question: I'm crawling through some directories built with ASP.NET using Scrapy. The pages to crawl through are linked like this:

javascript:__doPostBack('ctl00$MainContent$List','Page$X')

where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something to the SLE rules as simple as allow=('Page$') or attrs='__doPostBack', but my guess is that I have to be trickier in order to pull the info from the javascript
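__doPostBack() simply submits the page's form with the hidden __EVENTTARGET and __EVENTARGUMENT fields set, so the same effect can usually be achieved by posting those fields explicitly. A sketch under that assumption; the URL, spider name, and callback are placeholders, not from the question:

# simulating the ASP.NET postback with FormRequest
import scrapy
from scrapy.http import FormRequest

class DirectorySpider(scrapy.Spider):
    name = 'aspnet_directory'
    start_urls = ['http://www.example.com/directory.aspx']   # placeholder URL

    def parse(self, response):
        for page in range(1, 181):
            yield FormRequest.from_response(
                response,
                formdata={
                    '__EVENTTARGET': 'ctl00$MainContent$List',
                    '__EVENTARGUMENT': 'Page$%d' % page,
                },
                callback=self.parse_page,
                dont_filter=True,   # the form posts back to the same URL for every page
            )

    def parse_page(self, response):
        # extract the directory entries here
        pass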

Scrapy Image Pipeline: How to rename images?

Submitted by 笑着哭i on 2020-01-22 02:51:07
Question: I have a spider which fetches both data and images. I want to rename the images with the respective 'title' that I'm fetching. Following is my code:

spider1.py

from imageToFileSystemCheck.items import ImagetofilesystemcheckItem
import scrapy

class TestSpider(scrapy.Spider):
    name = 'imagecheck'

    def start_requests(self):
        searchterms = ['keyword1', 'keyword2']
        for item in searchterms:
            yield scrapy.Request('http://www.example.com/s?=%s' % item,
                                 callback=self.parse, meta={'item': item})

    def parse
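A common way to get custom image names is to subclass ImagesPipeline and override file_path(). The sketch below assumes the item carries 'title' and 'image_urls' fields; the class name and field names are hypothetical, not from the question:

# pipelines.py - renaming images via a custom ImagesPipeline (illustrative sketch)
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class RenamedImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # pass the title along with each image request
        for url in item.get('image_urls', []):
            yield scrapy.Request(url, meta={'title': item.get('title', 'untitled')})

    def file_path(self, request, response=None, info=None, *, item=None):
        # store each image as <title>.jpg instead of the default SHA1 hash name
        return 'full/%s.jpg' % request.meta['title']

As usual, the pipeline has to be enabled via ITEM_PIPELINES and IMAGES_STORE in settings.py.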

Installing Scrapy with pip fails with a Twisted error

Submitted by 旧街凉风 on 2020-01-21 15:48:45
Scrapy depends on the following packages:

lxml: an efficient XML and HTML parser
w3lib: a multi-purpose helper for handling URLs and web page encodings
twisted: an asynchronous networking framework
cryptography and pyOpenSSL: handle various network-level security needs

——————————————————————————

1. Run the pip install once: pip install Scrapy

2. After that first run, everything should be installed except for twisted, which fails. Download twisted yourself; note that the wheel must match your Python version and your system's bitness. I am using Python 3.7 on a 64-bit system. https://www.lfd.uci.edu/~gohlke/pythonlibs/

3. After downloading, install it with pip: pip install [file path]\Twisted-18.9.0-cp37-cp37m-win_amd64.whl

4. Finally, run the pip install of Scrapy once more and it will succeed.

————————————————

Copyright notice: this is an original article by CSDN blogger "Sagittarius32", released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/sagittarius32/article/details/85345142
Source: https://www

A roundup of installing the Python crawler framework Scrapy

Submitted by 半世苍凉 on 2020-01-21 11:22:27
Installing Scrapy the traditional way (use with caution): After practicing the basics, it's natural to look for a framework to experiment with crawling. Searching the web for how to install Scrapy on 64-bit Windows turns up very tedious procedures: because Scrapy has many dependencies, you are told to install all of them before installing Scrapy itself. Some of the dependency libraries are: the lxml module, the cryptography module, the pywin32 module, the Twisted module, the pyOpenSSL module, and so on. Would Python really let us install all of these smoothly? Some people will say: I don't believe it, I'll just run pip install Scrapy and see whether it installs directly. You type the command, hit Enter, watch the installation progress scroll by with a grin, thinking you can finally start crawling, only to be greeted at the end by an error: failed with error code 1 in C:****************\Temp\pip-build-5f9_epll\Twisted\. It turns out the Twisted dependency is missing; running pip install Twisted still ends with an error in the command-line tool, so the installation fails (the command pip install Twisted[windows_platform] was also tried, to no avail).

Installing the Twisted module: Here is a way to install the Twisted module. First you need to install the wheel module

Scrapy: Default values for items & fields. What is the best implementation?

Submitted by 末鹿安然 on 2020-01-21 11:09:05
Question: As far as I could find out from the documentation and various discussions on the net, the ability to add default values to fields in a Scrapy item has been removed. This doesn't work:

category = Field(default='null')

So my question is: what is a good way to initialize fields with a default value? I already tried to implement it as an item pipeline as suggested here, without any success: https://groups.google.com/forum/?fromgroups=#!topic/scrapy-users/-v1p5W41VDQ

Answer 1: figured out what the
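One commonly suggested approach is a small item pipeline that fills in every declared field the spider left unset. This is only a sketch; the default value 'null' mirrors the question, everything else is assumed:

# pipelines.py - fill unset fields with a default value (illustrative sketch)
class DefaultValuesPipeline:
    def process_item(self, item, spider):
        for field in item.fields:           # all declared scrapy.Field attributes
            item.setdefault(field, 'null')  # only set it if the spider left it empty
        return item

The pipeline still has to be enabled in ITEM_PIPELINES, and since setdefault() never overwrites existing values, it only touches fields the spider never populated.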

Scrapy: Google Crawl doesn't work

Submitted by 青春壹個敷衍的年華 on 2020-01-21 10:42:45
Question: When I try to crawl Google for search results, Scrapy just yields the Google home page: http://pastebin.com/FUbvbhN4 Here is my spider:

import scrapy

class GoogleFinanceSpider(scrapy.Spider):
    name = "google"
    start_urls = ['http://www.google.com/#q=finance.google.com:+3m+co']
    allowed_domains = ['www.google.com']

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Is there something wrong with this url as a starting
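One likely explanation, which the truncated question does not reach: everything after the # in the start URL is a URL fragment, which browsers handle with JavaScript but which is never sent to the server, so Scrapy simply fetches the bare Google home page. A sketch of a start URL that puts the query in the query string instead; note that Google may still throttle or block automated requests and its markup changes often, so this is illustrative only:

import scrapy

class GoogleFinanceSpider(scrapy.Spider):
    name = "google"
    # the query lives in ?q=, which is actually sent to the server
    start_urls = ['https://www.google.com/search?q=finance.google.com:+3m+co']
    allowed_domains = ['www.google.com']

    def parse(self, response):
        filename = 'google_results.html'   # the url.split("/")[-2] trick is not meaningful for this URL
        with open(filename, 'wb') as f:
            f.write(response.body)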
