scrapy

Scrapy components: item

Submitted by ◇◆丶佛笑我妖孽 on 2020-01-22 20:53:54
Scrapy is a popular web-crawling framework. Starting now I will record the whole process of learning Scrapy under Python 3.6, to make later additions and review easier. "Python web crawling with scrapy (1)" already covered installing Scrapy, creating a project, and testing the basic commands; this post explains item definition, extraction, and usage in detail.

Item definition

An item is a container for the scraped data. It is used much like a dictionary, but provides an extra protection mechanism so that typos become undefined-field errors instead of silently creating new keys. An item is defined by declaring class attributes of type scrapy.Field; edit items.py to define the items you need:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

# Container that holds the data we scrape
import scrapy

class ExampleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()          # each attribute is a Field object
    population = scrapy.Field()
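To illustrate the dict-like behaviour described above, here is a minimal, hypothetical spider snippet; the project module, URL, and CSS selectors are assumptions for the sketch, not taken from the original post:

# spider sketch using the ExampleItem defined above
import scrapy
from example.items import ExampleItem   # hypothetical project module

class CountrySpider(scrapy.Spider):
    name = 'country'
    start_urls = ['http://example.com/countries']   # placeholder URL

    def parse(self, response):
        item = ExampleItem()
        item['name'] = response.css('h1::text').get()            # assumed selector
        item['population'] = response.css('.population::text').get()
        # item['capital'] = '...'  # would raise KeyError: 'capital' is not a declared field
        yield item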

How to import django models in scrapy pipelines.py file

Submitted by 天涯浪子 on 2020-01-22 16:17:26
Question: I'm trying to import the models of one Django application in my pipelines.py so I can save data with the Django ORM. I created a Scrapy project, scrapy_project, inside the first Django application involved, "app1" (is that a good choice, by the way?). I added these lines to my Scrapy settings file:

def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    setup_environ
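For reference, a rough sketch of the more recent way to bootstrap Django from a Scrapy project, after setup_environ was removed from Django; the paths and module names below are placeholders, not taken from the question. Once this runs in the Scrapy settings module, pipelines.py can import the models directly:

# settings.py of the Scrapy project
import os
import sys
import django

sys.path.append('/path/to/django/project')                           # assumed Django project location
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')   # hypothetical settings module
django.setup()                                                       # stands in for the old setup_environ()

# pipelines.py
from app1.models import MyModel           # hypothetical model from the "app1" application

class DjangoWriterPipeline:
    def process_item(self, item, spider):
        MyModel.objects.create(name=item.get('name'))   # persist via the Django ORM
        return item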

Crawling through pages with PostBack data javascript Python Scrapy

Submitted by 房东的猫 on 2020-01-22 08:30:28
Question: I'm crawling through some directories built with ASP.NET using Scrapy. The pages to crawl through are linked like this:

javascript:__doPostBack('ctl00$MainContent$List','Page$X')

where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something to the SLE rules as simple as allow=('Page$') or attrs='__doPostBack', but my guess is that I have to be trickier in order to pull the info from the javascript
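__doPostBack() simply submits the page's form with the hidden __EVENTTARGET and __EVENTARGUMENT fields set, so the same effect can usually be achieved by posting those fields explicitly. A sketch under that assumption; the URL, spider name, and callback are placeholders, not from the question:

# simulating the ASP.NET postback with FormRequest
import scrapy
from scrapy.http import FormRequest

class DirectorySpider(scrapy.Spider):
    name = 'aspnet_directory'
    start_urls = ['http://www.example.com/directory.aspx']   # placeholder URL

    def parse(self, response):
        for page in range(1, 181):
            yield FormRequest.from_response(
                response,
                formdata={
                    '__EVENTTARGET': 'ctl00$MainContent$List',
                    '__EVENTARGUMENT': 'Page$%d' % page,
                },
                callback=self.parse_page,
                dont_filter=True,   # the form posts back to the same URL for every page
            )

    def parse_page(self, response):
        # extract the directory entries here
        pass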

Scrapy Image Pipeline: How to rename images?

Submitted by 笑着哭i on 2020-01-22 02:51:07
Question: I have a spider which fetches both data and images. I want to rename the images with the respective 'title' that I'm fetching. Following is my code:

spider1.py

from imageToFileSystemCheck.items import ImagetofilesystemcheckItem
import scrapy

class TestSpider(scrapy.Spider):
    name = 'imagecheck'

    def start_requests(self):
        searchterms = ['keyword1', 'keyword2']
        for item in searchterms:
            yield scrapy.Request('http://www.example.com/s?=%s' % item,
                                 callback=self.parse, meta={'item': item})

    def parse
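A common way to get custom image names is to subclass ImagesPipeline and override file_path(). The sketch below assumes the item carries 'title' and 'image_urls' fields; the class name and field names are hypothetical, not from the question:

# pipelines.py - renaming images via a custom ImagesPipeline (illustrative sketch)
import scrapy
from scrapy.pipelines.images import ImagesPipeline

class RenamedImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # pass the title along with each image request
        for url in item.get('image_urls', []):
            yield scrapy.Request(url, meta={'title': item.get('title', 'untitled')})

    def file_path(self, request, response=None, info=None, *, item=None):
        # store each image as <title>.jpg instead of the default SHA1 hash name
        return 'full/%s.jpg' % request.meta['title']

As usual, the pipeline has to be enabled via ITEM_PIPELINES and IMAGES_STORE in settings.py.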

Installing Scrapy with pip fails with a Twisted error

Submitted by 旧街凉风 on 2020-01-21 15:48:45
Scrapy depends on the following packages:

lxml: an efficient XML and HTML parser
w3lib: a multi-purpose helper for handling URLs and web page encodings
twisted: an asynchronous networking framework
cryptography and pyOpenSSL: handle various network-level security needs

——————————————————————————

1. Run the pip install once: pip install Scrapy

2. After that first run, everything should be installed except for twisted, which fails. Download twisted yourself; note that the wheel must match your Python version and your system's bitness. I am using Python 3.7 on a 64-bit system. https://www.lfd.uci.edu/~gohlke/pythonlibs/

3. After downloading, install it with pip: pip install [file path]\Twisted-18.9.0-cp37-cp37m-win_amd64.whl

4. Finally, run the pip install of Scrapy once more and it will succeed.

————————————————

Copyright notice: this is an original article by CSDN blogger "Sagittarius32", released under the CC 4.0 BY-SA license; please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/sagittarius32/article/details/85345142
Source: https://www

A roundup of installing the Python crawler framework Scrapy

Submitted by 半世苍凉 on 2020-01-21 11:22:27
Installing Scrapy the traditional way (use with caution): After practicing the basics, it's natural to look for a framework to experiment with crawling. Searching the web for how to install Scrapy on 64-bit Windows turns up very tedious procedures: because Scrapy has many dependencies, you are told to install all of them before installing Scrapy itself. Some of the dependency libraries are: the lxml module, the cryptography module, the pywin32 module, the Twisted module, the pyOpenSSL module, and so on. Would Python really let us install all of these smoothly? Some people will say: I don't believe it, I'll just run pip install Scrapy and see whether it installs directly. You type the command, hit Enter, watch the installation progress scroll by with a grin, thinking you can finally start crawling, only to be greeted at the end by an error: failed with error code 1 in C:****************\Temp\pip-build-5f9_epll\Twisted\. It turns out the Twisted dependency is missing; running pip install Twisted still ends with an error in the command-line tool, so the installation fails (the command pip install Twisted[windows_platform] was also tried, to no avail).

Installing the Twisted module: Here is a way to install the Twisted module. First you need to install the wheel module

Scrapy: Default values for items & fields. What is the best implementation?

Submitted by 末鹿安然 on 2020-01-21 11:09:05
Question: As far as I could find out from the documentation and various discussions on the net, the ability to add default values to fields in a Scrapy item has been removed. This doesn't work:

category = Field(default='null')

So my question is: what is a good way to initialize fields with a default value? I already tried to implement it as an item pipeline as suggested here, without any success: https://groups.google.com/forum/?fromgroups=#!topic/scrapy-users/-v1p5W41VDQ

Answer 1: figured out what the
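One commonly suggested approach is a small item pipeline that fills in every declared field the spider left unset. This is only a sketch; the default value 'null' mirrors the question, everything else is assumed:

# pipelines.py - fill unset fields with a default value (illustrative sketch)
class DefaultValuesPipeline:
    def process_item(self, item, spider):
        for field in item.fields:           # all declared scrapy.Field attributes
            item.setdefault(field, 'null')  # only set it if the spider left it empty
        return item

The pipeline still has to be enabled in ITEM_PIPELINES, and since setdefault() never overwrites existing values, it only touches fields the spider never populated.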

Scrapy: Google Crawl doesn't work

Submitted by 青春壹個敷衍的年華 on 2020-01-21 10:42:45
Question: When I try to crawl Google for search results, Scrapy just yields the Google home page: http://pastebin.com/FUbvbhN4 Here is my spider:

import scrapy

class GoogleFinanceSpider(scrapy.Spider):
    name = "google"
    start_urls = ['http://www.google.com/#q=finance.google.com:+3m+co']
    allowed_domains = ['www.google.com']

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Is there something wrong with this url as a starting
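One likely explanation, which the truncated question does not reach: everything after the # in the start URL is a URL fragment, which browsers handle with JavaScript but which is never sent to the server, so Scrapy simply fetches the bare Google home page. A sketch of a start URL that puts the query in the query string instead; note that Google may still throttle or block automated requests and its markup changes often, so this is illustrative only:

import scrapy

class GoogleFinanceSpider(scrapy.Spider):
    name = "google"
    # the query lives in ?q=, which is actually sent to the server
    start_urls = ['https://www.google.com/search?q=finance.google.com:+3m+co']
    allowed_domains = ['www.google.com']

    def parse(self, response):
        filename = 'google_results.html'   # the url.split("/")[-2] trick is not meaningful for this URL
        with open(filename, 'wb') as f:
            f.write(response.body)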
