scrapy

Scrapy css selector: get text of all inner tags

北慕城南 submitted on 2020-01-10 09:24:32
Question: I have a tag and I want to get all the text available inside it. I am doing this: response.css('mytag::text'). But it only gets the text of the tag itself; I also want the text from all the inner tags. I know I could do something like response.xpath('//mytag//text()'), but I would like to do it with CSS selectors. How can I achieve this? Answer 1: response.css('mytag *::text'). The * visits all the inner tags of mytag, and ::text gets the text of each of them. Answer 2: Get text of only
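A minimal sketch contrasting the two forms from the answer, using a made-up <mytag> fragment and Scrapy's HtmlResponse (the markup and URL are placeholders, not from the question):

# Placeholder markup/URL, just to show what each selector returns.
from scrapy.http import HtmlResponse

body = b'<html><body><mytag>outer<b>bold</b><i>italic</i></mytag></body></html>'
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')

print(response.css('mytag::text').getall())        # ['outer'] - text of the tag itself only
print(response.css('mytag *::text').getall())      # ['bold', 'italic'] - text of the inner tags
print(response.xpath('//mytag//text()').getall())  # ['outer', 'bold', 'italic'] - everything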

Force spider to stop in scrapy

家住魔仙堡 submitted on 2020-01-10 03:26:08
Question: I have 20 spiders in one project; each spider has a different task and URL to crawl (but the data are similar, and I'm using a shared items.py and pipelines.py for all of them). In my pipelines class I want a specific spider to stop crawling when certain conditions are satisfied. I've tested raise DropItem("terminated by me") and raise CloseSpider('terminate by me'), but both of them only stop the current run and the next_page URL is still being crawled! Part of my pipelines.py class
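CloseSpider is intended to be raised from spider callbacks, which is probably why raising it inside a pipeline does not halt the crawl. A common alternative is to ask the engine to close the spider from the pipeline; a minimal sketch, where should_stop() stands in for the question's unspecified condition:

# Sketch only: should_stop() is a placeholder for the real condition.
from scrapy.exceptions import DropItem

class StopOnConditionPipeline(object):

    def process_item(self, item, spider):
        if self.should_stop(item, spider):
            # Ask the engine to shut this spider down. The shutdown is graceful,
            # so a few already-scheduled requests may still complete.
            spider.crawler.engine.close_spider(spider, reason='terminated by pipeline')
            raise DropItem('terminated by me')
        return item

    def should_stop(self, item, spider):
        # Placeholder for the condition described in the question.
        return False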

Scraping with Scrapy and Selenium

こ雲淡風輕ζ submitted on 2020-01-09 19:33:12
Question: I have a Scrapy spider which crawls a site that reloads content via JavaScript on the page. In order to move to the next page to scrape, I have been using Selenium to click on the month link at the top of the site. The problem is that, even though my code moves through each link as expected, the spider just scrapes the first month's (Sept) data once for every month and returns this duplicate data. How can I get around this? from selenium import webdriver class GigsInScotlandMain(InitSpider)
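The usual cause of this kind of duplication is parsing stale HTML after each click instead of the freshly rendered page. A hedged sketch of one workaround, with placeholder URL and selectors (not taken from the original spider): wait for the old results to go stale after each click, then rebuild a selector from driver.page_source.

# Sketch only: the URL and CSS selectors below are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.selector import Selector

driver = webdriver.Firefox()
driver.get('http://example.com/gigs')  # placeholder URL

num_months = len(driver.find_elements(By.CSS_SELECTOR, 'a.month-link'))  # placeholder selector
for i in range(num_months):
    # Re-locate the link each time: element references go stale once the page re-renders.
    link = driver.find_elements(By.CSS_SELECTOR, 'a.month-link')[i]
    old_results = driver.find_element(By.CSS_SELECTOR, 'div.results')    # placeholder selector
    link.click()
    # Wait until the old results block has been replaced before scraping again.
    WebDriverWait(driver, 10).until(EC.staleness_of(old_results))
    sel = Selector(text=driver.page_source)  # fresh HTML, not the first month's
    # ... extract this month's data from `sel` here ...

driver.quit()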

The Scrapy framework

。_饼干妹妹 submitted on 2020-01-08 19:37:42
1. Introduction. Official Scrapy documentation: https://docs.scrapy.org/en/latest/topics/commands.html. Scrapy is an open-source, collaborative framework originally designed for page scraping (more precisely, web scraping); with it you can extract the data you need from websites in a fast, simple, and extensible way. Scrapy's uses are now much broader: it is applied to data mining, monitoring, and automated testing, and it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is built on top of Twisted, a popular event-driven Python networking framework, so it uses non-blocking (i.e. asynchronous) code to achieve concurrency. The overall architecture is roughly as follows. The data flow in Scrapy is controlled by the execution engine, and goes like this: The Engine gets the initial Requests to crawl from the Spider. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl. The Scheduler
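A minimal spider sketch illustrating the start of that data flow: the Engine takes the initial Requests from the spider's start_urls, hands them to the Scheduler, and calls parse() with each downloaded Response; yielding a new Request sends it back through the Engine to the Scheduler. The URL and selectors are placeholders (the quotes.toscrape.com demo site), not part of the article.

# Placeholder spider to illustrate the data flow described above.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # The Engine obtains the initial Requests from these URLs.
    start_urls = ['http://quotes.toscrape.com/']

    # Called by the Engine once the Downloader returns a Response.
    def parse(self, response):
        for quote in response.css('span.text::text').getall():
            yield {'quote': quote}
        # A new Request goes back to the Engine and is queued by the Scheduler.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)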

A small Scrapy crawler example (1)

吃可爱长大的小学妹 submitted on 2020-01-08 18:11:50
1. items.py

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class MyspiderItem(scrapy.Item):
    brief = scrapy.Field()
    quote = scrapy.Field()

2. myspider.py

# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import MyspiderItem


# Create a spider class
class MyspiderSpider(scrapy.Spider):
    # Spider name
    name = 'myspider'
    # Scope the spider is allowed to crawl (it will only scrape pages under the address you give here)
    allowed_domains = ['https://market.douban.com/']
    # The spider's starting URL
    start_urls = ['https://market.douban.com/book/?utm_campaign=book_freyr_section&utm_source
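The excerpt cuts off before the parse() callback; a hedged sketch of how such a callback might fill MyspiderItem follows (the CSS selectors are guesses for illustration, not taken from the original post):

    # Hypothetical continuation of MyspiderSpider; selectors below are placeholders.
    def parse(self, response):
        for book in response.css('div.book-item'):               # placeholder selector
            item = MyspiderItem()
            item['brief'] = book.css('.book-brief::text').get()  # placeholder selector
            item['quote'] = book.css('.book-quote::text').get()  # placeholder selector
            yield item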

A small Scrapy crawler example

南笙酒味 submitted on 2020-01-08 17:49:51
Taking the scraping of book information from Douban Books as an example (scraping the information underlined in red in the screenshot). 1. First create a mySpider project (how to create a project was covered above). 2. Open items.py in the mySpider directory. An Item defines the structured data fields used to hold the scraped data (since two lines of information are to be scraped, two fields are defined below to store the strings).

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class MyspiderItem(scrapy.Item):
    brief = scrapy.Field()
    quote = scrapy.Field()

3. Create the spider in the terminal. 4. Rewrite myspider.py:

# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import MyspiderItem


# Create a spider class
class MyspiderSpider(scrapy.Spider):
    # Spider name
    name = 'myspider'
    # Scope the spider is allowed to crawl (it will only scrape pages under the address you give here)
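Step 3 above does not show the actual command; the spider skeleton is normally generated from inside the project directory with scrapy genspider myspider market.douban.com (the exact invocation is an assumption about what the author ran), and the generated file is then rewritten as in step 4.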

Summary of a crawler development environment

会有一股神秘感。 submitted on 2020-01-08 09:31:00
Table of contents: 1 Python; 2 Request libraries: 2.1 requests, 2.2 Installing Selenium, 2.3 ChromeDriver, 2.4 PhantomJS, 2.5 aiohttp; 3 Parsing libraries: 3.1 lxml, 3.2 Beautiful Soup, 3.3 pyquery, 3.4 tesserocr; 4 Databases: 4.1 MySQL, 4.2 MongoDB, 4.3 Redis; 5 Storage libraries: 5.1 PyMySQL, 5.2 PyMongo, 5.3 redis-py, 5.4 RedisDump; 6 Web libraries: 6.1 Flask, 6.2 Tornado; 7 Libraries for app scraping: 7.1 Charles, 7.2 mitmproxy, 7.3 Appium; 8 Crawler frameworks: 8.1 pyspider, 8.2 Scrapy, 8.3 Scrapy-Splash, 8.4 Scrapy-Redis; 9 Deployment-related libraries: 9.1 Docker, 9.2 Scrapyrt, 9.3 Gerapy; 10 References. 1 Python: since we are going to develop crawlers with Python 3, the first step is naturally to install Python 3. 2 Request libraries: a crawler can be roughly divided into a few steps: fetching pages, parsing pages, and storing the data. To fetch a page we need to simulate a browser sending a request to the server, which requires Python libraries that implement HTTP requests. The third-party libraries used in this book include

Scrapy source code: the Response object

孤街醉人 submitted on 2020-01-07 23:47:37
"""This module implements the Response class which is used to represent HTTP
responses in Scrapy.

See documentation in docs/topics/request-response.rst
"""
from six.moves.urllib.parse import urljoin

from scrapy.http.request import Request
from scrapy.http.headers import Headers
from scrapy.link import Link
from scrapy.utils.trackref import object_ref
from scrapy.http.common import obsolete_setter
from scrapy.exceptions import NotSupported


class Response(object_ref):

    def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None):
        self.headers = Headers(headers or
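The excerpt shows the base Response class; spider callbacks usually receive its HTML-aware subclass. A small usage sketch, constructing an HtmlResponse by hand purely for illustration (URL, headers, and body are made up; normally Scrapy builds the response for you):

# Hand-built response, just to show the public API.
from scrapy.http import HtmlResponse

resp = HtmlResponse(
    url='http://example.com/index.html',
    status=200,
    headers={'Content-Type': 'text/html'},
    body=b'<html><body><a href="/next">next</a></body></html>',
    encoding='utf-8',
)

print(resp.status)                       # 200
print(resp.headers.get('Content-Type'))  # b'text/html'
print(resp.css('a::attr(href)').get())   # '/next'
print(resp.urljoin('/next'))             # 'http://example.com/next'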

Scrapy source code: the Request object

孤街浪徒 submitted on 2020-01-07 23:46:59
"""This module implements the Request class which is used to represent HTTP
requests in Scrapy.

See documentation in docs/topics/request-response.rst
"""
import six
from w3lib.url import safe_url_string

from scrapy.http.headers import Headers
from scrapy.utils.python import to_bytes
from scrapy.utils.trackref import object_ref
from scrapy.utils.url import escape_ajax
from scrapy.http.common import obsolete_setter
from scrapy.utils.curl import curl_to_request_kwargs


class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None, cookies=None, meta
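In a spider, the constructor parameters shown in the excerpt are typically supplied like this; a minimal sketch with placeholder URL, headers, cookies, and meta values (none of them from the source module):

# Placeholder spider showing typical Request constructor arguments.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']  # placeholder

    def parse(self, response):
        yield scrapy.Request(
            url=response.urljoin('/page/2'),
            callback=self.parse_page,          # called with the Response once downloaded
            method='GET',
            headers={'User-Agent': 'my-bot'},  # placeholder header
            cookies={'session': 'abc'},        # placeholder cookie
            meta={'page': 2},                  # arbitrary data carried over to the callback
        )

    def parse_page(self, response):
        yield {'page': response.meta['page'],  # meta passed through from the Request
               'title': response.css('title::text').get()}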

Python crawler ---- (2. the Scrapy framework)

纵饮孤独 submitted on 2020-01-07 20:07:22
Scrapy is a fast, high-level screen-scraping and web-crawling framework developed in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses and can be applied to data mining, monitoring, and automated testing. I have only just started learning the framework, so it is hard to judge; my impression is that it feels somewhat Java-like, needing support from quite a few other modules.

(1) Creating a scrapy project

# Use:
scrapy startproject scrapy_test
├── scrapy_test
│   ├── scrapy.cfg
│   └── scrapy_test
│       ├── __init__.py
│       ├── items.py
│       ├── pipelines.py
│       ├── settings.py
│       └── spiders
│           ├── __init__.py
# This creates the scrapy project

(2) Explanation
scrapy.cfg: project configuration file
items.py: defines the structure of the data to be extracted
pipelines.py: pipeline definitions, used for further processing of the data extracted into items, such as saving it
settings.py: crawler configuration file
spiders: directory that holds the spiders

(3) Dependencies
The dependency packages are somewhat troublesome.
# install the python-dev package
apt-get install python-dev
#
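To make the role of pipelines.py described in (2) concrete, here is a hedged sketch of a minimal pipeline that writes each extracted item out as one JSON line; the file name is a placeholder, and the pipeline still has to be enabled via ITEM_PIPELINES in settings.py:

# Minimal item pipeline sketch (placeholder file name).
import json

class JsonLinesPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')  # placeholder output file

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Each item extracted by the spider is written out as one JSON line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item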