Scrapy

How to get a single item across many sites in scrapy?

不羁的心 posted on 2020-01-03 05:26:13

Question: I have this situation: I want to crawl product details from a specific product detail page that describes the product (Page A). Page A contains a link to a page listing the sellers of this product (Page B), and each seller on Page B links to another page (Page C) that contains the seller's details. Here is an example schema:

Page A:
    product_name
    link to the sellers of this product (Page B)
Page B: list of sellers, each one containing:
    seller_name
    selling_price
    link to the seller details page (Page C)
Page C: …
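
A common way to handle this (not shown in the truncated post above) is to carry the partially built item through the chained requests with cb_kwargs (Scrapy 1.7+; request.meta on older versions). A minimal sketch, in which the URLs, CSS selectors and the seller_address field are made-up placeholders:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "product"
    start_urls = ["https://example.com/product/123"]  # placeholder Page A URL

    def parse(self, response):
        # Page A: start the item and follow the link to the sellers page (Page B)
        item = {"product_name": response.css("h1::text").get()}
        sellers_url = response.css("a.sellers::attr(href)").get()
        yield response.follow(sellers_url, self.parse_sellers, cb_kwargs={"item": item})

    def parse_sellers(self, response, item):
        # Page B: one request per seller, copying the item so sellers do not overwrite each other
        for seller in response.css("div.seller"):
            seller_item = dict(item)
            seller_item["seller_name"] = seller.css(".name::text").get()
            seller_item["selling_price"] = seller.css(".price::text").get()
            details_url = seller.css("a.details::attr(href)").get()
            yield response.follow(details_url, self.parse_seller_details,
                                  cb_kwargs={"item": seller_item})

    def parse_seller_details(self, response, item):
        # Page C: add the seller details and yield the completed item
        item["seller_address"] = response.css(".address::text").get()
        yield item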

Scrapy-splash - does splash:go(url) in lua_script perform GET request again?

為{幸葍}努か posted on 2020-01-03 05:16:09

Question: I'm new to Scrapy-Splash and I'm trying to scrape a lazy datatable, i.e. a table with AJAX pagination. So I need to load the website, wait until the JS has executed, get the HTML of the table and then click the "Next" button of the pagination. My approach works, but I'm afraid I'm requesting the website twice: the first time when I yield the SplashRequest, and then again when the lua_script is executed. Is that true? If yes, how do I make it perform the request just once?

class JSSpider(scrapy.Spider):
    name = 'js
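
For reference, and not taken from the post: when a SplashRequest is sent to Splash's execute endpoint, the target page is only fetched when the Lua script calls splash:go(), so the URL is requested once per SplashRequest rather than twice. A minimal sketch under that assumption, with a placeholder URL and table selector:

import scrapy
from scrapy_splash import SplashRequest

# The page is fetched by splash:go() inside the script; Splash does not
# issue a separate GET for the URL before running it.
lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    return {html = splash:html()}
end
"""

class JSSpider(scrapy.Spider):
    name = "js"
    start_urls = ["https://example.com/table"]  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # endpoint="execute" runs the Lua script against the given URL
            yield SplashRequest(url, self.parse, endpoint="execute",
                                args={"lua_source": lua_script})

    def parse(self, response):
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}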

ValueError while deploying Scrapy

偶尔善良 posted on 2020-01-03 04:42:07

Question: I am trying to deploy Scrapy with the scrapyd-deploy command and it now throws the following error:

Packing version 1526919848
Deploying to project "first_scrapy" in http://my_ip:6800/addversion.json
Server response (200): {"node_name": "polo", "message": "Traceback (most recent call last):\n File \"/usr/lib/python3.5/logging/config.py\", line 558, in configure\n handler = self.configure_handler(handlers[name])\n File \"/usr/lib/python3.5/logging/config.py\", line 731, in configure_handler\n result =
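
For context (not part of the original post): scrapyd-deploy reads its deploy target from the project's scrapy.cfg. A minimal sketch matching the project name and endpoint shown in the output above; the settings module name is an assumption:

# scrapy.cfg (sketch)
[settings]
default = first_scrapy.settings

[deploy]
url = http://my_ip:6800/
project = first_scrapy

With this in place, running scrapyd-deploy from the project directory packages the project and posts it to the addversion.json endpoint shown above.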

Scrapy: a mobile-app packet-capture spider

谁说胖子不能爱 posted on 2020-01-03 04:39:28

A spider built from mobile-app packet capture.

1. items.py

import scrapy

class DouyuspiderItem(scrapy.Item):
    name = scrapy.Field()        # name under which the photo is stored
    imagesUrls = scrapy.Field()  # URL of the photo
    imagesPath = scrapy.Field()  # local path where the photo is saved

2. spiders/douyu.py

import scrapy
import json
from douyuSpider.items import DouyuspiderItem

class DouyuSpider(scrapy.Spider):
    name = "douyu"
    allowed_domains = ["capi.douyucdn.cn"]
    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    start_urls = [url + str(offset)]

    def parse(self, response):
        # the "data" field of the JSON response holds the list of rooms
        data = json.loads(response.text)["data"]
        for each in data:
            item = DouyuspiderItem
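
The excerpt cuts off inside the loop; a minimal sketch of how the parse method plausibly continues. The field mapping and the pagination step are assumptions, not taken from the truncated original:

    def parse(self, response):
        data = json.loads(response.text)["data"]
        for each in data:
            item = DouyuspiderItem()
            # guessed mapping from the API response to the item fields above
            item["name"] = each["nickname"]
            item["imagesUrls"] = each["vertical_src"]
            yield item
        # advance the offset to request the next page of the API
        self.offset += 20
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)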

Web scraping with the Scrapy framework, case 1: mobile app packet capture

两盒软妹~` posted on 2020-01-03 04:38:26

Taking the Douyu live-streaming platform as an example.
URL: http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=0
Fields to crawl: room ID, room name, image link, local path of the stored image, nickname, number of viewers online, city.

1. items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class DouyuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    room_id = scrapy.Field()       # room ID
    room_name = scrapy.Field()     # room name
    vertical_src = scrapy.Field()  # image link
    image_path = scrapy.Field()    # local path of the stored image
    nickname = scrapy.Field()      # nickname
    online = scrapy.Field()        # number of viewers online
    anchor
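
Since the item pairs an image URL (vertical_src) with a local path (image_path), the download is typically handled by a custom ImagesPipeline. The sketch below is an illustration, not part of the original post; the class name and field wiring are assumptions:

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class DouyuImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule a download for the image URL carried by the item
        yield scrapy.Request(item["vertical_src"])

    def item_completed(self, results, item, info):
        # record where the downloaded file ended up (path is relative to IMAGES_STORE)
        image_paths = [x["path"] for ok, x in results if ok]
        if image_paths:
            item["image_path"] = image_paths[0]
        return item

This pipeline would also need ITEM_PIPELINES and IMAGES_STORE entries in settings.py; the module path used there depends on the project layout.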

Crawling Qidian novels with the Scrapy framework on Python 3.7

谁都会走 posted on 2020-01-03 04:34:11

I have been learning the Scrapy framework over the past few days and feel I have gained something from it, so I tried using it to crawl some data as a small summary of this stage of my learning. The target this time is the free-works section of the Qidian Chinese web novel site (起点中文网).

In total 100 novels were crawled, and the results were stored in two ways:
1. the novel content is written to txt files, chapter by chapter;
2. the novel content is stored in SQL Server.

The implementation logic (a spider sketch illustrating this chain follows the settings excerpt below):
1. get each book's URL from the book list page;
2. from a book's URL, get its chapters and the URL of each chapter;
3. from each chapter's URL, get the chapter's text content;
4. store the extracted text, to txt and to SQL Server.

Project code: create a Scrapy project named qidian and a spider named xiaoshuo.py.

settings.py:

BOT_NAME = 'qidian'
SPIDER_MODULES = ['qidian.spiders']
NEWSPIDER_MODULE = 'qidian.spiders'
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537
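
The post's spider code is not included in this excerpt; below is a minimal sketch of the three chained callbacks implied by steps 1-3 above. All URLs and selectors are placeholder assumptions, and step 4 (txt / SQL Server storage) would live in item pipelines:

import scrapy

class XiaoshuoSpider(scrapy.Spider):
    name = "xiaoshuo"
    start_urls = ["https://www.qidian.com/free"]  # placeholder list-page URL

    def parse(self, response):
        # step 1: book list page, one request per book
        for book_url in response.css("h4 a::attr(href)").getall():
            yield response.follow(book_url, self.parse_book)

    def parse_book(self, response):
        # step 2: book page, chapter list; keep the book title with each request
        book_name = response.css("h1 em::text").get()
        for chapter_url in response.css(".volume li a::attr(href)").getall():
            yield response.follow(chapter_url, self.parse_chapter,
                                  cb_kwargs={"book_name": book_name})

    def parse_chapter(self, response, book_name):
        # step 3: chapter page, text content
        yield {
            "book_name": book_name,
            "chapter_name": response.css("h3 .j_chapterName::text").get(),
            "content": "\n".join(response.css(".read-content p::text").getall()),
        }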

Web scraping in practice: a Scrapy spider for downloading BMW images from Autohome

丶灬走出姿态 posted on 2020-01-03 04:22:52

(1) Introduction
The Scrapy framework provides two Item Pipelines dedicated to downloading files and images: FilesPipeline and ImagesPipeline.

(2) Advantages of using Scrapy's built-in download pipelines
1. Duplicate downloads are avoided.
2. The download path is easy to specify.
3. Format conversion is easy, e.g. images can be converted to PNG or JPG.
4. Thumbnails are easy to generate.
5. Image resizing is easy.
6. Downloads are asynchronous and efficient.

(3) The more traditional way to download images with Scrapy
1. Create the project: scrapy startproject baoma, cd baoma, then create the spider: scrapy genspider spider car.autohome.com.cn
2. Open the project in PyCharm, edit settings.py (do not obey robots.txt, set the request headers), enable the pipeline in pipelines.py, and rewrite spider.py:

# -*- coding: utf-8 -*-
import scrapy
from ..items import BaomaItem

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com
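
The spider is cut off at start_urls; a rough sketch of how the traditional approach typically continues. The XPath and the item field name 'urls' are guesses, not taken from the post:

    def parse(self, response):
        # gather the image thumbnails on the page
        item = BaomaItem()
        srcs = response.xpath('//div[contains(@class, "uibox")]//img/@src').getall()
        # thumbnail URLs are often protocol-relative, so let urljoin add the scheme
        item['urls'] = [response.urljoin(src) for src in srcs]
        yield item

A pipeline (or the spider itself) would then write each URL's bytes to disk, which is the manual step that the built-in ImagesPipeline from section (1) replaces.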

Trying out Scrapy, day 02 (regex extraction)

匆匆过客 posted on 2020-01-03 04:11:20

1. Processing approaches
Approach 1: via HtmlXPathSelector

import scrapy
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "use_scrapy"  # the name used to invoke the spider
    allowed_domains = ["use_scrapy.com"]  # restrict crawling to this domain
    start_urls = [  # all URLs to crawl
        "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&sm=0&p=1"
    ]

    # parse() is called back after each page has been crawled
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        print('_________________________')
        hxsobj = hxs.select('//td[@class="zwmc"]/div/a')
        print(hxsobj[0].select("@href").extract())   # get the link
        print(hxsobj[0].select("text()").extract())  # get the text
        # .extract() returns the raw text from the page
        print(len
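
As an aside not in the original post: HtmlXPathSelector has long been deprecated, and on current Scrapy versions the same extraction is usually written directly against the response, roughly like this:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "use_scrapy"
    allowed_domains = ["use_scrapy.com"]
    start_urls = [
        "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&sm=0&p=1"
    ]

    def parse(self, response):
        # response.xpath() replaces HtmlXPathSelector(response).select()
        links = response.xpath('//td[@class="zwmc"]/div/a')
        if links:
            print(links[0].xpath("@href").get())   # the link
            print(links[0].xpath("text()").get())  # the text
        print(len(links))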

Scrapy + Selenium + Datepicker

给你一囗甜甜゛ posted on 2020-01-03 03:33:06

Question: I need to scrape a page like this one, for example, and I am using Scrapy + Selenium to interact with the datepicker calendar, but I am running into an ElementNotVisibleException: Message: Element is not currently visible and so may not be interacted with. So far I have:

def parse(self, response):
    self.driver.get("https://www.airbnb.pt/rooms/9315238")
    try:
        element = WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@name='checkin']"))
        )
    finally:
        x = self
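
A common cause of that exception (not from the original post) is waiting only for presence while the element is still hidden until it is clicked or scrolled into view; waiting for it to be clickable and then clicking is the usual fix. A rough sketch under those assumptions, where the datepicker XPath is a guess:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()  # any driver works; Firefox is just an example
driver.get("https://www.airbnb.pt/rooms/9315238")

wait = WebDriverWait(driver, 10)
# wait until the check-in field is actually visible and clickable, not merely present
checkin = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@name='checkin']")))
checkin.click()  # opens the datepicker widget

# then wait for a visible day cell before interacting with the calendar
# (the XPath below is a placeholder; the real datepicker markup may differ)
day = wait.until(EC.visibility_of_element_located(
    (By.XPATH, "//td[contains(@class, 'ui-datepicker')]//a")))
day.click()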