Scrapy

How to get a single item across many sites in scrapy?

不羁的心 posted on 2020-01-03 05:26:13

Question: I have this situation: I want to crawl product details from a specific product detail page that describes the product (Page A). Page A contains a link to a page listing the sellers of this product (Page B), and each seller on Page B links to another page (Page C) that contains the seller's details. Here is an example schema:

Page A:
    product_name
    link to the sellers of this product (Page B)
Page B: list of sellers, each one containing:
    seller_name
    selling_price
    link to the seller details page (Page C)
Page C: …
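
A common way to handle this (not shown in the truncated post above) is to carry the partially built item through the chained requests with cb_kwargs (Scrapy 1.7+; request.meta on older versions). A minimal sketch, in which the URLs, CSS selectors and the seller_address field are made-up placeholders:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "product"
    start_urls = ["https://example.com/product/123"]  # placeholder Page A URL

    def parse(self, response):
        # Page A: start the item and follow the link to the sellers page (Page B)
        item = {"product_name": response.css("h1::text").get()}
        sellers_url = response.css("a.sellers::attr(href)").get()
        yield response.follow(sellers_url, self.parse_sellers, cb_kwargs={"item": item})

    def parse_sellers(self, response, item):
        # Page B: one request per seller, copying the item so sellers do not overwrite each other
        for seller in response.css("div.seller"):
            seller_item = dict(item)
            seller_item["seller_name"] = seller.css(".name::text").get()
            seller_item["selling_price"] = seller.css(".price::text").get()
            details_url = seller.css("a.details::attr(href)").get()
            yield response.follow(details_url, self.parse_seller_details,
                                  cb_kwargs={"item": seller_item})

    def parse_seller_details(self, response, item):
        # Page C: add the seller details and yield the completed item
        item["seller_address"] = response.css(".address::text").get()
        yield item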

Scrapy-splash - does splash:go(url) in lua_script perform GET request again?

為{幸葍}努か posted on 2020-01-03 05:16:09

Question: I'm new to Scrapy-Splash and I'm trying to scrape a lazy datatable, i.e. a table with AJAX pagination. So I need to load the website, wait until the JS has executed, get the HTML of the table and then click the "Next" button of the pagination. My approach works, but I'm afraid I'm requesting the website twice: the first time when I yield the SplashRequest, and then again when the lua_script is executed. Is that true? If yes, how do I make it perform the request just once?

class JSSpider(scrapy.Spider):
    name = 'js
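
For reference, and not taken from the post: when a SplashRequest is sent to Splash's execute endpoint, the target page is only fetched when the Lua script calls splash:go(), so the URL is requested once per SplashRequest rather than twice. A minimal sketch under that assumption, with a placeholder URL and table selector:

import scrapy
from scrapy_splash import SplashRequest

# The page is fetched by splash:go() inside the script; Splash does not
# issue a separate GET for the URL before running it.
lua_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(2))
    return {html = splash:html()}
end
"""

class JSSpider(scrapy.Spider):
    name = "js"
    start_urls = ["https://example.com/table"]  # placeholder URL

    def start_requests(self):
        for url in self.start_urls:
            # endpoint="execute" runs the Lua script against the given URL
            yield SplashRequest(url, self.parse, endpoint="execute",
                                args={"lua_source": lua_script})

    def parse(self, response):
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}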

ValueError while deploying Scrapy

偶尔善良 posted on 2020-01-03 04:42:07

Question: I am trying to deploy Scrapy with the scrapyd-deploy command and it now throws the following error:

Packing version 1526919848
Deploying to project "first_scrapy" in http://my_ip:6800/addversion.json
Server response (200): {"node_name": "polo", "message": "Traceback (most recent call last):\n File \"/usr/lib/python3.5/logging/config.py\", line 558, in configure\n handler = self.configure_handler(handlers[name])\n File \"/usr/lib/python3.5/logging/config.py\", line 731, in configure_handler\n result =
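
For context (not part of the original post): scrapyd-deploy reads its deploy target from the project's scrapy.cfg. A minimal sketch matching the project name and endpoint shown in the output above; the settings module name is an assumption:

# scrapy.cfg (sketch)
[settings]
default = first_scrapy.settings

[deploy]
url = http://my_ip:6800/
project = first_scrapy

With this in place, running scrapyd-deploy from the project directory packages the project and posts it to the addversion.json endpoint shown above.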

Scrapy: a mobile-app packet-capture spider

谁说胖子不能爱 posted on 2020-01-03 04:39:28

A spider built from mobile-app packet capture.

1. items.py

import scrapy

class DouyuspiderItem(scrapy.Item):
    name = scrapy.Field()        # name under which the photo is stored
    imagesUrls = scrapy.Field()  # URL of the photo
    imagesPath = scrapy.Field()  # local path where the photo is saved

2. spiders/douyu.py

import scrapy
import json
from douyuSpider.items import DouyuspiderItem

class DouyuSpider(scrapy.Spider):
    name = "douyu"
    allowed_domains = ["capi.douyucdn.cn"]
    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    start_urls = [url + str(offset)]

    def parse(self, response):
        # the "data" field of the JSON response holds the list of rooms
        data = json.loads(response.text)["data"]
        for each in data:
            item = DouyuspiderItem
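
The excerpt cuts off inside the loop; a minimal sketch of how the parse method plausibly continues. The field mapping and the pagination step are assumptions, not taken from the truncated original:

    def parse(self, response):
        data = json.loads(response.text)["data"]
        for each in data:
            item = DouyuspiderItem()
            # guessed mapping from the API response to the item fields above
            item["name"] = each["nickname"]
            item["imagesUrls"] = each["vertical_src"]
            yield item
        # advance the offset to request the next page of the API
        self.offset += 20
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)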

Web scraping with the Scrapy framework, case 1: mobile app packet capture

两盒软妹~` posted on 2020-01-03 04:38:26

Taking the Douyu live-streaming platform as an example.
URL: http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset=0
Fields to crawl: room ID, room name, image link, local path of the stored image, nickname, number of viewers online, city.

1. items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class DouyuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    room_id = scrapy.Field()       # room ID
    room_name = scrapy.Field()     # room name
    vertical_src = scrapy.Field()  # image link
    image_path = scrapy.Field()    # local path of the stored image
    nickname = scrapy.Field()      # nickname
    online = scrapy.Field()        # number of viewers online
    anchor
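
Since the item pairs an image URL (vertical_src) with a local path (image_path), the download is typically handled by a custom ImagesPipeline. The sketch below is an illustration, not part of the original post; the class name and field wiring are assumptions:

import scrapy
from scrapy.pipelines.images import ImagesPipeline

class DouyuImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule a download for the image URL carried by the item
        yield scrapy.Request(item["vertical_src"])

    def item_completed(self, results, item, info):
        # record where the downloaded file ended up (path is relative to IMAGES_STORE)
        image_paths = [x["path"] for ok, x in results if ok]
        if image_paths:
            item["image_path"] = image_paths[0]
        return item

This pipeline would also need ITEM_PIPELINES and IMAGES_STORE entries in settings.py; the module path used there depends on the project layout.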

Crawling Qidian novels with the Scrapy framework on Python 3.7

谁都会走 posted on 2020-01-03 04:34:11

I have been learning the Scrapy framework over the past few days and feel I have gained something from it, so I tried using it to crawl some data as a small summary of this stage of my learning. The target this time is the free-works section of the Qidian Chinese web novel site (起点中文网).

In total 100 novels were crawled, and the results were stored in two ways:
1. the novel content is written to txt files, chapter by chapter;
2. the novel content is stored in SQL Server.

The implementation logic (a spider sketch illustrating this chain follows the settings excerpt below):
1. get each book's URL from the book list page;
2. from a book's URL, get its chapters and the URL of each chapter;
3. from each chapter's URL, get the chapter's text content;
4. store the extracted text, to txt and to SQL Server.

Project code: create a Scrapy project named qidian and a spider named xiaoshuo.py.

settings.py:

BOT_NAME = 'qidian'
SPIDER_MODULES = ['qidian.spiders']
NEWSPIDER_MODULE = 'qidian.spiders'
LOG_LEVEL = "WARNING"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537
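
The post's spider code is not included in this excerpt; below is a minimal sketch of the three chained callbacks implied by steps 1-3 above. All URLs and selectors are placeholder assumptions, and step 4 (txt / SQL Server storage) would live in item pipelines:

import scrapy

class XiaoshuoSpider(scrapy.Spider):
    name = "xiaoshuo"
    start_urls = ["https://www.qidian.com/free"]  # placeholder list-page URL

    def parse(self, response):
        # step 1: book list page, one request per book
        for book_url in response.css("h4 a::attr(href)").getall():
            yield response.follow(book_url, self.parse_book)

    def parse_book(self, response):
        # step 2: book page, chapter list; keep the book title with each request
        book_name = response.css("h1 em::text").get()
        for chapter_url in response.css(".volume li a::attr(href)").getall():
            yield response.follow(chapter_url, self.parse_chapter,
                                  cb_kwargs={"book_name": book_name})

    def parse_chapter(self, response, book_name):
        # step 3: chapter page, text content
        yield {
            "book_name": book_name,
            "chapter_name": response.css("h3 .j_chapterName::text").get(),
            "content": "\n".join(response.css(".read-content p::text").getall()),
        }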

Web scraping in practice: a Scrapy spider for downloading BMW images from Autohome

丶灬走出姿态 posted on 2020-01-03 04:22:52

(1) Introduction
The Scrapy framework provides two Item Pipelines dedicated to downloading files and images: FilesPipeline and ImagesPipeline.

(2) Advantages of using Scrapy's built-in download pipelines
1. Duplicate downloads are avoided.
2. The download path is easy to specify.
3. Format conversion is easy, e.g. images can be converted to PNG or JPG.
4. Thumbnails are easy to generate.
5. Image resizing is easy.
6. Downloads are asynchronous and efficient.

(3) The more traditional way to download images with Scrapy
1. Create the project: scrapy startproject baoma, cd baoma, then create the spider: scrapy genspider spider car.autohome.com.cn
2. Open the project in PyCharm, edit settings.py (do not obey robots.txt, set the request headers), enable the pipeline in pipelines.py, and rewrite spider.py:

# -*- coding: utf-8 -*-
import scrapy
from ..items import BaomaItem

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com
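
The spider is cut off at start_urls; a rough sketch of how the traditional approach typically continues. The XPath and the item field name 'urls' are guesses, not taken from the post:

    def parse(self, response):
        # gather the image thumbnails on the page
        item = BaomaItem()
        srcs = response.xpath('//div[contains(@class, "uibox")]//img/@src').getall()
        # thumbnail URLs are often protocol-relative, so let urljoin add the scheme
        item['urls'] = [response.urljoin(src) for src in srcs]
        yield item

A pipeline (or the spider itself) would then write each URL's bytes to disk, which is the manual step that the built-in ImagesPipeline from section (1) replaces.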

Trying out Scrapy, day 02 (regex extraction)

匆匆过客 posted on 2020-01-03 04:11:20

1. Processing approaches
Approach 1: via HtmlXPathSelector

import scrapy
from scrapy.selector import HtmlXPathSelector

class DmozSpider(scrapy.Spider):
    name = "use_scrapy"  # the name used to invoke the spider
    allowed_domains = ["use_scrapy.com"]  # restrict crawling to this domain
    start_urls = [  # all URLs to crawl
        "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&sm=0&p=1"
    ]

    # parse() is called back after each page has been crawled
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        print('_________________________')
        hxsobj = hxs.select('//td[@class="zwmc"]/div/a')
        print(hxsobj[0].select("@href").extract())   # get the link
        print(hxsobj[0].select("text()").extract())  # get the text
        # .extract() returns the raw text from the page
        print(len
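
As an aside not in the original post: HtmlXPathSelector has long been deprecated, and on current Scrapy versions the same extraction is usually written directly against the response, roughly like this:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "use_scrapy"
    allowed_domains = ["use_scrapy.com"]
    start_urls = [
        "http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E5%8C%97%E4%BA%AC&kw=python&sm=0&p=1"
    ]

    def parse(self, response):
        # response.xpath() replaces HtmlXPathSelector(response).select()
        links = response.xpath('//td[@class="zwmc"]/div/a')
        if links:
            print(links[0].xpath("@href").get())   # the link
            print(links[0].xpath("text()").get())  # the text
        print(len(links))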

Scrapy + Selenium + Datepicker

给你一囗甜甜゛ posted on 2020-01-03 03:33:06

Question: I need to scrape a page like this one, for example, and I am using Scrapy + Selenium to interact with the datepicker calendar, but I am running into an ElementNotVisibleException: Message: Element is not currently visible and so may not be interacted with. So far I have:

def parse(self, response):
    self.driver.get("https://www.airbnb.pt/rooms/9315238")
    try:
        element = WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@name='checkin']"))
        )
    finally:
        x = self
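
A common cause of that exception (not from the original post) is waiting only for presence while the element is still hidden until it is clicked or scrolled into view; waiting for it to be clickable and then clicking is the usual fix. A rough sketch under those assumptions, where the datepicker XPath is a guess:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()  # any driver works; Firefox is just an example
driver.get("https://www.airbnb.pt/rooms/9315238")

wait = WebDriverWait(driver, 10)
# wait until the check-in field is actually visible and clickable, not merely present
checkin = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@name='checkin']")))
checkin.click()  # opens the datepicker widget

# then wait for a visible day cell before interacting with the calendar
# (the XPath below is a placeholder; the real datepicker markup may differ)
day = wait.until(EC.visibility_of_element_located(
    (By.XPATH, "//td[contains(@class, 'ui-datepicker')]//a")))
day.click()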